scverse / mudata

Multimodal Data (.h5mu) implementation for Python
https://mudata.rtfd.io
BSD 3-Clause "New" or "Revised" License

Combining MuData's – concat function #20

Closed ivirshup closed 2 months ago

ivirshup commented 2 years ago

It's possible I'm not seeing them, but should there be concat (like anndata.concat) functionality here?

Maybe also merge (like https://github.com/theislab/anndata/issues/658)? But that could be a separate issue.

cc36 commented 2 years ago

Hello,

I wanted to ask what is the best way to combine several samples in a MuData object and it seems like this existing issue points in that direction.

The approach I usually take for combining multiple AnnData objects does not seem to work here:

holder = []

for n in folders:
    holder.append(mu.read_10x_h5("/home/jovyan/data/Multiome/DNAP/"+n+"/filtered_feature_bc_matrix.h5"))

adata = holder[0].concatenate(holder[1:], join='outer', index_unique=None)

Any help would be highly appreciated.

Thanks!

ivirshup commented 2 years ago

I think the general approach would be to deconstruct each MuData into its constituent AnnData objects, concatenate those per modality with anndata.concat, and then put the results into a new MuData.

@bio-la, did you have a function working here that you could share?

cc36 commented 2 years ago

Thanks. I have tried the approach suggested, i.e. deconstructing into the constituent AnnData objects and concatenating those and it works well except that the AnnData.uns['files'] and AnnData.uns['atac'] information is lost in the concatenation.

I have tried using the uns_merge argument from the AnnData.concatenate function (https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.concatenate.html#anndata.AnnData.concatenate) but it does not seem to help in this case.

Do you have any suggestion for this?

Thank you in advance!

ivirshup commented 2 years ago

I think this gets a bit more complicated. I'm unsure if there's going to be a good way to do this that plays well with muon.atac, though @mffrank or @gtca would be able to comment better.

I'm assuming you want to use the data in those fields downstream. How would you want those fields to be merged?

bio-la commented 2 years ago

@cc36 Why are you trying to concatenate multiple ATAC AnnData/MuData objects? I'm assuming you are talking about the atac.uns.xxx slots that are filled with fragments and peaks files when reading any single 10x multiome run with mu.read_10x_h5. Unless you have called peaks jointly on the original samples, it doesn't make sense to concatenate peaks and files from separate folders. I am not sure which analytical tool would let you call peaks from multiple samples using the same background fragment distribution and still output separate 10x folders (samples). Normally, at the end of the aggregation step (joint peak calling), you would have one count matrix, one fragment matrix, one peak file, and so on.

So, the behaviour you describe (losing those peaks and fragment files) is actually preventing you from doing something that would give you a false peak distribution per sample. I may be missing something here; could you please expand on what exactly you are trying to do by concatenating multiple ATAC (multiome) AnnData/MuData objects? Thanks!

cc36 commented 2 years ago

@bio-la Thanks for your reply. You are right, I need to use the joint peak calling output, which I have not done and will now do. You can resolve this issue. Thanks a lot for your help!

Zethson commented 2 years ago

(Fat fingers, sorry)

sruthi-hub commented 1 year ago

I am new to working with scATAC-seq. I would appreciate it if one of you (@cc36, @bio-la, @ivirshup) could share a few lines of code that ensure there is no false peak distribution. Thanks!

gtca commented 1 year ago

@sruthi-hub Hey, if this question is still relevant, could you elaborate on what the false peak distribution actually means? If this is about peak properties, they can be quantified and visualised as for instance shown in this tutorial.

ChaseTaylor939 commented 1 year ago

I'm having a similar issue when I try to concatenate two different multiome datasets. The RNA concatenates just fine, but the ATAC loses lots of metadata when I concatenate and the n_vars goes down to 13. I'm sorry, but I do not understand what @bio-la meant in their earlier explanation. Could someone provide some code on how they combine two or more multiome datasets?

Thank you!

[Attached screenshots: 9164-CT-1_Integration_01, Multiome_ATAC_Concat_02]

gtca commented 1 year ago

Hey @ChaseTaylor939,

Concatenation is performed as described with inner join (for features) by default:

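import numpy as np
import anndata
from anndata import AnnData
# Two objects with the same number of observations but different numbers of features
mod1 = AnnData(np.random.normal(size=(10, 5)))
mod2 = AnnData(np.random.normal(size=(10, 3)))
mod2.var_names
# Index(['0', '1', '2'], dtype='object')
# The default join="inner" keeps only the var_names shared by both objects
anndata.concat([mod1, mod2]).shape
# => (20, 3)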

I assume peaks were called individually for each dataset (m9164_atac and m9412_atac), so 13 is the number of peaks that happen to have exactly the same definition (chrN:XXX-YYY) across the samples. For peak-based analysis, peaks have to be either called jointly or merged across samples with special procedures.

aichander commented 1 year ago

+1 to having some inbuilt functionality that lets us concatenate 2 mudata objects with shared indices.

lijxug commented 11 months ago

Any progress on this issue? Or should we do what ivirshup suggested?

gtca commented 11 months ago

Scheduled for mudata v0.3, which is in progress (https://github.com/scverse/mudata/pull/56), @lijxug!

Just to make it clear, this is about concatenation as in anndata.concat, which is not aware of genomic intervals, etc.

gtca commented 2 months ago

Concatenation based on anndata.concat should work as of v0.3. This is a new API, though, so please report any issues with it!