scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.89k stars 594 forks source link

Export expression matrix from h5ad #262

Closed cartal closed 5 years ago

cartal commented 6 years ago

Hi,

I would like to extract the expression matrix (genes and counts) from a h5ad file. How can I do this? I have searched the documentation but I couldn't find anything about this (maybe I missed it).

LuckyMD commented 6 years ago

If you want to extract it in python, you can load the h5ad file using adata = sc.read(filename) and then use adata.X, which is the expression matrix.

To extract the matrix into R, you can use the rhdf5 library. That's a bit more complicated as there was a recent update to this library I believe. Note that you need to transpose the expression matrix from python into R due to different conventions (R expects a genes x cells matrix, python a cells x genes matrix).

An alternative to the rhdf5 library is to just save the expression matrix via numpy.savetxt() to save it, for example, as a space-delimited file.

I hope this helps.

cartal commented 6 years ago

Hi, this is very useful. But I think I formulated my question the wrong way.

Can I export the h5ad file to a standard 10X h5 file?

LuckyMD commented 6 years ago

That, I can't help you with I'm afraid. I'm not as familiar with the h5ad format.

maximilianh commented 6 years ago

This is an interesting question... I just googled it and found only examples in R, but that may be good enough for you: https://rdrr.io/github/MarioniLab/DropletUtils/man/write10xCounts.html

falexwolf commented 6 years ago

Hi @cartal, it wouldn't be very hard to export to a 10x h5 file, but I'd need to write a custom function for it. Why is it needed? Does 10x offer any downstream analysis that you'd want to use on the data? I thought there are none, hence there is only sc.read_10x_h5 and no sc.write_10x_h5.

cartal commented 5 years ago

Hi, I'm sorry I forgot about this trend, I just stumble into the same issue.

Lets say I have done my analysis in scanpy and everything is good and nice, but now I want to run, say, the cluster 10 from the louvain subset, with Palantir. Palantir can read 10X and 10X_H5 files. Is there a way to plug-and-play this with scanpy?

In another case, if I want to extract the subset expression matrix, where rows are genes (with rownames as gene symbols) and columns are cells (with colnames as cells), so I can use this with SCENIC. How could I get this from the scanpy adata?

I apologise in advance if I'm asking something very basic, but it will be really nice to have some sort of interconectivity between tools, since scanpy is so nice to have as a major analysis suite.

falexwolf commented 5 years ago

You can always export as a .csv file and read that into other tools using adata.write_csvs(filename, skip_data=False). You can call adata.T for transposing before.

I can imagine that Palantir would also accept AnnData objects, you could make an issue there. Also, have you tried tl.paga for trajectory inference, the paper is here?

cartal commented 5 years ago

Hi, I am trying PAGA. Thanks a lot for the help.

aditisk commented 5 years ago

@falexwolf @cartal @LuckyMD I am also trying to export a gene by cell expression file. I tried using adata.write_csvs(filename, skip_data=False) but that wrote the output to multiple files. Is there a way to generate a single file with genes as rows (with gene names as row IDs) and cells as columns (barcodes as column IDs) ?

Thank you in advance for your help.

falexwolf commented 5 years ago

No, there is no way to produce a single file with data and metadata. Having genes as rows can simply be achieved by transposing the matrix (adata.T.write_csvs(...)).

maximilianh commented 5 years ago

In exporting.cellbrowser the cellbrowser export function creates two files, one with the expression matrix (genes on rows, cells as columns) and one for the cell metadata. That may be what you're looking for?

aditisk commented 5 years ago

@falexwolf thanks for the feedback. As @maximilianh suggested, I was able to export the expression matrix from the cellbrowser export function. Thank you for your help.

maximilianh commented 5 years ago

Nice to hear that it worked. As a side product, you can now create a cell browser html directory from the generated directory.

On Thu, Feb 7, 2019 at 10:21 AM aditisk notifications@github.com wrote:

@falexwolf https://github.com/falexwolf thanks for the feedback. As @maximilianh https://github.com/maximilianh suggested, I was able to export the expression matrix from the cellbrowser export function. Thank you for your help.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/theislab/scanpy/issues/262#issuecomment-461540169, or mute the thread https://github.com/notifications/unsubscribe-auth/AAS-TXGCkdaQWO8ks_x7uOm-P2_ISArRks5vLG6RgaJpZM4Wne7Z .

aditisk commented 5 years ago

@maximilianh I was able to use the cell browser export function in the past but this time I am getting an error message:

INFO:root:Writing scanpy matrix to adata_cellbrowser_04_01_19_CD8_subclustered/exprMatrix.tsv.gz INFO:root:Transposing matrix INFO:root:Writing gene-by-gene, without using pandas INFO:root:Writing 8068 genes in total INFO:root:Wrote 0 genes INFO:root:Wrote 2000 genes INFO:root:Wrote 4000 genes INFO:root:Wrote 6000 genes INFO:root:Wrote 8000 genes INFO:root:Writing UMAP coords to adata_cellbrowser_04_01_19_CD8_subclustered/umap_coords.tsv ERROR:root:Couldnt find cluster markers list

I am using an h5ad file to import my ann data object. Is that why there is some issue with finding cluster markers ? I am able to plot the clusters in a UMAP plot so I know that the 'louvain' observation exists. Any thoughts on why this is happening ?

Thanks.

LuckyMD commented 5 years ago

Just a thought... have you run sc.tl.rank_genes_groups() to obtain cluster markers?

maximilianh commented 5 years ago

Yes, it seems that I should change this to a warning. People may not want cluster-specific marker genes in their browser. Do you agree @aditisk ?

aditisk commented 5 years ago

@LuckyMD I did not run sc.tl.rank_genes_groups() which was the problem.

@maximilianh I think it should be optional to include the cluster-specific markers so maybe keeping it as a warning might be the best ? That way the user has control on what they want to include.

flying-sheep commented 5 years ago

@maximilianh I think those messages are from your code? maybe you should improve the error message to include something like

Try running sc.tl.rank_genes_groups(adata) to create the cluster annotation

hejing3283 commented 5 years ago

Hi, the expression matrix I exported from adata.write only have the top variable genes. Is there a way to output the raw matrix including all genes?

maximilianh commented 5 years ago

The scanpyToCellbrowser function has an option useRaw that will use the .raw matrix, if present, for the .tsv export.

Otherwise, the raw matrix of all genes is stored as ad.raw.X and the variable names are in ad.raw.var. You can use scanpyToCellbrowser to write the matrix and all annotations, or anndataToTsv to write just the matrix. Or use code from there to write your own.

On Fri, May 31, 2019 at 5:14 PM Jing He notifications@github.com wrote:

Hi, the expression matrix I exported from adata.write only have the top variable genes. Is there a way to output the raw matrix including all genes?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/theislab/scanpy/issues/262?email_source=notifications&email_token=AACL4TNOFS6MLIH44P6J5HDPYE6ENA5CNFSM4FU553M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWVQBGI#issuecomment-497746073, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TORHPOQ2GTWTUGTAI3PYE6ENANCNFSM4FU553MQ .

hejing3283 commented 5 years ago

Thanks for the suggestion. Actually, I am using cellxgene which takes the h5ad file as an input. when using anndata.write() function, it only output the anndata.X as the expression matrix. And also there is no option of useRaw here. Also, I tried to re-assign anndata.X = anndata.raw.X, but it returns an error saying its wrong shape. Do you have any suggestions?

Thanks a lot!

On Mon, Jun 3, 2019 at 6:03 AM Maximilian Haeussler < notifications@github.com> wrote:

The scanpyToCellbrowser function has an option useRaw that will use the .raw matrix, if present, for the .tsv export.

Otherwise, the raw matrix of all genes is stored as ad.raw.X and the variable names are in ad.raw.var. You can use scanpyToCellbrowser to write the matrix and all annotations, or anndataToTsv to write just the matrix. Or use code from there to write your own.

On Fri, May 31, 2019 at 5:14 PM Jing He notifications@github.com wrote:

Hi, the expression matrix I exported from adata.write only have the top variable genes. Is there a way to output the raw matrix including all genes?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/theislab/scanpy/issues/262?email_source=notifications&email_token=AACL4TNOFS6MLIH44P6J5HDPYE6ENA5CNFSM4FU553M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWVQBGI#issuecomment-497746073 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AACL4TORHPOQ2GTWTUGTAI3PYE6ENANCNFSM4FU553MQ

.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/theislab/scanpy/issues/262?email_source=notifications&email_token=AAUAIIIOXG5HSDCKTFYS7KLPYTT6BA5CNFSM4FU553M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWY5RAA#issuecomment-498194560, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUAIIM2RYJQTSDTKQ4XZLTPYTT6BANCNFSM4FU553MQ .

-- Cheers! Jing

LuckyMD commented 5 years ago

Hi @hejing3283,

The wrong shape is probably because you have subsetted adata.X to highly variable genes, or did some additional filtering after storing data in adata.raw. For a while now scanpy avoids filtering highly variable genes, but instead annotates them in adata.var['highly_variable'] which is then used in sc.pp.pca(). I would suggest you use subset=False next time you use sc.pp.highly_variable() to avoid different dimensions in adata.X and adata.raw.X.

You can easily proceed by just making a new anndata object from adata.raw.X, adata.raw.var and adata.raw.obs and storing this to be loaded into cellxgene. Just do the following:

adata_raw = sc.AnnData(X=adata.raw.X, obs=adata.raw.obs, var=adata.raw.var)
adata_raw.write(my_file)
hejing3283 commented 5 years ago

Hi,

Thanks so much for the explanations! Doing it now and it works.

Best, Jing

On Jun 5, 2019, at 10:39, MalteDLuecken notifications@github.com wrote:

Hi @hejing3283,

The wrong shape is probably because you have subsetted adata.X to highly variable genes, or did some additional filtering after storing data in adata.raw. For a while now scanpy avoids filtering highly variable genes, but instead annotates them in adata.var['highly_variable'] which is then used in sc.pp.pca(). I would suggest you use subset=False next time you use sc.pp.highly_variable() to avoid different dimensions in adata.X and adata.raw.X.

You can easily proceed by just making a new anndata object from adata.raw.X, adata.raw.var and adata.raw.obs and storing this to be loaded into cellxgene. Just do the following:

adata_raw.write(my_file) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

ekernf01 commented 4 years ago

Here's my solution to a similar inter-operability hiccup. This produces files similar to 10X v2 triplet format, plus an extra cell metadata file.

pd.DataFrame(ad.var.index).to_csv(os.path.join(destination, "genes.tsv" ),   sep = "\t", index_col = False)
pd.DataFrame(ad.obs.index).to_csv(os.path.join(destination, "barcodes.tsv"), sep = "\t", index_col = False)
ad.obs.to_csv(os.path.join(destination, "metadata.tsv"), sep = "\t", index_col = True)
scipy.io.mmwrite(os.path.join(destination, "matrix.mtx"), ad.X)
mariafiruleva commented 4 years ago

You can convert h5ad format to Seurat object using sceasy.

chris-rands commented 4 years ago

To write a genes (rows) vs. cells (columns) matrix, i tried this

adata.T.to_df().to_csv('matrix.csv')
shizhiwen1990 commented 1 year ago

Use this code: import datatable as dt X = pd.DataFrame(adata.X.toarray().T,columns=adata.obs_names) Symbol=dt.Frame({'Symbol':adata.var_names.values}) X = dt.cbind([Symbol,dt.Frame(X)]) X.to_csv('X.csv')

apredeus commented 1 year ago
scipy.io.mmwrite

This code doesn't actually work - rows and columns are switched in the matrix, and it produces an error when you try to read in the output using either Scanpy or Seurat wrapper functions. Perhaps it's a package version thing though..

indapa commented 1 year ago
scipy.io.mmwrite

This code doesn't actually work - rows and columns are switched in the matrix, and it produces an error when you try to read in the output using either Scanpy or Seurat wrapper functions. Perhaps it's a package version thing though..

I was having the same issue as well. I ended up doing what was suggested above:

adata.T.to_df().to_csv('matrix.csv')

pcm32 commented 5 months ago

We have this method in scanpy-scripts https://github.com/ebi-gene-expression-group/scanpy-scripts/blob/6297be21119d6964e074fa0b40a3b6fcaec53bbc/scanpy_scripts/cmd_utils.py#L137 - you could as well use it just from there with one of the containers https://quay.io/repository/biocontainers/scanpy-scripts?tab=tags&tag=latest I think it can be used through the filtering CLI call, given numbers that won't filter anything out.