Closed cartal closed 5 years ago
If you want to extract it in python, you can load the h5ad file using adata = sc.read(filename)
and then use adata.X
, which is the expression matrix.
To extract the matrix into R, you can use the rhdf5
library. That's a bit more complicated as there was a recent update to this library I believe. Note that you need to transpose the expression matrix from python into R due to different conventions (R expects a genes x cells matrix, python a cells x genes matrix).
An alternative to the rhdf5
library is to just save the expression matrix via numpy.savetxt()
to save it, for example, as a space-delimited file.
I hope this helps.
Hi, this is very useful. But I think I formulated my question the wrong way.
Can I export the h5ad file to a standard 10X h5 file?
That, I can't help you with I'm afraid. I'm not as familiar with the h5ad format.
This is an interesting question... I just googled it and found only examples in R, but that may be good enough for you: https://rdrr.io/github/MarioniLab/DropletUtils/man/write10xCounts.html
Hi @cartal, it wouldn't be very hard to export to a 10x h5 file, but I'd need to write a custom function for it. Why is it needed? Does 10x offer any downstream analysis that you'd want to use on the data? I thought there are none, hence there is only sc.read_10x_h5
and no sc.write_10x_h5
.
Hi, I'm sorry I forgot about this trend, I just stumble into the same issue.
Lets say I have done my analysis in scanpy and everything is good and nice, but now I want to run, say, the cluster 10 from the louvain subset, with Palantir. Palantir can read 10X and 10X_H5 files. Is there a way to plug-and-play this with scanpy?
In another case, if I want to extract the subset expression matrix, where rows are genes (with rownames as gene symbols) and columns are cells (with colnames as cells), so I can use this with SCENIC. How could I get this from the scanpy adata
?
I apologise in advance if I'm asking something very basic, but it will be really nice to have some sort of interconectivity between tools, since scanpy is so nice to have as a major analysis suite.
You can always export as a .csv
file and read that into other tools using adata.write_csvs(filename, skip_data=False)
. You can call adata.T
for transposing before.
I can imagine that Palantir would also accept AnnData
objects, you could make an issue there. Also, have you tried tl.paga
for trajectory inference, the paper is here?
Hi, I am trying PAGA. Thanks a lot for the help.
@falexwolf @cartal @LuckyMD I am also trying to export a gene by cell expression file. I tried using adata.write_csvs(filename, skip_data=False) but that wrote the output to multiple files. Is there a way to generate a single file with genes as rows (with gene names as row IDs) and cells as columns (barcodes as column IDs) ?
Thank you in advance for your help.
No, there is no way to produce a single file with data and metadata. Having genes as rows can simply be achieved by transposing the matrix (adata.T.write_csvs(...)
).
In exporting.cellbrowser the cellbrowser export function creates two files, one with the expression matrix (genes on rows, cells as columns) and one for the cell metadata. That may be what you're looking for?
@falexwolf thanks for the feedback. As @maximilianh suggested, I was able to export the expression matrix from the cellbrowser export function. Thank you for your help.
Nice to hear that it worked. As a side product, you can now create a cell browser html directory from the generated directory.
On Thu, Feb 7, 2019 at 10:21 AM aditisk notifications@github.com wrote:
@falexwolf https://github.com/falexwolf thanks for the feedback. As @maximilianh https://github.com/maximilianh suggested, I was able to export the expression matrix from the cellbrowser export function. Thank you for your help.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/theislab/scanpy/issues/262#issuecomment-461540169, or mute the thread https://github.com/notifications/unsubscribe-auth/AAS-TXGCkdaQWO8ks_x7uOm-P2_ISArRks5vLG6RgaJpZM4Wne7Z .
@maximilianh I was able to use the cell browser export function in the past but this time I am getting an error message:
INFO:root:Writing scanpy matrix to adata_cellbrowser_04_01_19_CD8_subclustered/exprMatrix.tsv.gz INFO:root:Transposing matrix INFO:root:Writing gene-by-gene, without using pandas INFO:root:Writing 8068 genes in total INFO:root:Wrote 0 genes INFO:root:Wrote 2000 genes INFO:root:Wrote 4000 genes INFO:root:Wrote 6000 genes INFO:root:Wrote 8000 genes INFO:root:Writing UMAP coords to adata_cellbrowser_04_01_19_CD8_subclustered/umap_coords.tsv ERROR:root:Couldnt find cluster markers list
I am using an h5ad file to import my ann data object. Is that why there is some issue with finding cluster markers ? I am able to plot the clusters in a UMAP plot so I know that the 'louvain' observation exists. Any thoughts on why this is happening ?
Thanks.
Just a thought... have you run sc.tl.rank_genes_groups()
to obtain cluster markers?
Yes, it seems that I should change this to a warning. People may not want cluster-specific marker genes in their browser. Do you agree @aditisk ?
@LuckyMD I did not run sc.tl.rank_genes_groups() which was the problem.
@maximilianh I think it should be optional to include the cluster-specific markers so maybe keeping it as a warning might be the best ? That way the user has control on what they want to include.
@maximilianh I think those messages are from your code? maybe you should improve the error message to include something like
Try running sc.tl.rank_genes_groups(adata) to create the cluster annotation
Hi, the expression matrix I exported from adata.write only have the top variable genes. Is there a way to output the raw matrix including all genes?
The scanpyToCellbrowser function has an option useRaw that will use the .raw matrix, if present, for the .tsv export.
Otherwise, the raw matrix of all genes is stored as ad.raw.X and the variable names are in ad.raw.var. You can use scanpyToCellbrowser to write the matrix and all annotations, or anndataToTsv to write just the matrix. Or use code from there to write your own.
On Fri, May 31, 2019 at 5:14 PM Jing He notifications@github.com wrote:
Hi, the expression matrix I exported from adata.write only have the top variable genes. Is there a way to output the raw matrix including all genes?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/theislab/scanpy/issues/262?email_source=notifications&email_token=AACL4TNOFS6MLIH44P6J5HDPYE6ENA5CNFSM4FU553M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWVQBGI#issuecomment-497746073, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TORHPOQ2GTWTUGTAI3PYE6ENANCNFSM4FU553MQ .
Thanks for the suggestion. Actually, I am using cellxgene which takes the h5ad file as an input. when using anndata.write() function, it only output the anndata.X as the expression matrix. And also there is no option of useRaw here. Also, I tried to re-assign anndata.X = anndata.raw.X, but it returns an error saying its wrong shape. Do you have any suggestions?
Thanks a lot!
On Mon, Jun 3, 2019 at 6:03 AM Maximilian Haeussler < notifications@github.com> wrote:
The scanpyToCellbrowser function has an option useRaw that will use the .raw matrix, if present, for the .tsv export.
Otherwise, the raw matrix of all genes is stored as ad.raw.X and the variable names are in ad.raw.var. You can use scanpyToCellbrowser to write the matrix and all annotations, or anndataToTsv to write just the matrix. Or use code from there to write your own.
On Fri, May 31, 2019 at 5:14 PM Jing He notifications@github.com wrote:
Hi, the expression matrix I exported from adata.write only have the top variable genes. Is there a way to output the raw matrix including all genes?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/theislab/scanpy/issues/262?email_source=notifications&email_token=AACL4TNOFS6MLIH44P6J5HDPYE6ENA5CNFSM4FU553M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWVQBGI#issuecomment-497746073 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AACL4TORHPOQ2GTWTUGTAI3PYE6ENANCNFSM4FU553MQ
.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/theislab/scanpy/issues/262?email_source=notifications&email_token=AAUAIIIOXG5HSDCKTFYS7KLPYTT6BA5CNFSM4FU553M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWY5RAA#issuecomment-498194560, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUAIIM2RYJQTSDTKQ4XZLTPYTT6BANCNFSM4FU553MQ .
-- Cheers! Jing
Hi @hejing3283,
The wrong shape is probably because you have subsetted adata.X
to highly variable genes, or did some additional filtering after storing data in adata.raw
. For a while now scanpy avoids filtering highly variable genes, but instead annotates them in adata.var['highly_variable']
which is then used in sc.pp.pca()
. I would suggest you use subset=False
next time you use sc.pp.highly_variable()
to avoid different dimensions in adata.X
and adata.raw.X
.
You can easily proceed by just making a new anndata object from adata.raw.X
, adata.raw.var
and adata.raw.obs
and storing this to be loaded into cellxgene. Just do the following:
adata_raw = sc.AnnData(X=adata.raw.X, obs=adata.raw.obs, var=adata.raw.var)
adata_raw.write(my_file)
Hi,
Thanks so much for the explanations! Doing it now and it works.
Best, Jing
On Jun 5, 2019, at 10:39, MalteDLuecken notifications@github.com wrote:
Hi @hejing3283,
The wrong shape is probably because you have subsetted adata.X to highly variable genes, or did some additional filtering after storing data in adata.raw. For a while now scanpy avoids filtering highly variable genes, but instead annotates them in adata.var['highly_variable'] which is then used in sc.pp.pca(). I would suggest you use subset=False next time you use sc.pp.highly_variable() to avoid different dimensions in adata.X and adata.raw.X.
You can easily proceed by just making a new anndata object from adata.raw.X, adata.raw.var and adata.raw.obs and storing this to be loaded into cellxgene. Just do the following:
adata_raw.write(my_file) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
Here's my solution to a similar inter-operability hiccup. This produces files similar to 10X v2 triplet format, plus an extra cell metadata file.
pd.DataFrame(ad.var.index).to_csv(os.path.join(destination, "genes.tsv" ), sep = "\t", index_col = False)
pd.DataFrame(ad.obs.index).to_csv(os.path.join(destination, "barcodes.tsv"), sep = "\t", index_col = False)
ad.obs.to_csv(os.path.join(destination, "metadata.tsv"), sep = "\t", index_col = True)
scipy.io.mmwrite(os.path.join(destination, "matrix.mtx"), ad.X)
You can convert h5ad format to Seurat object using sceasy.
To write a genes (rows) vs. cells (columns) matrix, i tried this
adata.T.to_df().to_csv('matrix.csv')
Use this code: import datatable as dt X = pd.DataFrame(adata.X.toarray().T,columns=adata.obs_names) Symbol=dt.Frame({'Symbol':adata.var_names.values}) X = dt.cbind([Symbol,dt.Frame(X)]) X.to_csv('X.csv')
scipy.io.mmwrite
This code doesn't actually work - rows and columns are switched in the matrix, and it produces an error when you try to read in the output using either Scanpy
or Seurat
wrapper functions. Perhaps it's a package version thing though..
scipy.io.mmwrite
This code doesn't actually work - rows and columns are switched in the matrix, and it produces an error when you try to read in the output using either
Scanpy
orSeurat
wrapper functions. Perhaps it's a package version thing though..
I was having the same issue as well. I ended up doing what was suggested above:
adata.T.to_df().to_csv('matrix.csv')
We have this method in scanpy-scripts https://github.com/ebi-gene-expression-group/scanpy-scripts/blob/6297be21119d6964e074fa0b40a3b6fcaec53bbc/scanpy_scripts/cmd_utils.py#L137 - you could as well use it just from there with one of the containers https://quay.io/repository/biocontainers/scanpy-scripts?tab=tags&tag=latest I think it can be used through the filtering CLI call, given numbers that won't filter anything out.
Hi,
I would like to extract the expression matrix (genes and counts) from a h5ad file. How can I do this? I have searched the documentation but I couldn't find anything about this (maybe I missed it).