AnnData conversion no longer grabs var['gene_symbols'] to get gene symbols

maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org

GNU General Public License v3.0

AnnData conversion no longer grabs var['gene_symbols'] to get gene symbols #216

Closed pcm32 closed 1 year ago

pcm32 commented 3 years ago

This stopped working somewhere between 0.5.x and 1.0.0. I added this feature in #118, but it seems to have been reverted. Can we please reinstate the functionality? It is relevant for our Single Cell Expression Atlas AnnData files.

pcm32 commented 3 years ago

I started seeing this when adding Scanpy and Pandas to the bioconda deps, probably because the code path is different when pandas is installed.

maximilianh commented 3 years ago

Are you using the raw values from the matrix or the processed values? Is it possible that ad.raw.var doesn't have the gene symbols?

pcm32 commented 3 years ago

I have checked, and the AnnData does have var['gene_symbols']... not sure about raw.var...

pcm32 commented 3 years ago

How do I tell it to use the raw values? I always assumed it was using the processed values (cbScanpy...).

maximilianh commented 3 years ago

This is the code and it seems to get gene_symbols:

    # when reading 10X files, read_h5 puts the geneIds into a separate field
    # and uses only the symbol. We prefer ENSGxxxx|<symbol> as the gene ID string
    if "gene_ids" in var:
        genes = geneSeriesToStrings(var["gene_ids"], indexFirst=False)
    elif "gene_symbols" in var:
        genes = geneSeriesToStrings(var["gene_symbols"], indexFirst=True)
    elif "Accession" in var: # only seen this in the ABA Loom files
        genes = geneSeriesToStrings(var["Accession"], indexFirst=False)
    else:
        genes = var.index.tolist()

Is it possible that your object contains a gene_ids slot? If so, that branch is taken first and gene_symbols is never read. I have a feeling this is the same problem as the other issue that you opened. Should we close this?
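To check which branch a given object would hit, the if/elif order quoted above can be mirrored in a few lines. This is only a sketch: `pick_gene_field` and `GENE_FIELD_PRIORITY` are hypothetical names made up for illustration, not part of cellBrowser, and the function takes a plain list of column names so it runs without AnnData installed.

```python
# Illustrative sketch only: mirrors the if/elif order of the snippet above.
# pick_gene_field and GENE_FIELD_PRIORITY are hypothetical, not cellBrowser API.
GENE_FIELD_PRIORITY = ("gene_ids", "gene_symbols", "Accession")


def pick_gene_field(columns):
    """Return the first var column the quoted branch order would use,
    or None when the code would fall back to var.index."""
    available = set(columns)
    for field in GENE_FIELD_PRIORITY:
        if field in available:
            return field
    return None


# gene_ids wins over gene_symbols when both columns are present.
print(pick_gene_field(["gene_ids", "gene_symbols"]))
print(pick_gene_field(["gene_symbols", "n_cells"]))
print(pick_gene_field(["n_cells"]))
```

With a real AnnData object `ad`, one could presumably pass `ad.var.columns` (and `ad.raw.var.columns`, if `ad.raw` is set) to see whether a `gene_ids` column is shadowing `gene_symbols`.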

maximilianh commented 3 years ago

As for the other question: you're right, it uses the processed values by default. I didn't mean that it defaults to raw.

Right now, the function uses the raw values only if you force it to:

    anndataMatrixToTsv(ad, matFname, usePandas=False, useRaw=False)

Because this option is not something most people want, it's not exposed on the Unix command line yet. I was asking in case you're calling it from Python yourself.
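For readers unsure what the raw/processed distinction means for gene symbols, here is a minimal sketch of the selection logic. It assumes the usual AnnData layout, where `ad.X` and `ad.var` hold the processed values and `ad.raw`, when set, carries its own `X` and `var`. `select_matrix` is a hypothetical helper, not cellBrowser's implementation, and the `SimpleNamespace` objects are stand-ins for a real AnnData.

```python
from types import SimpleNamespace


def select_matrix(ad, use_raw=False):
    """Return (matrix, var) from ad.raw when requested and available,
    otherwise from the processed slots ad.X / ad.var.
    Hypothetical helper for illustration, not cellBrowser code."""
    raw = getattr(ad, "raw", None)
    if use_raw and raw is not None:
        # raw has its own var table, which may lack columns such as
        # gene_symbols even when ad.var has them
        return raw.X, raw.var
    return ad.X, ad.var


# Stand-in object with the same attribute names as an AnnData.
ad = SimpleNamespace(
    X="processed matrix",
    var=["gene_symbols"],
    raw=SimpleNamespace(X="raw matrix", var=[]),
)

matrix, var = select_matrix(ad, use_raw=True)
print(matrix, "gene_symbols" in var)
```

The point of the sketch is that the gene annotation travels with whichever matrix is chosen, so a `gene_symbols` column present in `ad.var` does not help if the export reads `ad.raw.var`.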

maximilianh commented 2 years ago

I believe that this is solved now, is this correct? Can we close this ticket?

maximilianh commented 2 years ago

Hey @pcm32, in PR https://github.com/maximilianh/cellBrowser/pull/231 by @redst4r we're discussing moving everywhere to the .mtx.gz format by default. I'm tending to stick with .tsv.gz for now, but add an option so that .mtx.gz is used if you specify "-f mtx". Any thoughts?

maximilianh commented 2 years ago

In your pipelines do you have assumptions about the name of the output file?

matthewspeir commented 2 years ago

@pcm32 and @maximilianh, is there anything else that needs to be done here? Or can we close this?

maximilianh commented 2 years ago

cbImportScanpy now has this option, which I think answers @pcm32's question:

    --proc    when exporting, do not use the raw input data, instead use the normalized and corrected matrix scanpy. This has no effect if the anndata.raw attribute is not used in the anndata object

matthewspeir commented 1 year ago

Sounds good! We'll close this for now.