tabix for gene expression

nemoarchive / analytics

Repository for the NeMO Analytics project.

MIT License


tabix for gene expression #15

Open RLC-DCPPC opened 5 years ago

RLC-DCPPC commented 5 years ago

From NeMO-Analytics created by RLC-DCPPC : RLC-DCPPC/NeMO-Analytics#14

Placeholder to make sure we discuss this.

jkanche commented 5 years ago

@apaala, here's the link to the RNA-seq file I used to generate the tabix index (for the script I sent you):

https://obj.umiacs.umd.edu/bigwig-files/rna.txt

apaala commented 5 years ago

I have tested this out on a single file from Zeng. Is there a particular dataset we want to generate the tabix files for, and how many do we want to make?

apaala commented 5 years ago

This task requires generating a file like IDxxxx_DataMTX.tab, but with three columns prepended to it (chromosome, start position, and end position). These files need to be generated during the upload process. To map the gene IDs, we will need access to the database that is currently on the Google server. It is very important that the Ensembl release number is recorded in the metadata file. If the user does not provide it, we will determine the best-matching release number and note it in the metadata file.
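A minimal sketch of the prepending step, using only the standard library. The function names, the `gene_coords` lookup, and the gene IDs in the usage note are hypothetical, made up for illustration; in practice the lookup would be built from the gene table for a single Ensembl release. Rows are sorted by chromosome and start because tabix expects position-sorted input.

```python
import csv


def prepend_coordinates(matrix_rows, gene_coords):
    """Prepend (chromosome, start, end) to each expression row.

    matrix_rows: iterable of rows whose first element is the gene ID.
    gene_coords: dict mapping gene ID -> (chrom, start, end). This is a
                 hypothetical lookup, assumed to cover one Ensembl release.
    Returns (mapped_rows, unmapped_ids).
    """
    mapped, unmapped = [], []
    for row in matrix_rows:
        gene_id = row[0]
        coords = gene_coords.get(gene_id)
        if coords is None:
            # Keep track of genes with no coordinates instead of guessing.
            unmapped.append(gene_id)
            continue
        chrom, start, end = coords
        mapped.append([chrom, start, end] + list(row))
    # Group rows by chromosome and order them by start position.
    mapped.sort(key=lambda r: (r[0], r[1]))
    return mapped, unmapped


def write_tab(rows, handle):
    """Write rows as tab-separated text to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
```

For example, with `gene_coords = {"ENSG_B": ("chr1", 100, 300)}` and a row `["ENSG_B", 0.0, 4.2]`, the output line would start with `chr1`, `100`, `300`, followed by the original row. The resulting file could then be compressed with bgzip and indexed with something like `tabix -s 1 -b 2 -e 3`.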

jorvis commented 5 years ago

@jkanche, I've put an example h5ad file and a MySQL dump of the gene table on Dropbox here:

https://www.dropbox.com/sh/8p0rrk3ic8ak9rr/AACg1DK8PREyCP9-Fnsc43dqa?dl=0

Note that the table stores many different releases of each gene, so the release number is necessary. This particular dataset is release 93. If you can use this to create an example tabix file, @apaala can take over from there. Thanks.

jkanche commented 5 years ago

@jorvis I'm looking at the gene table from the SQL file. If I search by Ensembl ID or gene symbol, do you always pick the latest Ensembl version for a match?

jorvis commented 5 years ago

First, if I know which Ensembl release the dataset was annotated with, I use that one. If I don't, I take the pool of gene symbols from the dataset and compare it with the pool of gene symbols from each Ensembl release for that organism. Whichever release has the most overlap is the one I choose.
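The overlap comparison described above could be sketched roughly as follows. The function name and the shape of `symbols_by_release` are hypothetical, and breaking ties in favor of the newer release is an assumption that is not stated in this thread.

```python
def best_release(dataset_symbols, symbols_by_release):
    """Pick the Ensembl release whose gene symbols overlap the dataset most.

    dataset_symbols:    iterable of gene symbols found in the dataset.
    symbols_by_release: dict mapping release number -> iterable of gene
                        symbols for one organism (hypothetical structure).
    Returns (release, overlap_count), or None if no releases are given.
    Ties go to the higher release number (an assumption).
    """
    dataset = set(dataset_symbols)
    best = None
    for release, symbols in symbols_by_release.items():
        overlap = len(dataset & set(symbols))
        # Compare by overlap first, then by release number for ties.
        if best is None or (overlap, release) > (best[1], best[0]):
            best = (release, overlap)
    return best
```

The chosen release number would then be the one recorded in the dataset's metadata file.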

jkanche commented 5 years ago

@jorvis & @apaala

I wrote a Python script that reads the h5ad file and maps the gene names to genomic positions. The script can use either Ensembl IDs or gene symbols for the mapping. I think the order should be:

1. Use the Ensembl version, if known.
2. If Ensembl IDs are available, use them to do the mapping first.
3. Otherwise, use gene symbols.

I posted the results for matching by gene symbols and by Ensembl IDs. Link to the gist: https://gist.github.com/jkanche/f52d0b058bc9676f9fe0fab480d1860d
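As an illustration of the lookup order proposed above (this is not the script from the gist), a fallback from Ensembl ID to gene symbol might look like the sketch below. The function name and both lookup dictionaries are hypothetical and are assumed to hold coordinates for a single, already chosen Ensembl release.

```python
def map_gene(ensembl_id, gene_symbol, coords_by_id, coords_by_symbol):
    """Look up coordinates by Ensembl ID first, then by gene symbol.

    coords_by_id / coords_by_symbol: dicts mapping an identifier to
    (chrom, start, end) for one Ensembl release (hypothetical lookups).
    Returns (coords, source), where source records which key matched.
    """
    if ensembl_id and ensembl_id in coords_by_id:
        return coords_by_id[ensembl_id], "ensembl_id"
    if gene_symbol and gene_symbol in coords_by_symbol:
        return coords_by_symbol[gene_symbol], "gene_symbol"
    # Neither identifier matched this release.
    return None, "unmapped"
```

Returning the match source alongside the coordinates makes it easy to report how many genes were mapped by each method.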

apaala commented 5 years ago

@jkanche I will play with this script as soon as I get a chance. Thanks for sharing it!