grst / single_cell_data_integration


get raw data (FASTQ) and process using dropSeqPipe #4

Closed grst closed 5 years ago

grst commented 5 years ago

It's probably worth reprocessing everything:

@mlist is trying to get access to the protected-access datasets.

Hoohm commented 5 years ago

The Ensembl ID is going to be an issue here. I would have to modify the counting step in dropseq tools, which is written in Java, or rewrite it in Python. I'm not sure I know how to do a proper count estimation.

grst commented 5 years ago

Hmm, wasn't expecting that to be an issue.

Are the gene symbols hardcoded into the dropseq tools then? Does it not just get the feature annotations from somewhere (such as a GTF file)?

Hoohm commented 5 years ago

It is using a GTF file for annotation. We could "cheat" and relabel the ENSEMBL IDs as gene_name in the GTF, and it should work. But I have to test it.
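A minimal sketch of that "cheat", assuming the counting step picks up the `gene_name` attribute from the GTF and that each record carries a `gene_id "..."` attribute in the usual Ensembl/GENCODE style. The function names are hypothetical and this is untested against dropseq tools:

```python
import re

# Attribute patterns in the 9th GTF column, e.g. gene_id "ENSG..."; gene_name "DDX11L1";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'gene_name "[^"]*"')


def use_gene_id_as_name(gtf_line: str) -> str:
    """Return the GTF line with the gene_name value replaced by the gene_id value."""
    if gtf_line.startswith("#"):
        return gtf_line  # header/comment lines pass through unchanged
    match = GENE_ID_RE.search(gtf_line)
    if match is None:
        return gtf_line  # nothing to copy from, leave the record as-is
    replacement = f'gene_name "{match.group(1)}"'
    if GENE_NAME_RE.search(gtf_line):
        return GENE_NAME_RE.sub(lambda _m: replacement, gtf_line, count=1)
    # No gene_name attribute yet: append one at the end of the record.
    body = gtf_line.rstrip("\n")
    ending = gtf_line[len(body):]
    if not body.rstrip().endswith(";"):
        body = body.rstrip() + ";"
    return f"{body} {replacement};{ending}"


def rewrite_gtf(in_path: str, out_path: str) -> None:
    """Stream a GTF file and write a copy whose gene_name fields hold Ensembl IDs."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(use_gene_id_as_name(line))
```

The original symbols are lost in the rewritten file, so the unmodified GTF would still be needed to map the IDs back to gene names afterwards.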

grst commented 5 years ago

If there are multiple entries, it also outputs HGNC.1, HGNC.2 and so on. If they come out in the same order, we could just remap them. Also, in that case I don't see any reason why the 'hack' you propose should not work.
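A rough sketch of what that remapping could look like. It assumes that symbols occurring once are written without a suffix, that duplicated symbols are numbered `.1`, `.2`, ... in annotation order, and that this order matches the output; none of that has been checked against dropseq tools. The function name and the IDs in the usage note are made up for illustration:

```python
from collections import Counter


def suffixed_symbol_to_ensembl(genes):
    """Map output labels (SYMBOL or SYMBOL.N) to Ensembl IDs.

    genes: list of (ensembl_id, symbol) tuples in annotation order.
    Assumption: unique symbols stay bare, duplicates get .1, .2, ... in order.
    """
    totals = Counter(symbol for _, symbol in genes)
    seen = Counter()
    mapping = {}
    for ensembl_id, symbol in genes:
        if totals[symbol] == 1:
            label = symbol
        else:
            seen[symbol] += 1
            label = f"{symbol}.{seen[symbol]}"
        mapping[label] = ensembl_id
    return mapping
```

For example, `suffixed_symbol_to_ensembl([("ID_A", "FOO"), ("ID_B", "BAR"), ("ID_C", "FOO")])` gives `FOO.1` for `ID_A`, `BAR` for `ID_B` and `FOO.2` for `ID_C`.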

Hoohm commented 5 years ago

We have to be careful though. 3' end capturing is different from normal RNA-seq. With normal RNA-seq (or 5' scRNA-seq) you would see different transcripts pop up. Normally you should not see similar patterns in 3' end capturing, since you only capture the end of the gene. So overlaps across different introns/exons should not pop up.

grst commented 5 years ago

I think I have not fully understood this. Maybe we can discuss it when you're back.

We could also have a look at how they do it in cellranger. The mtx files from 10x usually have ENSEMBL identifiers.

grst commented 5 years ago

I have the feeling that HGNC.X maps uniquely to Ensembl anyway, e.g. http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000236478;r=2:216174896-216176032;t=ENST00000441511