Closed grst closed 5 years ago
The Ensembl ID is gonna be an issue here. I will have to either modify the counting from dropseq tools, which is in Java, or rewrite it in Python. Not sure I know how to do a proper count estimation.
Hmm, wasn't expecting that to be an issue.
Are the gene symbols hardcoded into the dropseq tools then? Does it not just get feature annotations from somewhere (such as a GTF file)?
It is using a GTF file for annotation. We could "cheat" and change the header of the ENSEMBL IDs to gene_name, and it should work. But I have to test it.
If there are multiple entries, it also outputs HGNC.1, HGNC.2, and so on.
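A minimal sketch of the "cheat", untested against dropseq tools: it rewrites the attribute column of a GTF line so that gene_name carries the gene_id value. The function name is hypothetical, the symbol `SYMBOL` in the usage line is a placeholder, and it assumes standard GTF attribute syntax (`key "value";`).

```python
import re

# Assumes attributes look like: gene_id "ENSG..."; gene_name "SYMBOL";
GENE_ID_RE = re.compile(r'\bgene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'\bgene_name "[^"]*"')


def use_ensembl_id_as_gene_name(gtf_line: str) -> str:
    """Return the GTF line with gene_name set to the gene_id value."""
    if gtf_line.startswith("#"):
        return gtf_line  # header/comment lines are left untouched
    newline = "\n" if gtf_line.endswith("\n") else ""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return gtf_line  # not a regular 9-column GTF record
    attrs = fields[8]
    match = GENE_ID_RE.search(attrs)
    if match is None:
        return gtf_line  # nothing to copy from
    replacement = f'gene_name "{match.group(1)}"'
    if GENE_NAME_RE.search(attrs):
        # Overwrite the existing symbol with the Ensembl ID.
        attrs = GENE_NAME_RE.sub(lambda _m: replacement, attrs)
    else:
        # No gene_name present: append one.
        base = attrs.rstrip()
        if not base.endswith(";"):
            base += ";"
        attrs = f"{base} {replacement};"
    fields[8] = attrs
    return "\t".join(fields) + newline


example = 'chr2\thavana\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG00000236478"; gene_name "SYMBOL";\n'
converted = use_ensembl_id_as_gene_name(example)
```

Applied to every line of the annotation, this would make a counter that groups by gene_name group by Ensembl ID instead. Whether dropseq tools reads gene_name this way still has to be tested.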
If that's in the same order we could just remap it. Also, in that case I don't see any reason why the 'hack' you propose should not work.
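For the remapping idea, here is a rough sketch. It rests on two unverified assumptions: suffixes are assigned in GTF order, and only duplicated symbols get a suffix (starting at .1). The function name and the IDs and symbols in the example are made up for illustration.

```python
from collections import defaultdict


def build_label_to_ensembl(gene_records):
    """Map output labels (SYMBOL or SYMBOL.N) to Ensembl IDs.

    gene_records: iterable of (ensembl_id, symbol) pairs in GTF order.
    Assumption: duplicated symbols are numbered .1, .2, ... in that order.
    """
    ids_by_symbol = defaultdict(list)
    for ensembl_id, symbol in gene_records:
        # Keep each gene once, preserving first-seen order.
        if ensembl_id not in ids_by_symbol[symbol]:
            ids_by_symbol[symbol].append(ensembl_id)

    mapping = {}
    for symbol, ids in ids_by_symbol.items():
        if len(ids) == 1:
            mapping[symbol] = ids[0]
        else:
            for i, ensembl_id in enumerate(ids, start=1):
                mapping[f"{symbol}.{i}"] = ensembl_id
    return mapping


# Made-up records: GENE1 occurs under two different Ensembl IDs.
records = [("ENSG_A", "GENE1"), ("ENSG_B", "GENE2"), ("ENSG_C", "GENE1")]
label_to_ensembl = build_label_to_ensembl(records)
```

If the numbering turns out to differ (for example, the first occurrence staying unsuffixed), only the inner loop would need to change.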
We have to be careful though. 3' end capturing is different from normal RNA-seq. With normal RNA-seq (or 5' scRNA-seq) you would see different transcripts pop up. You should not see similar patterns with 3' end capturing, since you normally only capture the end of the gene. So overlaps across different introns/exons should not pop up.
I think I have not fully understood this. Maybe we can discuss it when you're back.
We could also have a look at how they do it in cellranger. The mtx files from 10x usually have ENSEMBL identifiers.
I have the feeling that HGNC.X uniquely maps to Ensembl anyway, e.g.
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000236478;r=2:216174896-216176032;t=ENST00000441511
It's probably worth reprocessing everything:
@mlist tries to get access to the protected access datasets.