thelovelab / tximeta

Transcript quantification import with automatic metadata detection
https://thelovelab.github.io/tximeta/
66 stars 11 forks source link

Is it possible to extract transcript nucleotide sequence? #43

Closed kvittingseerup closed 4 years ago

kvittingseerup commented 4 years ago

Since tximeta also know both know about the linked GTF and corresponding nucleotide fasta file(s) is it possible to extract the underlying transcript nucleotide sequences? Or is that only used for generating the checksums?

Cheers Kristoffer

mikelove commented 4 years ago

I think I already have unexported code to fetch the sequence, I’ll work on this.

mikelove commented 4 years ago

I added retrieveCDNA in 38623e8

e.g.

cdna <- retrieveCDNA(se)

Because the DNAStringSet is a pretty big object with respect to the SummarizedExperiment, I think it makes sense to provide the cDNA separately, so similar to the existing retrieveDb function, which does not return a SummarizedExperiment. This also uses the caching system, so it will only download once, and subsequently load from tximeta's cache.

I also don't do any ordering or matching of the cdna to the rows of se for a number of reasons. One is that this is the easiest approach, as otherwise, what to do with transcript sequence that isn't matching on the rows of se (e.g. duplicates removed by Salmon indexing, or missing from GTF). Secondly, this allows the function to work even when se has been summarized to gse. So it's just pulling down the exact transcript sequence that was provided to salmon index, no more no less.

kvittingseerup commented 4 years ago

It works very nicely 👍 . If you wan to be meticulous the transcript names of the DNAStringSet does not match what is used in the RangedSummarizedExperiment:

ntSeq <- retrieveCDNA(se)
head(names(ntSeq),2)
[1] "FBtr0070129 cdna chromosome:BDGP6.22:X:656673:657899:1 gene:FBgn0025637 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:SkpA"
[2] "FBtr0070126 cdna chromosome:BDGP6.22:X:656356:657899:1 gene:FBgn0025637 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:SkpA"

 head( rowData(se)$tx_id, 2)
[1] "FBtr0070129" "FBtr0070126"

Do you have some unexported code to fix this as well or is that handled by Salmon originally?

mikelove commented 4 years ago

This differs by organism and by source, and rather than do a bigger implementation to try to cover all the cases, I thought to just pass the data directly.

I believe this is handled upstream by salmon index.

kvittingseerup commented 4 years ago

Guess I will have to do it then ;-)