Closed kvittingseerup closed 4 years ago
I think I already have unexported code to fetch the sequence, I’ll work on this.
I added retrieveCDNA
in 38623e8
e.g.
cdna <- retrieveCDNA(se)
Because the DNAStringSet is a pretty big object with respect to the SummarizedExperiment, I think it makes sense to provide the cDNA separately, so similar to the existing retrieveDb
function, which does not return a SummarizedExperiment. This also uses the caching system, so it will only download once, and subsequently load from tximeta's cache.
I also don't do any ordering or matching of the cdna
to the rows of se
for a number of reasons. One is that this is the easiest approach, as otherwise, what to do with transcript sequence that isn't matching on the rows of se
(e.g. duplicates removed by Salmon indexing, or missing from GTF). Secondly, this allows the function to work even when se
has been summarized to gse
. So it's just pulling down the exact transcript sequence that was provided to salmon index
, no more no less.
It works very nicely 👍 . If you wan to be meticulous the transcript names of the DNAStringSet does not match what is used in the RangedSummarizedExperiment:
ntSeq <- retrieveCDNA(se)
head(names(ntSeq),2)
[1] "FBtr0070129 cdna chromosome:BDGP6.22:X:656673:657899:1 gene:FBgn0025637 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:SkpA"
[2] "FBtr0070126 cdna chromosome:BDGP6.22:X:656356:657899:1 gene:FBgn0025637 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:SkpA"
head( rowData(se)$tx_id, 2)
[1] "FBtr0070129" "FBtr0070126"
Do you have some unexported code to fix this as well or is that handled by Salmon originally?
This differs by organism and by source, and rather than do a bigger implementation to try to cover all the cases, I thought to just pass the data directly.
I believe this is handled upstream by salmon index
.
Guess I will have to do it then ;-)
Since tximeta also know both know about the linked GTF and corresponding nucleotide fasta file(s) is it possible to extract the underlying transcript nucleotide sequences? Or is that only used for generating the checksums?
Cheers Kristoffer