mikelove opened 6 years ago
Wanted to try tximeta out. With Salmon 0.14.1 I prepared a Salmon index from the GENCODE v29 transcriptome (with decoys) currently up on the main Salmon site, using the --gencode flag. I quantified in mapping mode. When I tried to create a SummarizedExperiment, though, tximeta was unable to recognize the transcriptome. Is something wrong, or is GENCODE v29 not implemented?
Thanks!
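For context, a decoy-aware index of this kind is built roughly as below (a sketch following the Salmon decoy documentation; the gentrome and decoy filenames are placeholders for whatever files you prepared):

```
# gentrome.fa.gz = transcript sequences followed by decoy (genome) sequences
# decoys.txt     = names of the decoy sequences, one per line
salmon index \
  -t gentrome.fa.gz \
  -d decoys.txt \
  -i gencode_v29_index \
  --gencode \
  -k 31
```

The --gencode flag tells Salmon to parse the pipe-delimited GENCODE FASTA headers.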
> se <- tximeta(samples_tximeta)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10
tximeta needs a BiocFileCache directory to access and save TxDb objects.
Do you wish to use the default directory: 'C:\Users\msmit248\AppData\Local\BiocFileCache\BiocFileCache\Cache'?
If not, a temporary directory that is specific to this R session will be used.
You can always change this directory later by running: setTximetaBFC()
Or enter [0] to exit and set this directory manually now.
1: Yes (use default)
2: No (use temp)
Selection: 2
couldn't find matching transcriptome, returning un-ranged SummarizedExperiment
> se
class: SummarizedExperiment
dim: 205870 10
metadata(3): tximetaInfo quantInfo countsFromAbundance
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ... ENST00000387460.2 ENST00000387461.2
rowData names(0):
colnames(10): kw01_ifng kw01_veh ... p154_ifng p154_veh
colData names(6): run person ... batch names
> dim(se)
[1] 205870 10
I think we'll need to get the Gencode + decoy hash values from @rob-p. Correct, Rob? We'll work out a pipeline.
Just more information: a short-term fix would be to use linkedTxomes to connect the index to the source yourself. But we really want these indices to automatically connect to the reference.
@rob-p do you think we should pass the hash of the -t transcripts alone to the JSON files, as a separate hash in addition to the transcripts plus the decoys? I'm not sure how the hashing is currently performed. Both hash values may be useful. Going forward, to connect to the GA4GH API we will need the hash value of the -t transcripts alone.
The Gencode + decoy hash was going to break plans on integrating with GA4GH to support all txomes (as the hash value on the server side wouldn't include the decoy sequence), so the next version of Salmon will break out the -t hash and the decoy hash separately, and tximeta will still work out of the box. In the meantime, you can explicitly link the txome to the GTF using makeLinkedTxome as shown in the vignette.
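For reference, linking the index to its source by hand looks roughly like this (a sketch; the index path, JSON filename, and FTP URLs below are placeholders for your own files — see the tximeta vignette for the authoritative version):

```r
library(tximeta)

makeLinkedTxome(
  indexDir = "/path/to/gencode.v29_salmon_index",   # your Salmon index directory
  source   = "GENCODE",
  organism = "Homo sapiens",
  release  = "29",
  genome   = "GRCh38",
  fasta    = "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.transcripts.fa.gz",
  gtf      = "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz",
  write    = TRUE,
  jsonFile = "gencode_v29.json"                     # shareable record of the link
)
```

After this, tximeta() on quantifications from that index should attach the transcript ranges instead of returning an un-ranged SummarizedExperiment.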
This thread made me realize, the above workaround would be a useful technique to preserve the reference hash value when users want to add non-reference transcripts. For example, sometimes users will add ERCC spike-ins, viral sequences, or fusion genes. It may be useful to have a reference hash as well as a hash of non-reference sequences, and a total hash...
ERCC would be great!
Thanks for the feedback @jtheorell
So we don't have this working yet, but my thoughts were that we could have Salmon distinguish between the "primary" reference sequences of interest (e.g. transcripts), plus other perhaps "technical" sequences such as spike-in or decoy sequences. Salmon will quantify against all these sequences, but for the purpose of txome identification, we'd like to know the hash of the primary seqs as well as the primary plus the technical seqs. This way we will at least be able to identify the provenance of the primary. Given that the technical seqs may be very idiosyncratic, it's not likely possible to identify primary + technical without the user creating a linkedTxome.
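The idea of separate digests can be sketched with plain shell tools (this is an illustration, not Salmon's actual hashing scheme — the toy FASTA files and the headers-stripped digest are assumptions for the example):

```shell
# Digest the primary sequences alone, and primary + technical together,
# so the primary set's provenance stays identifiable after decoys are added.
printf '>tx1\nACGT\n>tx2\nGGCC\n' > primary.fa     # toy transcripts
printf '>decoy1\nTTTT\n' > technical.fa            # toy decoy/spike-in

# hash of the primary sequences only (headers stripped, sequence concatenated)
primary_hash=$(grep -v '^>' primary.fa | tr -d '\n' | sha256sum | cut -d' ' -f1)
# hash of primary + technical sequences together
total_hash=$(cat primary.fa technical.fa | grep -v '^>' | tr -d '\n' | sha256sum | cut -d' ' -f1)

echo "primary: $primary_hash"
echo "total:   $total_hash"
```

The two digests differ as soon as any technical sequence is added, which is why recording the primary hash separately preserves a stable identifier for the reference.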
We don't have a formalized mechanism for this now, but that's a sketch of a solution. The current solution would be linkedTxome + a Zenodo deposit for the FASTA and GTF.
OK! Trying as well as I can to get it to work for now, then. Thanks for your super-rapid response!
Please add any organism or source that we are missing that you'd like to be covered by tximeta, and we will consider the best way to fold it in. We want to cover as many use cases as possible, and support and encourage linkedTxome for remaining cases.