thelovelab / tximeta

Transcript quantification import with automatic metadata detection

https://thelovelab.github.io/tximeta/

Missing my favorite organism/source [please add as a comment to this issue] #13

Open mikelove opened 6 years ago

mikelove commented 6 years ago

Please add any organism or source that we are missing that you'd like to be covered by tximeta, and we will consider the best way to fold it in. We want to cover as many use cases as possible, and support and encourage linkedTxome for remaining cases.

matthewdavidsmith commented 5 years ago

Wanted to try tximeta out. With Salmon 0.14.1 I built a Salmon index from the GENCODE v29 transcriptome (with decoys) currently posted on the main Salmon site, using the --gencode flag, and quantified in mapping mode. When I tried to create a SummarizedExperiment, though, tximeta was unable to recognize the transcriptome. Is something wrong, or is GENCODE v29 not implemented?

Thanks!

> se <- tximeta(samples_tximeta)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 

tximeta needs a BiocFileCache directory to access and save TxDb objects.
Do you wish to use the default directory: 'C:\Users\msmit248\AppData\Local\BiocFileCache\BiocFileCache\Cache'?
If not, a temporary directory that is specific to this R session will be used.

You can always change this directory later by running: setTximetaBFC()
Or enter [0] to exit and set this directory manually now. 

1: Yes (use default)
2: No (use temp)

Selection: 2
couldn't find matching transcriptome, returning un-ranged SummarizedExperiment
> se
class: SummarizedExperiment 
dim: 205870 10 
metadata(3): tximetaInfo quantInfo countsFromAbundance
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ... ENST00000387460.2 ENST00000387461.2
rowData names(0):
colnames(10): kw01_ifng kw01_veh ... p154_ifng p154_veh
colData names(6): run person ... batch names
> dim(se)
[1] 205870     10

mikelove commented 5 years ago

I think we’ll need to get Gencode + decoy hash values from @rob-p. Correct Rob? We’ll work out a pipeline.

mikelove commented 5 years ago

Just more information: a short-term fix would be to use a linkedTxome to connect the index to the source yourself.

But we really want these indices to automatically connect to the reference.

@rob-p do you think we should pass the hash of the -t transcripts alone to the JSON files, as a separate hash in addition to the transcripts plus the decoys? I'm not sure how the hashing is currently performed. Both hash values may be useful. Going forward, to connect to the GA4GH API we will need the hash value of the -t transcripts alone.

mikelove commented 5 years ago

The Gencode + decoy hash was going to break our plans to integrate with GA4GH to support all txomes, as the hash value on the server side wouldn't include the decoy sequence. So the next version of Salmon will report the -t hash and the decoy hash separately, and tximeta will still work out of the box. In the meantime, you can explicitly link the txome to the GTF using makeLinkedTxome as shown in the vignette.
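For reference, the makeLinkedTxome workaround might look roughly like the sketch below. The index directory and the FASTA/GTF paths are hypothetical placeholders, and the organism and genome values are my assumption for a human GENCODE v29 index; check the vignette for the exact values that fit your files. The call is guarded so that it only runs when tximeta is installed and the index directory exists.

```r
# Arguments for tximeta::makeLinkedTxome(). All paths are hypothetical
# placeholders; point them at your own Salmon index and reference files.
linked_txome_args <- list(
  indexDir = "salmon_index/gencode_v29_decoys",      # hypothetical index location
  source   = "GENCODE",
  organism = "Homo sapiens",                         # assumed for a human index
  release  = "29",
  genome   = "GRCh38",                               # assumed genome build
  fasta    = "ref/gencode.v29.transcripts.fa.gz",    # hypothetical local path
  gtf      = "ref/gencode.v29.annotation.gtf.gz",    # hypothetical local path
  write    = FALSE                                   # do not write a JSON file
)

# Only attempt the link when the package and the index are actually present.
can_link <- requireNamespace("tximeta", quietly = TRUE) &&
  dir.exists(linked_txome_args$indexDir)

if (can_link) {
  # Registers the index hash -> FASTA/GTF mapping in the local cache,
  # so a later tximeta() call can attach ranges and metadata.
  do.call(tximeta::makeLinkedTxome, linked_txome_args)
} else {
  message("tximeta or the index directory is missing; skipping the link step")
}
```

After the link is made, rerunning `tximeta(samples_tximeta)` should find the matching transcriptome instead of returning an un-ranged SummarizedExperiment.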

mikelove commented 5 years ago

This thread made me realize that the approach above, reporting separate hashes, would also be a useful way to preserve the reference hash value when users want to add non-reference transcripts. For example, sometimes users will add ERCC spike-ins, viral sequences, or fusion genes. It may be useful to have a reference hash as well as a hash of non-reference sequences, and a total hash...

jtheorell commented 5 years ago

ERCC would be great!

mikelove commented 5 years ago

Thanks for the feedback @jtheorell

So we don't have this working yet, but my thought was that we could have Salmon distinguish between the "primary" reference sequences of interest (e.g. transcripts) and other, perhaps "technical", sequences such as spike-in or decoy sequences. Salmon will quantify against all these sequences, but for the purpose of txome identification, we'd like to know the hash of the primary seqs as well as the hash of the primary plus the technical seqs. This way we will at least be able to identify the provenance of the primary seqs. Given that the technical seqs may be very idiosyncratic, it's unlikely to be possible to identify primary + technical without the user creating a linkedTxome.

We don't have a formalized mechanism for this now, but it's a sketch of a solution. The current solution would be linkedTxome + Zenodo deposit for FASTA and GTF.
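To make the primary vs. primary + technical idea concrete, here is a toy sketch in base R. It is not how Salmon computes its hashes: the sequences are made up, the helper `hash_seqs` is hypothetical, and MD5 via `tools::md5sum` stands in for whatever hashing Salmon actually uses. It only illustrates that a separate primary hash stays stable when different technical sequences are added, while the total hash does not.

```r
# Toy helper (hypothetical): hash a character vector of sequences by writing
# it to a temporary file and taking the file's MD5 checksum.
hash_seqs <- function(seqs) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(seqs, tmp)
  unname(tools::md5sum(tmp))
}

# Made-up sequences standing in for a reference txome and two different
# sets of "technical" sequences (e.g. spike-ins or decoys).
primary     <- c("ACGTACGTAC", "TTGACCAGGT")
technical_a <- c("GGGCCCGGGC")
technical_b <- c("AAATTTAAAT")

# Two indices built from the same primary sequences but different extras.
index_a <- list(primary = hash_seqs(primary),
                total   = hash_seqs(c(primary, technical_a)))
index_b <- list(primary = hash_seqs(primary),
                total   = hash_seqs(c(primary, technical_b)))

# The primary hash identifies the reference in both indices ...
same_primary <- identical(index_a$primary, index_b$primary)
# ... while the total hash differs because the technical sequences differ.
same_total <- identical(index_a$total, index_b$total)
```

Here `same_primary` is TRUE and `same_total` is FALSE, which is why the primary hash alone is enough to recover the provenance of the reference.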

jtheorell commented 5 years ago

OK! I'll try as best I can to get it working for now, then. Thanks for your super rapid response!