thelovelab / tximeta

Transcript quantification import with automatic metadata detection
https://thelovelab.github.io/tximeta/

Initial thoughts #1

Closed PeteHaitch closed 4 years ago

PeteHaitch commented 6 years ago

This is super cool, Mike and Rob!

Here are my comments and thoughts after running through the vignette and stepping through the code of tximeta().


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L39 Perhaps tximeta() should accept additional arguments that get passed through to the call to tximport(); e.g., from memory I've used the countsFromAbundance argument of tximport() to get scaled counts, but I can't do that via tximeta().
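As a rough sketch of what I mean (the signature below is hypothetical, not tximeta's actual code), tximeta() could simply forward ... to tximport():

```r
# Hypothetical sketch only: pass extra arguments through to tximport() so
# options like countsFromAbundance remain available to the caller.
tximeta <- function(coldata, type = "salmon", ...) {
  files <- coldata$files
  names(files) <- coldata$names
  txi <- tximport::tximport(files, type = type, txOut = TRUE, ...)
  # ...metadata detection and range attachment would follow here...
  txi
}

# which would then allow, e.g.:
# se <- tximeta(coldata, countsFromAbundance = "lengthScaledTPM")
```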


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L42 How/when is hashtable.csv constructed? Is this Salmon/Sailfish magic (e.g. inferred from the GTF/FASTA file contents, the filename, or the Salmon/Sailfish index), or does the user need to populate these fields manually? How robust is an automagic procedure to dumb file naming (e.g. ensembl.gtf or mouse.fa)?

Your example GTF and FASTA files use FTP addresses. What if these are local resources (e.g. /genomes/hg19/hg19.fa)? Presumably the TxDb can still be constructed if the path is absolute and you're on the right machine, but otherwise you're stuck?
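To make the local-vs-remote point concrete, a quick illustration (both the URL and the path below are placeholders): makeTxDbFromGFF() accepts either, but the local variant only resolves on the machine that actually holds the file.

```r
library(GenomicFeatures)

# Remote GTF: works anywhere with network access (placeholder URL)
txdb <- makeTxDbFromGFF("ftp://example.org/annotation/Homo_sapiens.GRCh38.gtf.gz")

# Local GTF: only works where this absolute path exists (hypothetical path)
txdb <- makeTxDbFromGFF("/genomes/hg19/hg19_genes.gtf")
```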


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L59 It'd be cool to try integrating with BiocFileCache by @lshep to be able to use the TxDb object across different projects.
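Roughly what I have in mind, as a sketch only (the caching scheme and resource naming are made up, not an existing tximeta feature):

```r
library(BiocFileCache)
library(GenomicFeatures)

bfc <- BiocFileCache()  # per-user cache, shared across projects
gtf <- "ftp://example.org/annotation/Homo_sapiens.GRCh38.gtf.gz"  # placeholder

hit <- bfcquery(bfc, gtf, field = "rname")
if (nrow(hit) == 0) {
  # First use: build the TxDb and stash its sqlite file in the cache
  txdb <- makeTxDbFromGFF(gtf)
  saveDb(txdb, file = bfcnew(bfc, rname = gtf, ext = ".sqlite"))
} else {
  # Subsequent projects: reload the cached TxDb instead of rebuilding it
  txdb <- loadDb(hit$rpath[1])
}
```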


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L69

When might there be transcripts in the DB that aren't in the Salmon output? Should a warning/message be thrown if these are encountered? If hashtable.csv gets borked, might there be instances where the Salmon output contains transcripts not found in the DB? How would those be handled?
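Something like the following could at least surface a mismatch (illustrative only; txi is the tximport result and txps the transcript ranges pulled from the TxDb):

```r
# Transcripts quantified by Salmon but missing from the TxDb, and vice versa
not_in_db     <- setdiff(rownames(txi$counts), names(txps))
not_in_salmon <- setdiff(names(txps), rownames(txi$counts))
if (length(not_in_db) > 0)
  warning(length(not_in_db), " quantified transcripts were not found in the TxDb")
if (length(not_in_salmon) > 0)
  message(length(not_in_salmon), " TxDb transcripts have no quantification")
```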


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L71-L79

Some/most of this could be replaced by a call to Seqinfo() with the genome arg, e.g.,

```r
message("fetching genome info")
genome <- hashtable$genome[idx]
ucsc_genome <- genome2UCSC(genome)
# Subset to only retain 'active' seqlevels
seqinfo(txps) <- Seqinfo(genome = ucsc_genome)[seqlevels(txps)]
```

Minor point: Using either approach, genome(txps) is now hg38 rather than GRCh38.
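If keeping the GRCh38 label matters downstream, a one-line follow-up could restore it (sketch only, assuming txps carries a Seqinfo as in the snippet above):

```r
# Relabel the genome slot after pulling UCSC-style chromosome info
genome(txps) <- "GRCh38"
```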


mikelove commented 6 years ago

Thanks again Peter. Started adding in some of your ideas here:

https://github.com/mikelove/tximeta/commit/718fb82003b98e168e033ac187d2967920e88d37

Gencode is funny because it says GRCh38 but uses "chr"-style seqnames. On second inspection this isn't strange: "chr1" is just how NCBI stores it, while "1" is how EBI stores it. I thought "chr1" came from UCSC at first, but now I see that UCSC and NCBI only disagree on the name of the genome, "hg38" vs "GRCh38".

I think the solution for GRCh38 with non-"chr" seqnames will be to use ensembldb.

Next up is some experimenting with derived txomes.

mikelove commented 6 years ago

Cool, ensembldb is absolutely the way to go for grabbing seqinfo for non-"chr" seqnames. It's all done here:

https://github.com/jotsetung/ensembldb/blob/bc8cef9d9bfd9c79d858c8af437d21dc1f9a0f6b/R/functions-create-EnsDb.R#L1656-L1659
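For anyone following along, this is roughly what it looks like from the user side (EnsDb.Hsapiens.v86 is just one example annotation package):

```r
library(ensembldb)
library(EnsDb.Hsapiens.v86)

edb  <- EnsDb.Hsapiens.v86
txps <- transcripts(edb)
seqinfo(txps)  # Ensembl-style seqnames ("1", "2", ..., "MT") with genome GRCh38
```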

PeteHaitch commented 6 years ago

Cool, how good is it when someone's already done the work :)

mikelove commented 6 years ago

I'm using this thread to keep track of the ways this is done in different packages.

AnnotationHub has its own version of munging together a Seqinfo object:

https://github.com/Bioconductor/AnnotationHub/blob/4737e81ca6bbe5a41ba8a5afe7caac5f401d54e2/R/utilities.R#L121-L132
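The common denominator in these snippets is just the Seqinfo() constructor filled in by hand; for my own reference, a minimal example (values are illustrative):

```r
library(GenomeInfoDb)

# Build a Seqinfo manually from names, lengths, circularity flags, and genome
si <- Seqinfo(seqnames   = c("1", "2", "MT"),
              seqlengths = c(248956422L, 242193529L, 16569L),
              isCircular = c(FALSE, FALSE, TRUE),
              genome     = "GRCh38")
```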