thelovelab / tximeta

Transcript quantification import with automatic metadata detection
https://thelovelab.github.io/tximeta/

Initial thoughts #1

Closed PeteHaitch closed 4 years ago

PeteHaitch commented 6 years ago

This is super cool, Mike and Rob!

Here are my comments and thoughts after running through the vignette and stepping through the code of tximeta().


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L39 Perhaps tximeta() should accept additional arguments that get passed through to the call to tximport(); e.g., from memory I've used the countsFromAbundance argument of tximport() to get scaled counts, but I can't do that via tximeta().
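As a rough sketch of what I mean (the signature below is hypothetical, not tximeta's actual code), tximeta() could simply forward ... to tximport():

```r
# Hypothetical sketch only: pass extra arguments through to tximport() so
# options like countsFromAbundance remain available to the caller.
tximeta <- function(coldata, type = "salmon", ...) {
  files <- coldata$files
  names(files) <- coldata$names
  txi <- tximport::tximport(files, type = type, txOut = TRUE, ...)
  # ...metadata detection and range attachment would follow here...
  txi
}

# which would then allow, e.g.:
# se <- tximeta(coldata, countsFromAbundance = "lengthScaledTPM")
```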


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L42 How/when is hashtable.csv constructed? Is this Salmon/Sailfish magic (e.g. inferred from the GTF/FASTA file contents, the filename, or the Salmon/Sailfish index), or does the user need to populate these fields manually? How robust is an automagic procedure to dumb file naming (e.g. ensembl.gtf or mouse.fa)?

Your example GTF and FASTA files use FTP addresses. What if these are local resources (e.g. /genomes/hg19/hg19.fa)? Presumably the TxDb can still be constructed if the path is absolute and you're on the right machine, but otherwise you're stuck?
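To make the local-vs-remote point concrete, a quick illustration (both the URL and the path below are placeholders): makeTxDbFromGFF() accepts either, but the local variant only resolves on the machine that actually holds the file.

```r
library(GenomicFeatures)

# Remote GTF: works anywhere with network access (placeholder URL)
txdb <- makeTxDbFromGFF("ftp://example.org/annotation/Homo_sapiens.GRCh38.gtf.gz")

# Local GTF: only works where this absolute path exists (hypothetical path)
txdb <- makeTxDbFromGFF("/genomes/hg19/hg19_genes.gtf")
```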


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L59 It'd be cool to try integrating with BiocFileCache by @lshep to be able to use the TxDb object across different projects.
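Roughly what I have in mind, as a sketch only (the caching scheme and resource naming are made up, not an existing tximeta feature):

```r
library(BiocFileCache)
library(GenomicFeatures)

bfc <- BiocFileCache()  # per-user cache, shared across projects
gtf <- "ftp://example.org/annotation/Homo_sapiens.GRCh38.gtf.gz"  # placeholder

hit <- bfcquery(bfc, gtf, field = "rname")
if (nrow(hit) == 0) {
  # First use: build the TxDb and stash its sqlite file in the cache
  txdb <- makeTxDbFromGFF(gtf)
  saveDb(txdb, file = bfcnew(bfc, rname = gtf, ext = ".sqlite"))
} else {
  # Subsequent projects: reload the cached TxDb instead of rebuilding it
  txdb <- loadDb(hit$rpath[1])
}
```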


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L69

When might there be transcripts in the DB that aren't in the Salmon output? Should a warning/message be thrown if these are encountered? If hashtable.csv gets borked, might there be instances where the Salmon output contains transcripts not found in the DB? How would those be handled?
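Something like the following could at least surface a mismatch (illustrative only; txi is the tximport result and txps the transcript ranges pulled from the TxDb):

```r
# Transcripts quantified by Salmon but missing from the TxDb, and vice versa
not_in_db     <- setdiff(rownames(txi$counts), names(txps))
not_in_salmon <- setdiff(names(txps), rownames(txi$counts))
if (length(not_in_db) > 0)
  warning(length(not_in_db), " quantified transcripts were not found in the TxDb")
if (length(not_in_salmon) > 0)
  message(length(not_in_salmon), " TxDb transcripts have no quantification")
```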


https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L71-L79

Some/most of this could be replaced by a call to Seqinfo() with the genome arg, e.g.,

```r
message("fetching genome info")
genome <- hashtable$genome[idx]
ucsc_genome <- genome2UCSC(genome)
# Subset to only retain 'active' seqlevels
seqinfo(txps) <- Seqinfo(genome = ucsc_genome)[seqlevels(txps)]
```

Minor point: Using either approach, genome(txps) is now hg38 rather than GRCh38.
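If keeping the GRCh38 label matters downstream, a one-line follow-up could restore it (sketch only, assuming txps carries a Seqinfo as in the snippet above):

```r
# Relabel the genome slot after pulling UCSC-style chromosome info
genome(txps) <- "GRCh38"
```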


mikelove commented 6 years ago

Thanks again Peter. Started adding in some of your ideas here:

https://github.com/mikelove/tximeta/commit/718fb82003b98e168e033ac187d2967920e88d37

Gencode is funny because it says GRCh38 but uses "chr"-style seqnames. On second inspection this isn't strange: "chr1" is just how NCBI stores it, while "1" is how EBI stores it. I thought "chr1" came from UCSC at first, but now I see that UCSC and NCBI only disagree on the name of the genome, "hg38" vs "GRCh38".

I think the solution for GRCh38 with non-"chr" seqnames will be to use ensembldb.

Next up is some experimenting with derived txomes.

mikelove commented 6 years ago

Cool, ensembldb is absolutely the way to go for grabbing seqinfo for non-"chr" seqnames. It's all done here:

https://github.com/jotsetung/ensembldb/blob/bc8cef9d9bfd9c79d858c8af437d21dc1f9a0f6b/R/functions-create-EnsDb.R#L1656-L1659
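For anyone following along, this is roughly what it looks like from the user side (EnsDb.Hsapiens.v86 is just one example annotation package):

```r
library(ensembldb)
library(EnsDb.Hsapiens.v86)

edb  <- EnsDb.Hsapiens.v86
txps <- transcripts(edb)
seqinfo(txps)  # Ensembl-style seqnames ("1", "2", ..., "MT") with genome GRCh38
```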

PeteHaitch commented 6 years ago

Cool, how good is it when someone's already done the work :)

mikelove commented 6 years ago

I'm using this thread to keep track of the ways this is done in different packages.

AnnotationHub has its own version of munging together a Seqinfo object:

https://github.com/Bioconductor/AnnotationHub/blob/4737e81ca6bbe5a41ba8a5afe7caac5f401d54e2/R/utilities.R#L121-L132
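The common denominator in these snippets is just the Seqinfo() constructor filled in by hand; for my own reference, a minimal example (values are illustrative):

```r
library(GenomeInfoDb)

# Build a Seqinfo manually from names, lengths, circularity flags, and genome
si <- Seqinfo(seqnames   = c("1", "2", "MT"),
              seqlengths = c(248956422L, 242193529L, 16569L),
              isCircular = c(FALSE, FALSE, TRUE),
              genome     = "GRCh38")
```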