Closed PeteHaitch closed 4 years ago
Thanks again Peter. Started adding in some of your ideas here:
https://github.com/mikelove/tximeta/commit/718fb82003b98e168e033ac187d2967920e88d37
Gencode is funny because it says GRCh38, but uses "chr" style seqnames. This isn't strange on second inspection, "chr1" is just how NCBI stores it, while "1" is how EBI stores it. I thought "chr1" was from UCSC at first, but now I see that UCSC vs NCBI only disagree on the name of the genome, "hg38" vs "GRCh38".
I think the solution for GRCh38 with non-"chr" seqnames will be to use ensembldb
.
Next up is some experimenting with derived txomes.
Cool, ensembldb
is absolutely the way to go for grabbing seqinfo for non-"chr" seqnames. It's all done here:
Cool, how good is it when someone's already done the work :)
I'm using this thread for keeping track of ways that this is done in different packages.
AnnotationHub
has its own version of munging together a Seqinfo object:
This is super cool, Mike and Rob!
Here are my comments and thoughts after running through the vignette and stepping through the code of
tximeta()
https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L39 Perhaps
tximeta()
should accept additional args that can be passed to the call totximport()
, e.g., from memory I've used thecountsFromAbundance
arg oftximport()
to get scaled counts but I can't viatximeta()
https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L42 How/when is
hashtable.csv
constructed? Is this is Salmon/Sailfash magic (e.g. inferred from the GTF/FASTA file contents or filename or the Salmon/Sailfish index) or does the user need to populate these fields manually? How robust is an automagic procedure to dumb file naming (e.g.ensembl.gtf
ormouse.fa
)?Your example GTF and FASTA files use ftp addresses. What if these are local resources (e.g.
/genomes/hg19/hg19.fa
)? Presumably the TxDb can still be constructed if the path is absolute and you're on the right machine but otherwise you're stuck?https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L59 It'd be cool to try integrating with BiocFileCache by @lshep to be able to use the TxDb object across different projects.
https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L69
When might there be transcripts in the DB that aren't in the Salmon output? Should a warning/message be thrown if these are encountered? If
hashtable.csv
gets borked then might be instances where the Salmon output contains transcripts not found in the DB? How would these be handled?https://github.com/mikelove/tximeta/blob/c3e0d76dc2170b01349dd5602c1e6076b6b4c456/R/tximeta.R#L71-L79
Some/most of this could be replaced by a call to
Seqinfo()
with thegenome
arg, e.g.,Minor point: Using either approach,
genome(txps)
is nowhg38
rather thanGRCh38
.