Closed jma1991 closed 4 years ago
Thanks for reporting this. Can you provide some code that would enable import using ensembldb functions?
Another option would be to ask the ensembldb maintainer @jorainer how this case should be handled, as I suppose they have constructed it to import only GTF files as provided by Ensembl. We could do something inside tximeta to work around this, or they may have something else in mind.
Here is my attempt at parsing the index data and providing this to the ensembldb function:
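# Load the packages needed to parse the linkedTxome JSON and build the EnsDb
pkg <- c("ensembldb", "jsonlite")
lib <- lapply(pkg, library, character.only = TRUE)
# Read the linkedTxome index, which records organism, genome and release
idx <- fromJSON("results/tximeta/GRCm38.p6.EGFP.json")
# Build the EnsDb SQLite file, passing the metadata that cannot be parsed from the file name
sql <- ensDbFromGtf(
  gtf = "results/genomepy/GRCm38.p6.EGFP/GRCm38.p6.EGFP.annotation.gtf",
  outfile = tempfile(),
  organism = idx$organism,
  genomeVersion = idx$genome,
  version = idx$release
)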
"""
Importing GTF file ... OK
Processing metadata ... OK
Processing genes ...
Attribute availability:
o gene_id ... OK
o gene_name ... OK
o entrezid ... Nope
o gene_biotype ... OK
OK
Processing transcripts ...
Attribute availability:
o transcript_id ... OK
o gene_id ... OK
o transcript_biotype ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... OK
-------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
Warning messages:
1: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism, :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
2: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion, :
Could not determine length for all seqnames.
3: In UseMethod("seq") :
closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-99/mysql/)
"""
edb <- EnsDb(sql)
The second warning is produced because it can't find the sequence length for the technical sequence I added to the transcriptome. Apart from that, it builds the EnsDb database just fine and I seem to be able to use all the usual accessors to pull information (including all the attributes I expected from the file: gene_id, gene_biotype, and gene_name).
Alternatively, it might be simpler to create a TxDb database from the GTF file and tell tximeta that you are using a custom annotation, without specifying the original source (e.g. Ensembl) of the gene annotations. However, creating a TxDb object using makeTxDbFromGFF drops all the attribute information, so it's not that helpful either.
Have you got a pipeline or function to add more annotations to the tximeta database? I wouldn't mind helping to update the database with special-case transcriptomes. For scRNA-seq, I imagine the most common use cases are a combination of transcriptome, spike-in, and reporter?
Thanks James!
I'd like to now try and incorporate your code into my ensDbFromGtf call inside tximeta. I think I will look into some kind of switch that determines that we are working with a GTF that needs manual specification of organism, genomeVersion and version, but I can just keep track that this is a linkedTxome, for example. I'm traveling next week but will put this on my list of TODOs for tximeta.
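A minimal sketch of such a switch, in base R only. The helper name ensdb_gtf_args and the linkedTxome flag are hypothetical and not part of tximeta or ensembldb; the organism, genome and release fields follow the JSON fields used in the code earlier in this thread, and the values shown are placeholders.

```r
# Hypothetical helper, not part of tximeta or ensembldb: assemble the
# arguments for ensDbFromGtf depending on whether the txome is linked.
ensdb_gtf_args <- function(txome, gtf, outfile = tempfile()) {
  args <- list(gtf = gtf, outfile = outfile)
  if (isTRUE(txome$linkedTxome)) {
    # Custom GTF: the file name cannot be parsed for metadata, so pass the
    # values recorded in the linkedTxome explicitly.
    args$organism <- txome$organism
    args$genomeVersion <- txome$genome
    args$version <- txome$release
  }
  args
}

# Placeholder metadata standing in for a parsed linkedTxome JSON file
txome <- list(
  organism = "Mus_musculus",
  genome = "GRCm38",
  release = "99",
  linkedTxome = TRUE
)
args <- ensdb_gtf_args(txome, gtf = "custom.annotation.gtf")
```

The resulting list could then be handed on with do.call(ensembldb::ensDbFromGtf, args). For a standard Ensembl GTF, only gtf and outfile are set and the remaining values are left for ensDbFromGtf to work out from the file name.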
Nice solution @jma1991 !
In general, I would however suggest avoiding ensDbFromGtf as much as possible: it lacks some annotations and the GTF file format is not that stable. I had to tweak the function already because Ensembl changed the format at some point and I got errors importing/extracting data from it. The best solution, at least for standard releases, is to use the pre-built EnsDb databases that are available in AnnotationHub. @mikelove, let me know if you need a hand there (if you haven't already implemented that).
Thanks for the feedback @jorainer
re: AHub, yes, in devel, tximeta will now use the pre-built EnsDb's.
re: this particular case, we may need to use ensDbFromGtf, because the reference txome is customized. It is the GTF from Ensembl plus some additional sequences, which are added to the FASTA and to the GTF, and then @jma1991 is building a linkedTxome so that future users will be able to verify that they have the exact FASTA and GTF that describes the txome used for quantification. @jma1991's proposal seems like the best case, which I will incorporate once I get a chance.
@jma1991:
Can you try this new version of tximeta? I've split the EnsDb building step based on whether it is a linkedTxome:
0968dfbf63d6ae92d52122830450af238b5453ea
E.g. you can do:
devtools::install_github("mikelove/tximeta", dependencies=FALSE)
I think this is fixed, closing
I am trying to use tximeta (1.4.3) to import read counts from alevin (1.1.0) using a custom annotation (Ensembl + technical sequences) and a decoy-aware transcriptome.
I am able to make the linkedTxome JSON file and have populated it with all the required parameters:
However when I use tximeta an error from the ensembldb package stops the GTF file from being imported:
The GTF file I am using is not named according to the conventional Ensembl naming scheme. This is because the GTF file contains a number of additional technical sequences and I would like the filename to reflect these additions. Although these attributes cannot be parsed from the GTF filename, they can be found in the linkedTxome JSON file and could be provided directly to the various ensDbFrom* functions when needed?
# Edit
I am able to import the read counts when I change the source attribute in the linkedTxome from "Ensembl" to a string which isn't recognized i.e. "Custom". However the ranges generated from this method only include gene_id as a column whereas I expected all of the attributes in the GTF file to be imported, especially gene_name. Should I revert back to tximport for this particular use-case?
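As a rough illustration of what is lost when only gene_id is kept, here is a base R sketch that pulls named attributes out of the ninth column of a GTF record. The function parse_gtf_attributes and the EGFP record are hypothetical, made up for this example; in practice rtracklayer::import reads these attributes into metadata columns.

```r
# Hypothetical helper: extract selected key "value" pairs from the
# attribute column (column 9) of a single GTF record.
parse_gtf_attributes <- function(attr_string, keys = c("gene_id", "gene_name")) {
  vapply(keys, function(key) {
    # Match the key at the start of the string or directly after a semicolon
    pattern <- paste0('(?:^|;\\s*)', key, ' "([^"]*)"')
    hit <- regmatches(attr_string, regexec(pattern, attr_string, perl = TRUE))[[1]]
    # hit holds the full match and the captured value, or is empty if absent
    if (length(hit) == 2) hit[2] else NA_character_
  }, character(1))
}

# Made-up attribute string for a reporter sequence added to the annotation
record <- 'gene_id "EGFP_gene"; transcript_id "EGFP_tx"; gene_name "EGFP"; gene_biotype "reporter";'
attrs <- parse_gtf_attributes(record, keys = c("gene_id", "gene_name", "entrezid"))
```

Keys that are absent from the record, such as entrezid here, come back as NA rather than raising an error.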