Modified Ensembl GTF parsing error

jma1991 commented 4 years ago

I am trying to use tximeta (1.4.3) to import read counts from alevin (1.1.0) using a custom annotation (Ensembl + technical sequences) and a decoy-aware transcriptome.

I am able to make the linkedTxome JSON file and have populated it with all the required parameters:

makeLinkedTxome(
    indexDir = "results/salmon/index/GRCm38.p6.EGFP",
    source = "Ensembl",
    organism = "Mus musculus",
    release = "99",
    genome = "GRCm38.p6.EGFP",
    fasta = "results/gffread/GRCm38.p6.EGFP/GRCm38.p6.EGFP.gentrome.fa",
    gtf = "results/genomepy/GRCm38.p6.EGFP/GRCm38.p6.EGFP.annotation.gtf",
    write = TRUE,
    jsonFile = "results/tximeta/GRCm38.p6.EGFP.json"
)

However when I use tximeta an error from the ensembldb package stops the GTF file from being imported:

loadLinkedTxome("results/tximeta/GRCm38.p6.EGFP.json")
dat <- data.frame(files =  "results/salmon/alevin/sample1/alevin/quants_mat.gz", names = "sample1", stringsAsFactors = FALSE)
rse <- tximeta(dat, type = "alevin")
importing quantifications
importing alevin data is much faster after installing `fishpond` (>= 1.1.18)
reading in alevin gene-level counts across cells
found matching linked transcriptome:
[ Ensembl - Mus musculus - release 99 ]
building EnsDb with 'ensembldb' package
Importing GTF file ... OK
Error in .checkExtractVersions(gtf, organism, genomeVersion, version) :
  The file name does not match the expected naming scheme of Ensembl files (<organism>.<genome version>.<Ensembl version>) hence I cannot extract any information from it! Parameters 'organism', 'genomeVersion' and 'version' are thus required!

The GTF file I am using is not named according to the conventional Ensembl naming scheme. This is because the GTF file contains a number of additional technical sequences and I would like the filename to reflect these additions. Although these attributes cannot be parsed from the GTF filename, they can be found in the linkedTxome JSON file and could be provided directly to the various ensDbFrom* functions when needed?

# Edit

I am able to import the read counts when I change the source attribute in the linkedTxome from "Ensembl" to a string which isn't recognized i.e. "Custom". However the ranges generated from this method only include _geneid as a column whereas I expected all of the attributes in the GTF file to be imported, especially _genename. Should I revert back to tximport for this particular use-case?

mikelove commented 4 years ago

Thanks for reporting this. Can you provide some code that would enable import using ensembldb functions?

Another option would be to ask the ensembldb maintainer @jorainer how this case should be handled as they have constructed it to only import GTF as provided by Ensembl I suppose. We could do something inside tximeta to workaround or they may have something else in mind.

jma1991 commented 4 years ago

Here is my attempt at parsing the index data and providing this to the ensembldb function:

pkg <- c("ensembldb", "jsonlite")

lib <- lapply(pkg, library, character.only = TRUE)

idx <- fromJSON("results/tximeta/GRCm38.p6.EGFP.json")

sql <- ensDbFromGtf(
    gtf = "results/genomepy/GRCm38.p6.EGFP/GRCm38.p6.EGFP.annotation.gtf",
    outfile = tempfile(),
    organism = idx$organism,
    genomeVersion = idx$genome,
    version = idx$release
)

"""
Importing GTF file ... OK
Processing metadata ... OK
Processing genes ...
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ...
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
Warning messages:
1: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
2: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion,  :
  Could not determine length for all seqnames.
3: In UseMethod("seq") :
  closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-99/mysql/)
"""

edb <- EnsDb(sql)

The second warning is produced because it can't find the sequence length for the technical sequence I added to the transcriptome. Apart from that, it builds the ensembl database just fine and I seem to be able to use all the usual accessors to pull information (including all the attributes from the file I expected - gene_id, gene_biotype, and gene_name)

Alternatively, it might be simpler to just create a TxDb database from the GTF file and just tell tximeta that you are using a custom annotation and therefore do not bother specifying the original source (e.g. Ensembl) of the gene annotations. However, creating a TxDb object using makeTxDbFromGtf drops all the attribute information so it's not that helpful either.

Have you got a pipeline or function to add more annotations to the tximeta database? I wouldn't mind helping to update the database with special cases transcriptomes. For scRNA-seq I imagine the most common use cases are a combination of transcriptome, spike-in, and reporter?

mikelove commented 4 years ago

Thanks James!

I'd like to now try and incorporate your code into my ensDbFromGtf inside tximeta. I think I will look into some kind of switch that determines that we are working with a GTF that needs manual specification of organism, genomeVersion and version, but I can just keep track that this is a linkedTxome, for example. I'm traveling next week but will put on my list of TODOs for tximeta.

jorainer commented 4 years ago

Nice solution @jma1991 !

In general, if possible, I would however suggest to avoid the ensDbFromGtf as much as possible: it lacks some annotations and the GTF file format is not that stable. I had to tweak the function already because Ensembl changed the format at some point and I got errors importing/extracting data from it. The best solutions, at least for standard releases, is to use the pre-built EnsDb databases that are available in AnnotationHub. @mikelove , let me know if you need some hand there (if you haven't already implemented that).

mikelove commented 4 years ago

Thanks for the feedback @jorainer

re: AHub, yes, in devel, tximeta will now use the pre-built EnsDb's.

re: this particular case, we may need to use ensDbFromGtf, because the reference txome is customized. It is the GTF from Ensembl plus some some additional sequences, which are added to the FASTA and to the GTF, and then @jma1991 is building a linkedTxome so that future users will be able to verify that they have the exact FASTA and GTF that describes the txome used for quantification. @jma1991's proposal seems like the best case, which I will incorporate once I get a chance.

mikelove commented 4 years ago

@jma1991:

Can you try this new version of tximeta? I've split the EnsDb building step based on whether it is a linkedTxome:

0968dfbf63d6ae92d52122830450af238b5453ea

E.g. you can do:

devtools::install_github("mikelove/tximeta", dependencies=FALSE)

mikelove commented 4 years ago

I think this is fixed, closing

thelovelab / tximeta

Modified Ensembl GTF parsing error #29