thelovelab / tximeta

Transcript quantification import with automatic metadata detection
https://thelovelab.github.io/tximeta/
64 stars 11 forks source link

none of the transcripts in the quantification files are in the GTF #54

Closed ibekere closed 3 years ago

ibekere commented 3 years ago

Hi,

I'm using tximport linkedTxome to record information about the FASTA (transcriptome) and GTF files for further analysis of Salmon quantified files with DESeq2. I made an index based on Arabidopsis thaliana transcriptome and then quantified reads with Salmon, which went fine. When importing quantifications with tximeta I got an error:

se <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 
couldn't find matching transcriptome, returning non-ranged SummarizedExperiment 

So I used _linkedTxome_:
indexDir <- "athal_index/"
gtfFTP <- "ftp://ftp.ensemblgenomes.org/pub/plants/release-49/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.49.gtf.gz"
fastaFTP <- "ftp://ftp.ensemblgenomes.org/pub/plants/release-49/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz"

makeLinkedTxome(indexDir=indexDir,
                source="Ensembl",
                organism="Arabidopsis thaliana",
                release="49",
                genome="TAIR10",
                fasta=fastaFTP,
                gtf=gtfFTP,
                write=TRUE)

se <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 
found matching linked transcriptome:
[ Ensembl - Arabidopsis thaliana - release 49 ]
useHub=TRUE: checking for EnsDb via 'AnnotationHub'
snapshotDate(): 2020-10-27
did not find matching EnsDb via 'AnnotationHub'
building EnsDb with 'ensembldb' package
Importing GTF file ... trying URL 'ftp://ftp.ensemblgenomes.org/pub/plants/release-49/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.49.gtf.gz'
downloaded 9.5 MB

OK
Processing metadata ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
generating transcript ranges
Error in checkAssays2Txps(assays, txps) : 
  none of the transcripts in the quantification files are in the GTF
In addition: Warning messages:
1: In for (j in seq_along(x)) { :
  closing unused connection 7 (ftp://ftp.ensemblgenomes.org/pub/release-49/plants/mysql/)
2: In for (j in seq_along(x)) { :
  closing unused connection 6 (ftp://ftp.ensemblgenomes.org/pub/release-49/metazoa/mysql/)
3: In for (j in seq_along(x)) { :
  closing unused connection 5 (ftp://ftp.ensemblgenomes.org/pub/release-49/fungi/mysql/)
4: In for (j in seq_along(x)) { :
  closing unused connection 4 (ftp://ftp.ensemblgenomes.org/pub/release-49/bacteria/mysql/)
5: In for (j in seq_along(x)) { :
  closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-49/mysql/)

The error says that none of the transcripts in the quantification files are in the GTF, however that does not seem to be the case as I find matching transcript IDs in GTF and quant.sf files: GTF file:

1   araport11   gene    3631    5899    .   +   .   gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1   araport11   transcript  3631    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

quant.sf file:

Name    Length  EffectiveLength TPM NumReads
AT1G19090.1 2509    2317.387    0.000000    0.000
AT1G18320.1 699 507.402 0.926845    5.123
AT5G11100.1 1709    1517.387    2.540949    42.000
AT4G35335.1 1280    1088.387    12.736102   151.000
AT1G60930.1 3876    3684.387    0.597983    24.000

Any ideas how to solve this?

mikelove commented 3 years ago

Thanks for the report.

Can you try again, changing the source to a string other than Ensembl when you run makeLinkedTxome?

I think a rule we have in tximeta for processing Ensembl transcripts for other organisms doesn't work for Arabidopsis thaliana.

ibekere commented 3 years ago

I tried RefSeq as the source and got a different error:

indexDir <- "athal_index/" gtfFTP <- "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Arabidopsis_thaliana/latest_assembly_versions/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.gtf.gz" fastaFTP <- "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Arabidopsis_thaliana/latest_assembly_versions/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_rna.fna.gz"

makeLinkedTxome(indexDir=indexDir, source="RefSeq", organism="Arabidopsis thaliana", release="GCF_000001735.4", genome="TAIR10.1", fasta=fastaFTP, gtf=gtfFTP, write=TRUE)

se <- tximeta(coldata)

importing quantifications reading in files with read_tsv 1 found matching linked transcriptome: [ RefSeq - Arabidopsis thaliana - release GCF_000001735.4 ] Error in getTxDb(txomeInfo, useHub = useHub) : object 'hubWorked' not found

mikelove commented 3 years ago

Oh sorry, I meant can you use a string different than source, e.g.:

makeLinkedTxome(indexDir=indexDir,
source="Ensembl_FTP",
organism="Arabidopsis thaliana",
release="49",
genome="TAIR10",
fasta=fastaFTP,
gtf=gtfFTP,
write=TRUE)

The RefSeq error was something else (I just fixed and pushed to github and Bioc devel branch.

JudithR commented 3 years ago

I tried this to see if it would fix my issue, but I get the same error as ibekere:

se_s1 <- tximeta(samples_s1, useHub=FALSE)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
found matching linked transcriptome:
[ Ensembl_test - Parus major - release 97 ]
Error in getTxDb(txomeInfo, useHub = useHub) :
  object 'hubWorked' not found
> se_s1 <- tximeta(samples_s1)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
found matching linked transcriptome:
[ Ensembl_test - Parus major - release 97 ]
Error in getTxDb(txomeInfo, useHub = useHub) :
  object 'hubWorked' not found
mikelove commented 3 years ago

Can you report what version of tximeta you are using?

JudithR commented 3 years ago

Apologies, forgot to link to my other post. R 4.01, tximeta_1.8.4, Bioconductor version 3.12 (BiocManager 1.30.12)

mikelove commented 3 years ago

Ah, I fixed this in devel but not in release.

Can you see if the devel branch works for you:

devtools::install_github("mikelove/tximeta")

I'll also push this fix to release in the meantime.

JudithR commented 3 years ago

devel works. It makes the transcriptome cache in /tmp ;-)

mikelove commented 3 years ago

Great, and FWIW 1.8.5 is pushed to release and should appear in 1-2 days.