thelovelab / tximeta

Transcript quantification import with automatic metadata detection
https://thelovelab.github.io/tximeta/

Error in checkAssays2Txps #66

Closed karlaarz closed 2 years ago

karlaarz commented 2 years ago

Hello,

I am trying to import some Salmon data using tximeta but I get the following error:

Error in checkAssays2Txps(assays, txps) : 
  none of the transcripts in the quantification files are in the GTF

My script is as follows:

indexDir <- file.path(dir, "salmon_idx_zebrafish")
fastaFTP <- "http://ftp.ensembl.org/pub/release-105/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz"
gtfPath <- "http://ftp.ensembl.org/pub/release-105/gtf/danio_rerio/Danio_rerio.GRCz11.105.gtf.gz"
makeLinkedTxome(indexDir=indexDir, source="Ensembl_FTP", organism="Danio rerio",
                release="105", genome="GRCz11", fasta=fastaFTP, gtf=gtfPath, write=FALSE)

se <- tximeta(coldata, type = "salmon", useHub=FALSE)
importing quantifications
reading in files with read_tsv
1 2 3 4 5
found matching linked transcriptome:
[ Ensembl_FTP - Danio rerio - release 105 ]
loading existing TxDb created: 2022-06-06 19:02:42
loading existing transcript ranges created: 2022-06-06 19:02:43
Error in checkAssays2Txps(assays, txps) :
 none of the transcripts in the quantification files are in the GTF

I am using tximeta v1.14.0 and R version 4.2.0 (2022-04-22).

Any help would be appreciated.

Thanks

mikelove commented 2 years ago

Can you post the first few lines of a quant.sf file and also a few lines from the Ensembl GTF? Sometimes Ensembl uses a slightly different naming scheme in the GTF than in the FASTA.

mikelove commented 2 years ago

Also in the meantime you can use skipMeta=TRUE if you don’t need the genomic ranges right now.

karlaarz commented 2 years ago

Hi Mike,

Sure:

head(quant)
                  Name Length EffectiveLength      TPM NumReads
1 ENSDART00000189431.1     11               1 0.000000        0
2 ENSDART00000189226.1     10               1 0.000000        0
3 ENSDART00000172037.2    344              94 0.000000        0
4 ENSDART00000165410.2    350             100 0.462361        1
5 ENSDART00000163675.2    339              89 0.000000        0
6 ENSDART00000172374.2    355             105 0.000000        0

head(gtf)
#!genome-build GRCz11
#!genome-version GRCz11
#!genome-date 2017-05
#!genome-build-accession GCA_000002035.4
#!genebuild-last-updated 2018-04
4   havana  gene    30402837    30403763    .   +   .   gene_id "ENSDARG00000103202"; gene_version "2"; gene_name "CR383668.1"; gene_source "havana"; gene_biotype "lincRNA";
4   havana  transcript  30402837    30403763    .   +   .   gene_id "ENSDARG00000103202"; gene_version "2"; transcript_id "ENSDART00000159919"; transcript_version "2"; gene_name "CR383668.1"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "CR383668.1-201"; transcript_source "havana"; transcript_biotype "lincRNA";
4   havana  exon    30402837    30402893    .   +   .   gene_id "ENSDARG00000103202"; gene_version "2"; transcript_id "ENSDART00000159919"; transcript_version "2"; exon_number "1"; gene_name "CR383668.1"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "CR383668.1-201"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSDARE00001204173"; exon_version "1";
4   havana  exon    30403203    30403350    .   +   .   gene_id "ENSDARG00000103202"; gene_version "2"; transcript_id "ENSDART00000159919"; transcript_version "2"; exon_number "2"; gene_name "CR383668.1"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "CR383668.1-201"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSDARE00001194706"; exon_version "1";
4   havana  exon    30403546    30403763    .   +   .   gene_id "ENSDARG00000103202"; gene_version "2"; transcript_id "ENSDART00000159919"; transcript_version "2"; exon_number "3"; gene_name "CR383668.1"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "CR383668.1-201"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSDARE00001199782"; exon_version "1"

head(fasta)
>ENSDART00000189431.1 cdna chromosome:GRCz11:2:36087769:36087779:1 gene:ENSDARG00000116509.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:BX681417.25
GATTGGGGTAC
>ENSDART00000189226.1 cdna chromosome:GRCz11:2:36088047:36088056:1 gene:ENSDARG00000116470.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:BX681417.24
TCTGGACTAC
>ENSDART00000172037.2 cdna chromosome:GRCz11:2:31866722:31867190:-1 gene:ENSDARG00000101672.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:trgv7 description:T cell receptor gamma variable 7 [Source:ZFIN;Acc:ZDB-GENE-051115-9]
ATGAGCCTTCAAATGATCTTGTTTTTCTTTCTTTTATATAGAGTTGATGGACAAGCGATG
CTGCGACAGAAAATATCCTCAACCAAATCTCAGGACAAGACTGTTGTCATAGACTGTGAT
TACCCTTCAGACTGTTATAGGTACATCCACTGGTACCAACTAAAAGGACAAACCTTAAAG
AGAATATTATATGCACAAATTTCAGGAGGAGAACCAGCCAGAGATGCTGGTTTTGAATTG
TTTAAAATAGACCGTAAACAGTCAAATATTGCTCTGAAAATACCTGAACTGAAAACAGAG

I see that there is a difference between the transcript names. The GTF doesn't have the transcript version that the FASTA and Salmon's output do.
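
For reference, the version is still present in the GTF, just as a separate transcript_version attribute rather than as a suffix on transcript_id. A minimal base R sketch of this, using the attribute column of the transcript line pasted above; the helper get_attr is hypothetical and written only for this illustration, it is not part of tximeta:

```r
# Attribute column of the GTF "transcript" line shown above (shortened)
attr_field <- paste0(
  'gene_id "ENSDARG00000103202"; gene_version "2"; ',
  'transcript_id "ENSDART00000159919"; transcript_version "2";'
)

# Hypothetical helper: pull the quoted value that follows a given attribute key.
# Returns NA if the key is not present.
get_attr <- function(x, key) {
  m <- regmatches(x, regexec(paste0(key, ' "([^"]*)"'), x))
  vapply(m, function(hit) if (length(hit) == 2) hit[2] else NA_character_,
         character(1))
}

tx_id      <- get_attr(attr_field, "transcript_id")
tx_version <- get_attr(attr_field, "transcript_version")

# Joining the two gives the naming style used in the FASTA and quant.sf
versioned_id <- paste0(tx_id, ".", tx_version)
print(versioned_id)
```

So the GTF names this transcript without the version, while the FASTA and Salmon output would carry it as a ".2" suffix.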

If I add the skipMeta=TRUE it works.

mikelove commented 2 years ago

I think if you specify ignoreTxVersion = TRUE it may also be able to connect the FASTA to GTF.

tximeta/tximport don't do any guessing of the matches (there are so many different sources, and we don't want to make a mistake by assuming any bit of an identifier is insignificant). But we do have some options to help deal with inconsistencies in the source files.
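
To illustrate why exact matching finds nothing here and what ignoring the version amounts to conceptually, here is a small base R sketch. It is not the tximeta implementation. The quant_ids come from the quant.sf excerpt in this thread; gtf_ids is an assumed vector showing how the GTF would name the same transcripts, and strip_version is a hypothetical helper:

```r
# Transcript names from the quant.sf excerpt (versioned)
quant_ids <- c("ENSDART00000189431.1", "ENSDART00000189226.1",
               "ENSDART00000172037.2")

# Assumed for illustration: unversioned IDs as the GTF would list them
gtf_ids <- c("ENSDART00000189431", "ENSDART00000189226",
             "ENSDART00000172037", "ENSDART00000159919")

# Exact string matching: no versioned name equals an unversioned one
n_exact <- sum(quant_ids %in% gtf_ids)

# Hypothetical helper: drop a trailing ".<digits>" version suffix
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# After stripping the suffix, the names line up
n_stripped <- sum(strip_version(quant_ids) %in% gtf_ids)

cat("exact matches:", n_exact, "\n")
cat("matches ignoring version:", n_stripped, "\n")
```

Stripping the suffix is only safe when the FASTA and GTF really come from the same release, which is why it is left as an explicit option rather than done automatically.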

karlaarz commented 2 years ago

Hi Mike, yes, with the ignoreTxVersion = TRUE and skipMeta=TRUE options added, it now runs smoothly. Thanks for the help!