thelovelab / tximeta

Transcript quantification import with automatic metadata detection
https://thelovelab.github.io/tximeta/
64 stars 11 forks source link

couldn't find matching transcriptome, returning non-ranged SummarizedExperiment (alignment mode) #72

Closed errcricket closed 1 year ago

errcricket commented 1 year ago

Greetings,

I am following this vignette: https://bioconductor.org/packages/release/bioc/vignettes/tximeta/inst/doc/tximeta.html#Pre-computed_checksums.

I am trying to use tximeta on data from salmon's alignment-based mode and I am getting a couldn't find matching transcriptome, returning non-ranged SummarizedExperiment error.

I believe I am using the most recent versions of Bioconductor and Tximeta. I am using tximeta 1.16.0 and BiocVersion 3.16.

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.6 (Ootpa)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblaso-r0.3.15.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocGenerics_0.44.0 tximeta_1.16.0     

loaded via a namespace (and not attached):
  [1] ProtGenerics_1.30.0           bitops_1.0-7                  matrixStats_0.63.0            bit64_4.0.5                  
  [5] filelock_1.0.2                progress_1.2.2                httr_1.4.4                    GenomeInfoDb_1.34.3          
  [9] tools_4.2.1                   utf8_1.2.2                    R6_2.5.1                      DBI_1.1.3                    
 [13] lazyeval_0.2.2                withr_2.5.0                   tidyselect_1.2.0              prettyunits_1.1.1            
 [17] bit_4.0.5                     curl_4.3.3                    compiler_4.2.1                cli_3.4.1                    
 [21] Biobase_2.58.0                xml2_1.3.3                    DelayedArray_0.24.0           rtracklayer_1.58.0           
 [25] readr_2.1.3                   rappdirs_0.3.3                stringr_1.4.1                 digest_0.6.30                
 [29] Rsamtools_2.14.0              XVector_0.38.0                pkgconfig_2.0.3               htmltools_0.5.3              
 [33] MatrixGenerics_1.10.0         dbplyr_2.2.1                  fastmap_1.1.0                 ensembldb_2.20.2             
 [37] rlang_1.0.6                   rstudioapi_0.14               RSQLite_2.2.18                shiny_1.7.3                  
 [41] BiocIO_1.8.0                  generics_0.1.3                jsonlite_1.8.3                vroom_1.6.0                  
 [45] BiocParallel_1.32.1           dplyr_1.0.10                  RCurl_1.98-1.9                magrittr_2.0.3               
 [49] GenomeInfoDbData_1.2.9        Matrix_1.5-3                  Rcpp_1.0.9                    S4Vectors_0.36.0             
 [53] fansi_1.0.3                   lifecycle_1.0.3               stringi_1.7.8                 yaml_2.3.6                   
 [57] SummarizedExperiment_1.28.0   zlibbioc_1.44.0               BiocFileCache_2.4.0           AnnotationHub_3.4.0          
 [61] grid_4.2.1                    blob_1.2.3                    parallel_4.2.1                promises_1.2.0.1             
 [65] crayon_1.5.2                  lattice_0.20-45               Biostrings_2.66.0             GenomicFeatures_1.50.2       
 [69] hms_1.1.2                     KEGGREST_1.38.0               pillar_1.8.1                  GenomicRanges_1.50.1         
 [73] rjson_0.2.21                  codetools_0.2-18              biomaRt_2.52.0                stats4_4.2.1                 
 [77] XML_3.99-0.12                 glue_1.6.2                    BiocVersion_3.16.0            BiocManager_1.30.19          
 [81] tzdb_0.3.0                    png_0.1-7                     vctrs_0.5.1                   httpuv_1.6.6                 
 [85] purrr_0.3.5                   assertthat_0.2.1              cachem_1.0.6                  mime_0.12                    
 [89] xtable_1.8-4                  restfulr_0.0.15               AnnotationFilter_1.22.0       later_1.3.0                  
 [93] tibble_3.1.8                  GenomicAlignments_1.34.0      AnnotationDbi_1.60.0          memoise_2.0.1                
 [97] IRanges_2.32.0                tximport_1.26.0               ellipsis_0.3.2                interactiveDisplayBase_1.36.0

How I generated my data from paired fastq files...

# define variables
REFERENCE_ENSEMBLE_FASTA="reference_data/mouse/GRCm39/fasta/Mus_musculus.GRCm39.dna.primary_assembly.fa" 
REFERENCE_ENSEMBLE_TRANSCRIPT="reference_data/mouse/GRCm39/fasta/Mus_musculus.GRCm39.cdna.all.fa" 
REFERENCE_ENSEMBLE_CHR_GTF="reference_data/mouse/GRCm39/annotations/Mus_musculus.GRCm39.108.chr.gtf"
GFFREAD_TRANSCRIPT_FILE_CHR="reference_data/mouse/GRCm39/fasta/Mus_musculus.GRCm39.108.chr_GFFREAD.fa"
SALMON_TRANSCRIPTS_INDEX="reference_data/mouse/GRCm39/salmon_transcripts_index/"

# create transcript file
gffread -w $GFFREAD_TRANSCRIPT_FILE_CHR -g $REFERENCE_ENSEMBLE_FASTA $REFERENCE_ENSEMBLE_CHR_GTF

# create star index folder
STAR --runThreadN 20 --runMode genomeGenerate --genomeDir $STAR_INDEX_DIR_CHR --genomeFastaFiles $REFERENCE_ENSEMBLE_FASTA --sjdbGTFfile $REFERENCE_ENSEMBLE_CHR_GTF --sjdbOverhang 75

# align each fastq files (paired) to reference, make bam files
STAR --runThreadN 18 --genomeDir $STAR_INDEX_DIR_CHR --readFilesIn $fq1 $fq2 --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts --readFilesCommand zcat --outFileNamePrefix $aligned_sample_dir_chr --outReadsUnmapped Fastx

#create Salmon index
salmon index -t $REFERENCE_ENSEMBLE_FASTA -i $SALMON_TRANSCRIPTS_INDEX --decoys decoys.txt -k 31 -p 12

# quantify aligned bam files with salmon
salmon quant -t $GFFREAD_TRANSCRIPT_FILE_CHR -l A -p 10 -a $aligned_sample_dir_chr"Aligned.toTranscriptome.out.bam" -o $salmon_quant
suppressPackageStartupMessages(library('BiocVersion', lib.loc='/home/reichenbee/franco_lab/tools/R421'))

#reading in sample information
csvfile <- read.csv('experimental_groups.txt', header=T, sep='\t')

coldata <- csvfile
coldata$files <- file.path('salmon_quant_CHR', coldata$sample, 'quant.sf') #add file names/location
colnames(coldata)[1] <- 'names'
coldata <- coldata[14, ] # take one sample
coldata

#   names  group genotype                quant                         files
# 14   B69 cancer     XXXX salmon_quant_CHR/B69 salmon_quant_CHR/B69/quant.sf

se <- tximeta(coldata)

# importing quantifications
# reading in files with read_tsv
# 1 
# couldn't find matching transcriptome, returning non-ranged SummarizedExperiment

Based on this older post: https://github.com/mikelove/tximeta/issues/38, I went into one of the cmd_info.json files and placed "index" and path to the salmon index directory inside.

{
    "salmon_version": "1.9.0",
    "targets": "reference_data/mouse/GRCm39/fasta/Mus_musculus.GRCm39.108.chr_GFFREAD.fa",
    "libType": "A",
    "threads": "10",
    "alignments": "star_aligned_sample_files_CHR/B69/Aligned.toTranscriptome.out.bam",
    "output": "salmon_quant_CHR/B69/",
    "auxDir": "aux_info",
    "index": "salmon_transcripts_index/"
}

I re-ran it and received the same error message.

I did try adding hash strings as suggested, but I was not really sure how to organize them inside the aux_info/meta_info.json file and got errors like this:

Error in parse_con(txt, bigint_as_char) : 
  parse error: invalid object key (must be a string)
          bc4deb376dcf88a2e45878911e", } 
                     (right here) ------^

Aside from using an incorrect reference file, I am out of error origin ideas. Any information that can resolve this error is greatly appreciated.

mikelove commented 1 year ago

So if I'm reading correctly, the Ensembl-based index is not being used during quantification. I'm not sure it's a good idea to hack the metadata in this way. It looks like you want tximeta to see one index while it's not being used for quantification. This kind of breaks the idea of intrinsically connecting quantification output to the provenance of the metadata used for quantification. Potentially the downstream functions of tximeta won't be guaranteed to work or give correct output.

Maybe I'm misreading the code though.

I think an easier way to achieve your desired result would be skipMeta=TRUE which gives you an SE, and then just manually add the rowRanges() <- with the GRanges of transcripts?

errcricket commented 1 year ago

Thank you for the speedy response!

I doubt you are misreading my code. I have never used Tximeta before and I am using my own data while trying to work through the vignette (https://www.bioconductor.org/packages/devel/bioc/vignettes/tximeta/inst/doc/tximeta.html).

You are correct that the Ensembl-based index is not being used during quantification (as in mentioned in the alignment-based mode of Salmon).

I reverted back to the original json files and adding skipMeta=TRUE allowed the command to run. Thank you.

I am not sure how to handle your rowRanges() comment. Is there a way to automate this for alignment-based data so that I can continue to work through the vignette/tutorial?

mikelove commented 1 year ago

So, re: rowRanges(), this is entirely optional. This is if you want to have GRanges associated with your rows, but you don't need to have this for 99% of Bioconductor workflows, e.g. DE or gene set enrichment.

We automate this when we recognize the index used for quantification, but it's just a "nice to have", you can actually just skip this and work with the SE data.

errcricket commented 1 year ago

Thank you Mike.