mritchielab / FLAMES

A framework for performing single-cell and bulk read full-length analysis of mutations and splicing.
https://mritchielab.github.io/FLAMES/
GNU General Public License v3.0
14 stars 9 forks source link

Error: _Map_base::at #31

Open nick-youngblut opened 5 months ago

nick-youngblut commented 5 months ago

During the Reading Gene Annotations step in the sc_long_pipeline() workflow, I'm getting a _Map_base::at error.

I'm using a decompressed fastq for the BLAZE output (ran that stand-alone). I'm using GCF_000001405.40_GRCh38.p14 for the reference: (GCF_000001405.40_GRCh38.p14_genomic.fna.gz and GCF_000001405.40_GRCh38.p14_genomic.gtf.gz).

My config:

config_file = FLAMES::create_config(
  config_dir,
  type = "sc_3end",
  do_barcode_demultiplex = FALSE,
  threads = 12
)

My sc_long_pipeline() job:

sce = FLAMES::sc_long_pipeline(
    fastq = fastq_input,
    genome_fa = ref_fna_input,
    annotation = ref_annot_input,
    outdir = work_dir,
    config = config_file,
    minimap2 = minimap2_path,
    k8 = k8_path,
    expect_cell_number = 8000
)

My console output:

Skipping Demultiplexing step...
Please make sure the ` /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq `` is the the demultiplexing output from previous FLAEMS call.
Running FLAMES pipeline...
#### Input parameters:
{
  "pipeline_parameters": {
    "seed": [2022],
    "threads": [12],
    "do_barcode_demultiplex": [false],
    "do_gene_quantification": [true],
    "do_genome_alignment": [true],
    "do_isoform_identification": [true],
    "bambu_isoform_identification": [false],
    "multithread_isoform_identification": [true],
    "do_read_realignment": [true],
    "do_transcript_quantification": [true]
  },
  "barcode_parameters": {
    "max_bc_editdistance": [2],
    "max_flank_editdistance": [8],
    "pattern": {
      "primer": ["CTACACGACGCTCTTCCGATCT"],
      "BC": ["NNNNNNNNNNNNNNNN"],
      "UMI": ["NNNNNNNNNNNN"],
      "polyT": ["TTTTTTTTT"]
    },
    "TSO_seq": ["CCCATGTACTCTGCGTTGATACCACTGCTT"],
    "TSO_prime": [3],
    "full_length_only": [false]
  },
  "isoform_parameters": {
    "generate_raw_isoform": [false],
    "max_dist": [10],
    "max_ts_dist": [100],
    "max_splice_match_dist": [10],
    "min_fl_exon_len": [40],
    "max_site_per_splice": [3],
    "min_sup_cnt": [5],
    "min_cnt_pct": [0.001],
    "min_sup_pct": [0.2],
    "bambu_trust_reference": [true],
    "strand_specific": [0],
    "remove_incomp_reads": [4],
    "downsample_ratio": [1]
  },
  "alignment_parameters": {
    "use_junctions": [true],
    "no_flank": [false]
  },
  "realign_parameters": {
    "use_annotation": [true]
  },
  "transcript_counting": {
    "min_tr_coverage": [0.4],
    "min_read_coverage": [0.4]
  }
} 
gene annotation: /home/rstudio/workspace//data/references/human/FLAMES/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz 
genome fasta: /home/rstudio/workspace//data/references/human/FLAMES/GCF_000001405.40_GRCh38.p14_genomic.fna.gz 
input fastq: /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq 
output directory: /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//flames 
minimap2 path: /home/rstudio/miniconda3/bin/minimap2 
k8 path: /home/rstudio/miniconda3/bin/k8 
#### Aligning reads to genome using minimap2
02:49:30 PM Fri Apr 12 2024 minimap2_align
[M::mm_idx_gen::52.772*2.11] collected minimizers
[M::mm_idx_gen::59.234*3.06] sorted minimizers
[M::main::59.234*3.06] loaded/built the index for 705 target sequence(s)
[M::mm_mapopt_update::60.414*3.02] mid_occ = 2177
[M::mm_idx_stat] kmer size: 14; skip: 5; is_hpc: 0; #seq: 705
[M::mm_idx_stat::61.052*3.00] distinct minimizers: 65648619 (22.00% are singletons); average occurrences: 16.258; average spacing: 3.090; total length: 3298430636
[M::worker_pipeline::305.468*9.91] mapped 482699 sequences
[M::worker_pipeline::539.678*10.81] mapped 485246 sequences
[M::worker_pipeline::772.824*11.17] mapped 494113 sequences
[M::worker_pipeline::1008.325*11.36] mapped 492944 sequences
[M::worker_pipeline::1250.776*11.48] mapped 487522 sequences
[M::worker_pipeline::1515.925*11.57] mapped 468694 sequences
[M::worker_pipeline::1778.447*11.64] mapped 468982 sequences
[M::worker_pipeline::2041.090*11.68] mapped 468921 sequences
[M::worker_pipeline::2305.636*11.72] mapped 468068 sequences
[M::worker_pipeline::2566.991*11.75] mapped 469711 sequences
[M::worker_pipeline::2814.316*11.77] mapped 476070 sequences
[M::worker_pipeline::3044.860*11.79] mapped 482292 sequences
[M::worker_pipeline::3277.376*11.80] mapped 482075 sequences
[M::worker_pipeline::3507.651*11.81] mapped 482633 sequences
[M::worker_pipeline::3738.925*11.82] mapped 482293 sequences
[M::worker_pipeline::3910.235*11.82] mapped 377902 sequences
[M::main] Version: 2.28-r1209
[M::main] CMD: /home/rstudio/miniconda3/bin/minimap2 -ax splice -t 12 -k14 --secondary=no --seed 2022 --junc-bed /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//flames/tmp_splice_anno.bed12 --junc-bonus 1 /home/rstudio/workspace//data/references/human/FLAMES/GCF_000001405.40_GRCh38.p14_genomic.fna.gz /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq
[M::main] Real time: 3910.908 sec; CPU: 46208.963 sec; Peak RSS: 21.147 GB
[bam_sort_core] merging from 17 files and 1 in-memory blocks...
04:02:56 PM Fri Apr 12 2024 Start gene quantification and UMI deduplication
04:02:56 PM Fri Apr 12 2024 quantify genes 
Found genome alignment file(s):     align2genome.bam
Connected to your session in progress, last started 2024-Apr-12 14:48:40 UTC (1 hour ago)
Assigning reads to genes...
Processed: 100%|██████████| 40375/40375 [02:41<00:00, 249.45gene_group/s]
Finalising the gene count matrix ...
Plotting the saturation curve ...
Generating deduplicated fastq file ...
Processed: 7572500.0Read [01:27, 87015.60Read/s] 
04:14:35 PM Fri Apr 12 2024 Gene quantification and UMI deduplication done!
04:14:35 PM Fri Apr 12 2024 Start isoform identificaiton
04:14:35 PM Fri Apr 12 2024 find_isoform
#### Reading Gene Annotations
Error: _Map_base::at

I'm using the patched version of FLAMES from https://github.com/mritchielab/FLAMES/issues/26.

My sessionInfo:

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] FLAMES_1.9.2

loaded via a namespace (and not attached):
  [1] BiocIO_1.12.0               bitops_1.0-7                filelock_1.0.2             
  [4] tibble_3.2.1                R.oo_1.25.0                 basilisk.utils_1.14.1      
  [7] bambu_3.4.0                 graph_1.80.0                XML_3.99-0.14              
 [10] rpart_4.1.19                lifecycle_1.0.3             edgeR_4.0.16               
 [13] doParallel_1.0.17           OrganismDbi_1.44.0          globals_0.16.2             
 [16] lattice_0.21-8              ensembldb_2.26.0            MultiAssayExperiment_1.28.0
 [19] backports_1.4.1             magrittr_2.0.3              limma_3.58.1               
 [22] Hmisc_5.1-1                 rmarkdown_2.25              yaml_2.3.7                 
 [25] metapod_1.10.1              reticulate_1.34.0           ggbio_1.50.0               
 [28] cowplot_1.1.1               DBI_1.1.3                   RColorBrewer_1.1-3         
 [31] abind_1.4-5                 zlibbioc_1.48.2             GenomicRanges_1.54.1       
 [34] purrr_1.0.2                 R.utils_2.12.3              AnnotationFilter_1.26.0    
 [37] biovizBase_1.50.0           BiocGenerics_0.48.1         RCurl_1.98-1.12            
 [40] nnet_7.3-19                 VariantAnnotation_1.48.1    rappdirs_0.3.3             
 [43] circlize_0.4.15             GenomeInfoDbData_1.2.11     IRanges_2.36.0             
 [46] S4Vectors_0.40.2            ggrepel_0.9.4               irlba_2.3.5.1              
 [49] listenv_0.9.0               dqrng_0.3.1                 parallelly_1.36.0          
 [52] DelayedMatrixStats_1.24.0   codetools_0.2-19            DropletUtils_1.22.0        
 [55] DelayedArray_0.28.0         scuttle_1.12.0              xml2_1.3.5                 
 [58] tidyselect_1.2.0            shape_1.4.6                 viridis_0.6.4              
 [61] ScaledMatrix_1.10.0         matrixStats_1.0.0           stats4_4.3.1               
 [64] BiocFileCache_2.10.2        base64enc_0.1-3             GenomicAlignments_1.38.2   
 [67] jsonlite_1.8.7              BiocNeighbors_1.20.2        GetoptLong_1.0.5           
 [70] Formula_1.2-5               scater_1.30.1               iterators_1.0.14           
 [73] foreach_1.5.2               tools_4.3.1                 progress_1.2.2             
 [76] Rcpp_1.0.11                 glue_1.7.0                  gridExtra_2.3              
 [79] SparseArray_1.2.4           xfun_0.40                   MatrixGenerics_1.14.0      
 [82] GenomeInfoDb_1.38.8         dplyr_1.1.4                 HDF5Array_1.30.1           
 [85] withr_2.5.1                 BiocManager_1.30.22         fastmap_1.1.1              
 [88] GGally_2.1.2                basilisk_1.14.3             bluster_1.12.0             
 [91] rhdf5filters_1.14.1         fansi_1.0.5                 rsvd_1.0.5                 
 [94] digest_0.6.33               R6_2.5.1                    colorspace_2.1-0           
 [97] dichromat_2.0-0.1           biomaRt_2.58.2              RSQLite_2.3.2              
[100] R.methodsS3_1.8.2           utf8_1.2.4                  tidyr_1.3.1                
[103] generics_0.1.3              data.table_1.14.8           rtracklayer_1.62.0         
[106] prettyunits_1.2.0           httr_1.4.7                  htmlwidgets_1.6.2          
[109] S4Arrays_1.2.1              pkgconfig_2.0.3             gtable_0.3.4               
[112] blob_1.2.4                  ComplexHeatmap_2.18.0       SingleCellExperiment_1.24.0
[115] XVector_0.42.0              htmltools_0.5.6.1           RBGL_1.78.0                
[118] ProtGenerics_1.34.0         clue_0.3-65                 scales_1.3.0               
[121] Biobase_2.62.0              png_0.1-8                   scran_1.30.2               
[124] knitr_1.44                  rstudioapi_0.15.0           reshape2_1.4.4             
[127] rjson_0.2.21                checkmate_2.3.0             curl_5.1.0                 
[130] cachem_1.0.8                rhdf5_2.46.1                GlobalOptions_0.1.2        
[133] stringr_1.5.1               vipor_0.4.5                 parallel_4.3.1             
[136] foreign_0.8-84              AnnotationDbi_1.64.1        restfulr_0.0.15            
[139] pillar_1.9.0                grid_4.3.1                  reshape_0.8.9              
[142] vctrs_0.6.4                 BiocSingular_1.18.0         dbplyr_2.4.0               
[145] beachmat_2.18.1             cluster_2.1.4               beeswarm_0.4.0             
[148] htmlTable_2.4.2             evaluate_0.22               GenomicFeatures_1.54.4     
[151] cli_3.6.1                   locfit_1.5-9.8              compiler_4.3.1             
[154] Rsamtools_2.18.0            rlang_1.1.1                 crayon_1.5.2               
[157] ggbeeswarm_0.7.2            plyr_1.8.9                  stringi_1.7.12             
[160] viridisLite_0.4.2           BiocParallel_1.36.0         munsell_0.5.0              
[163] Biostrings_2.70.3           lazyeval_0.2.2              Matrix_1.6-5               
[166] dir.expiry_1.10.0           BSgenome_1.70.2             hms_1.1.3                  
[169] sparseMatrixStats_1.14.0    bit64_4.0.5                 future_1.33.0              
[172] ggplot2_3.5.0               Rhdf5lib_1.24.2             KEGGREST_1.42.0            
[175] statmod_1.5.0               SummarizedExperiment_1.32.0 igraph_1.5.1               
[178] memoise_2.0.1               bit_4.0.5                   xgboost_1.7.5.1
nick-youngblut commented 5 months ago

Note: I'm using the gtf file instead of gff based on https://github.com/mritchielab/FLAMES/issues/27

ChangqingW commented 5 months ago

Seems to be a bug in the new Rcpp implementation. I have not been able to locate the bug, but you could edit the config file to either switch multithread_isoform_identification to false (to use the old python implementation) or switch bambu_isoform_identification to true (to use bambu instead)

nick-youngblut commented 5 months ago

If I use bambu_isoform_identification = TRUE, then I get the following error:

Error in check_arguments(annotation, fastq, genome_bam, outdir, genome_fa,  : 
  Bambu requires GTF format for annotation file.

I'm using a GTF file for the annotation input, but it is gzip'd, which I'm guessing is causing the error.

ChangqingW commented 4 months ago

Hmm, bambu can probably read gziped annotation but the custom parser in our legacy code won't. I'll specify that in the docs.

For the multithread_isoform_identification implementation I got a Error: unordered_map::at error with plain GTF but there seems to a ton of unordered_map calls in the functions and I cannot locate where the issue is at the moment

nick-youngblut commented 4 months ago

Moreover, I get the following error even when using an uncompressed gtf file:

[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip

It appears that the genome_fa must be bgzip'd or uncompressed. Do you think that you will just create uncompressed temporary files for these reference files, at least when the user selects bambu?

nick-youngblut commented 3 months ago

error with plain GTF but there seems to a ton of unordered_map calls in the functions and I cannot locate where the issue is at the moment

Any updates on this?

Same for the please use bgzip error: is it necessary for FLAMES to create a temporary uncompressed (or bgzip-compressed) file when the user selects bambu?

ChangqingW commented 3 months ago

Same for the please use bgzip error: is it necessary for FLAMES to create a temporary uncompressed (or bgzip-compressed) file when the user selects bambu?

Sounds reasonable, I think it assumes everything is uncompressed at the moment

maxim-h commented 2 months ago

Some printf debugging led me to believe the problem is coming from the get_gene_blocks function around here, because not all genes present in chr_to_gene map are present in the gene_dict map for some reason.

maxim-h commented 1 month ago

@ChangqingW So, in my case the issue is coming from the get_gene_blocks as I described before. The gene it's failing at is a small mitochondrial gene with only 1 transcript containing only 1 exon.

Looks like it's missing from gene_dict because it's also not present in the gene_to_transcript map after remove_similar_tr was ran. Pretty sure on this line:

https://github.com/mritchielab/FLAMES/blob/8510bdb9d40b87dcaa762947505f1be4560a0765/src/classes/junctions.cpp#L101

the whole outer loop iteration is skipped, including this block

https://github.com/mritchielab/FLAMES/blob/8510bdb9d40b87dcaa762947505f1be4560a0765/src/classes/junctions.cpp#L126-L133

that is supposed to add the gene to the new out_gene_to_transcript map.

I'm not quite sure if the intention behind that continue was to skip those genes entirely or to only skip the isoform comparison. If it's the former you'd need to add some condition downstream that skips the genes not present in gene_to_transcript and gene_dict, and if the latter you need to change the continue statement to only apply to the inner loop.

OliverVoogd commented 1 month ago

@maxim-h Thanks for your suggestions! It looks like remove_similar_tr was ignoring genes with only 1 transcript, which was not the intended behaviour. I've updated the relevant function in 6412113f5772cc9c2bff1a35f6f18a73ef2c89d1