mritchielab / FLAMES

A framework for performing single-cell and bulk read full-length analysis of mutations and splicing.
GNU General Public License v3.0

find_isoform error #33

Open nick-youngblut opened 1 month ago

nick-youngblut commented 1 month ago

As a test, I'm using just 3 input FASTQ files with 10k reads each.

My workflow:

config_file = FLAMES::create_config(outdir, type = "sc_3end", do_barcode_demultiplex = TRUE)

sce = sc_long_pipeline(
    fastq = fastq_dir, 
    annotation = ref_gtf_file, 
    genome_fa = ref_fasta_file,
    outdir = outdir, 
    config_file = config_file, 
    expect_cell_number = 8000
)

The error:

'OR4G4P'

Traceback:

1. sc_long_pipeline(fastq = fastq_dir, annotation = ref_gtf_file, 
 .     genome_fa = ref_fasta_file, outdir = outdir, config_file = config_file, 
 .     expect_cell_number = 8000)
2. find_isoform(annotation, genome_fa, genome_bam, outdir, config)
3. find_isoform_flames(annotation, genome_fa, genome_bam, outdir, 
 .     config)
4. basiliskRun(env = flames_env, fun = function(gff3, genome, iso, 
 .     tss, fa, tran, ds, conf, raw) {
 .     python_path <- system.file("python", package = "FLAMES")
 .     find <- reticulate::import_from_path("find_isoform", python_path)
 .     ret <- find$find_isoform(gff3, genome, iso, tss, fa, tran, 
 .         ds, conf, raw)
 .     ret
 . }, gff3 = annotation, genome = genome_bam, iso = file.path(outdir, 
 .     "isoform_annotated.gff3"), tss = file.path(outdir, "tss_tes.bedgraph"), 
 .     fa = genome_fa, tran = file.path(outdir, "transcript_assembly.fa"), 
 .     ds = config$isoform_parameters$downsample_ratio, conf = config, 
 .     raw = ifelse(config$isoform_parameters$generate_raw_isoform, 
 .         file.path(outdir, "splice_raw.gff3"), FALSE))
5. fun(...)
6. find$find_isoform(gff3, genome, iso, tss, fa, tran, ds, conf, 
 .     raw)
7. py_call_impl(callable, call_args$unnamed, call_args$named)
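
A side note on the message itself: a bare quoted string such as 'OR4G4P' is how Python renders a KeyError, so one plausible reading is that find_isoform looked up a gene name in a dictionary that does not contain it. That is an assumption, since the traceback does not show the exception type. The sketch below reproduces only the rendering; `gene_to_transcripts` and `lookup` are hypothetical names, not FLAMES code.

```python
# Hypothetical stand-in for a gene -> transcripts mapping; not FLAMES code.
gene_to_transcripts = {"GENE_A": ["TX_A1", "TX_A2"]}


def lookup(gene_id):
    """Return the transcripts for gene_id, raising KeyError if it is absent."""
    return gene_to_transcripts[gene_id]


try:
    lookup("OR4G4P")
except KeyError as err:
    # str() of a KeyError is the repr of the missing key, quotes included,
    # which matches the bare quoted string shown in the error above.
    message = str(err)
    print(message)
```

If that reading is right, checking whether OR4G4P appears consistently in the GTF (gene and transcript records) might be a reasonable first step.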

My references:

sessionInfo:

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS/LAPACK: /home/nickyoungblut/miniforge3/envs/flames/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] FLAMES_1.8.0 tidyr_1.3.1  dplyr_1.1.4 

loaded via a namespace (and not attached):
  [1] pbdZMQ_0.3-11               BiocIO_1.12.0              
  [3] bitops_1.0-7                filelock_1.0.3             
  [5] tibble_3.2.1                R.oo_1.26.0                
  [7] basilisk.utils_1.14.1       bambu_3.4.0                
  [9] graph_1.80.0                XML_3.99-0.16.1            
 [11] rpart_4.1.23                lifecycle_1.0.4            
 [13] edgeR_4.0.16                doParallel_1.0.17          
 [15] OrganismDbi_1.44.0          ensembldb_2.26.0           
 [17] globals_0.16.3              lattice_0.22-6             
 [19] MultiAssayExperiment_1.28.0 backports_1.4.1            
 [21] magrittr_2.0.3              rmarkdown_2.26             
 [23] limma_3.58.1                Hmisc_5.1-2                
 [25] yaml_2.3.8                  metapod_1.10.0             
 [27] reticulate_1.36.1           ggbio_1.50.0               
 [29] cowplot_1.1.3               DBI_1.2.2                  
 [31] RColorBrewer_1.1-3          abind_1.4-5                
 [33] zlibbioc_1.48.0             GenomicRanges_1.54.1       
 [35] purrr_1.0.2                 R.utils_2.12.3             
 [37] AnnotationFilter_1.26.0     biovizBase_1.50.0          
 [39] BiocGenerics_0.48.1         RCurl_1.98-1.14            
 [41] nnet_7.3-19                 VariantAnnotation_1.48.1   
 [43] rappdirs_0.3.3              circlize_0.4.16            
 [45] GenomeInfoDbData_1.2.11     IRanges_2.36.0             
 [47] S4Vectors_0.40.2            ggrepel_0.9.5              
 [49] irlba_2.3.5.1               listenv_0.9.1              
 [51] dqrng_0.3.2                 parallelly_1.37.1          
 [53] DelayedMatrixStats_1.24.0   codetools_0.2-20           
 [55] DropletUtils_1.22.0         DelayedArray_0.28.0        
 [57] scuttle_1.12.0              xml2_1.3.6                 
 [59] tidyselect_1.2.1            shape_1.4.6.1              
 [61] viridis_0.6.5               ScaledMatrix_1.10.0        
 [63] matrixStats_1.3.0           stats4_4.3.3               
 [65] BiocFileCache_2.10.1        base64enc_0.1-3            
 [67] GenomicAlignments_1.38.0    jsonlite_1.8.8             
 [69] BiocNeighbors_1.20.0        GetoptLong_1.0.5           
 [71] Formula_1.2-5               scater_1.30.1              
 [73] iterators_1.0.14            foreach_1.5.2              
 [75] tools_4.3.3                 progress_1.2.3             
 [77] Rcpp_1.0.12                 glue_1.7.0                 
 [79] gridExtra_2.3               SparseArray_1.2.2          
 [81] xfun_0.43                   MatrixGenerics_1.14.0      
 [83] GenomeInfoDb_1.38.1         IRdisplay_1.1              
 [85] HDF5Array_1.30.0            withr_3.0.0                
 [87] BiocManager_1.30.23         fastmap_1.1.1              
 [89] GGally_2.2.1                basilisk_1.14.1            
 [91] bluster_1.12.0              rhdf5filters_1.14.1        
 [93] fansi_1.0.6                 rsvd_1.0.5                 
 [95] digest_0.6.35               R6_2.5.1                   
 [97] colorspace_2.1-0            dichromat_2.0-0.1          
 [99] biomaRt_2.58.0              RSQLite_2.3.4              
[101] R.methodsS3_1.8.2           utf8_1.2.4                 
[103] generics_0.1.3              data.table_1.15.2          
[105] rtracklayer_1.62.0          prettyunits_1.2.0          
[107] httr_1.4.7                  htmlwidgets_1.6.4          
[109] S4Arrays_1.2.0              ggstats_0.6.0              
[111] pkgconfig_2.0.3             gtable_0.3.5               
[113] blob_1.2.4                  ComplexHeatmap_2.18.0      
[115] SingleCellExperiment_1.24.0 XVector_0.42.0             
[117] htmltools_0.5.8.1           RBGL_1.78.0                
[119] ProtGenerics_1.34.0         clue_0.3-65                
[121] scales_1.3.0                Biobase_2.62.0             
[123] png_0.1-8                   scran_1.30.0               
[125] knitr_1.46                  rstudioapi_0.16.0          
[127] reshape2_1.4.4              rjson_0.2.21               
[129] uuid_1.2-0                  checkmate_2.3.0            
[131] curl_5.1.0                  repr_1.1.7                 
[133] cachem_1.0.8                rhdf5_2.46.1               
[135] GlobalOptions_0.1.2         stringr_1.5.1              
[137] vipor_0.4.7                 parallel_4.3.3             
[139] foreign_0.8-86              AnnotationDbi_1.64.1       
[141] restfulr_0.0.15             pillar_1.9.0               
[143] grid_4.3.3                  vctrs_0.6.5                
[145] BiocSingular_1.18.0         dbplyr_2.5.0               
[147] beachmat_2.18.0             cluster_2.1.6              
[149] beeswarm_0.4.0              htmlTable_2.4.2            
[151] evaluate_0.23               GenomicFeatures_1.54.1     
[153] cli_3.6.2                   locfit_1.5-9.9             
[155] compiler_4.3.3              Rsamtools_2.18.0           
[157] rlang_1.1.3                 crayon_1.5.2               
[159] ggbeeswarm_0.7.2            plyr_1.8.9                 
[161] stringi_1.8.4               viridisLite_0.4.2          
[163] BiocParallel_1.36.0         munsell_0.5.1              
[165] Biostrings_2.70.1           lazyeval_0.2.2             
[167] Matrix_1.6-5                dir.expiry_1.10.0          
[169] IRkernel_1.3.2              BSgenome_1.70.1            
[171] hms_1.1.3                   sparseMatrixStats_1.14.0   
[173] bit64_4.0.5                 future_1.33.2              
[175] ggplot2_3.5.1               Rhdf5lib_1.24.0            
[177] KEGGREST_1.42.0             statmod_1.5.0              
[179] SummarizedExperiment_1.32.0 igraph_2.0.3               
[181] memoise_2.0.1               bit_4.0.5                  
[183] xgboost_2.0.3.1            

nick-youngblut commented 1 month ago

If I use 3 samples (fastq files) of 500k reads each, BLAZE dies:

{
    "name": "ERROR",
    "message": "generator raised StopIteration",
    "stack": "generator raised StopIterationTraceback:

1. sc_long_pipeline(fastq = fastq_dir, annotation = ref_gtf_file, 
 .     genome_fa = ref_fasta_file, outdir = outdir, config_file = config_file, 
 .     expect_cell_number = 8000)
2. blaze(expect_cell_number, fastq, `output-prefix` = paste0(outdir, 
 .     \"/\"), `output-fastq` = \"matched_reads.fastq\", threads = config$pipeline_parameters$threads, 
 .     `max-edit-distance` = config$barcode_parameters$max_bc_editdistance, 
 .     overwrite = TRUE)
3. basiliskRun(env = flames_env, fun = function(blaze_argv) {
 .     cat(\"Running BLAZE...\
\")
 .     cat(\"Argument: \", blaze_argv, \"\
\")
 .     blaze <- reticulate::import(\"blaze\")
 .     ret <- blaze$blaze(blaze_argv)
 .     ret
 . }, blaze_argv = blaze_argv)
4. fun(...)
5. blaze$blaze(blaze_argv)
6. py_call_impl(callable, call_args$unnamed, call_args$named)"
}
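
For context on the message: since Python 3.7 (PEP 479), a StopIteration that escapes from inside a generator body is converted to a RuntimeError with the text "generator raised StopIteration". A common trigger is an unguarded `next()` on an exhausted iterator, for example when a record is shorter than expected. The sketch below shows only that generic mechanism; `read_pairs` is a hypothetical function, and where this happens inside BLAZE cannot be told from the trace above.

```python
def read_pairs(lines):
    """Yield (header, sequence) pairs from an iterable of lines.

    Hypothetical illustration of the mechanism, not BLAZE code.
    """
    it = iter(lines)
    for header in it:
        # If the input ends right after a header, next() raises StopIteration
        # inside the generator; Python converts that into a RuntimeError.
        sequence = next(it)
        yield header, sequence


try:
    # The last record is truncated: a header with no sequence line.
    list(read_pairs(["@read1", "ACGT", "@read2"]))
except RuntimeError as err:
    message = str(err)
    print(message)
```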

I'm using a machine with 8 threads and 64 GB of memory, so I'm guessing that the issue is not due to a lack of memory.

The lack of a full stack trace for the BLAZE subprocess makes this issue hard to troubleshoot. This is a downside of calling Python via reticulate::py_call_impl() rather than keeping the Python and R code separate (e.g., as different processes in a Nextflow pipeline).
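
One way to keep the Python stack visible across a language boundary is to catch the exception on the Python side and re-raise it with the formatted traceback embedded in the message. This is a minimal sketch of that idea, not something FLAMES or reticulate provides; `with_full_traceback` and `failing_step` are hypothetical names.

```python
import functools
import traceback


def with_full_traceback(func):
    """Wrap func so any exception is re-raised with its full traceback text."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            details = traceback.format_exc()
            # Chain the original exception and carry the stack in the message,
            # so a caller that only shows str(exception) still sees it.
            raise RuntimeError(f"{func.__name__} failed:\n{details}") from exc

    return wrapper


@with_full_traceback
def failing_step():
    # Stand-in for a Python step that fails deep inside a pipeline.
    return {}["OR4G4P"]


try:
    failing_step()
except RuntimeError as err:
    message = str(err)
    print(message)
```

With a wrapper like this, the message that reaches the calling side would include the Python file names and line numbers rather than only the final exception text.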