mritchielab / FLAMES

A framework for performing single-cell and bulk read full-length analysis of mutations and splicing.
https://mritchielab.github.io/FLAMES/
GNU General Public License v3.0
19 stars 9 forks source link

FileNotFoundError: matched_reads.fastq #30

Open nick-youngblut opened 7 months ago

nick-youngblut commented 7 months ago

The traceback:

Error in py_call_impl(callable, call_args$unnamed, call_args$named) : 
  FileNotFoundError: [Errno 2] No such file or directory: '/home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//flames/matched_reads.fastq'
Run `reticulate::py_last_error()` for details.
8.
stop(<environment>)
7.
remove_reads_from_fastq at count_gene.py#440
6.
quantification at count_gene.py#487
5.
count$quantification(annotation, outdir, pipeline, n_process)
4.
fun(...)
3.
basiliskRun(env = flames_env, fun = function(annotation, outdir,
pipeline) {
python_path <- system.file("python", package = "FLAMES")
count <- reticulate::import_from_path("count_gene", python_path) ...
2.
quantify_gene(annotation, outdir, config$pipeline_parameters$threads,
pipeline = "sc_single_sample")
1.
sc_long_pipeline(fastq = fastq_input, genome_fa = ref_fna_input,
annotation = ref_annot_input, outdir = work_dir, config = config_file,
minimap2_dir = minimap2_path, expect_cell_number = 8000)

There is indeed no file named matched_reads.fastq in the output directory. The files in the output directory are:

5.7G align2genome.bam
3.1M align2genome.bam.bai
98M gene_count.csv
30K saturation_curve.png
20K tmp_align.sam

The entire run log:

04:47:14 PM Mon Apr 08 2024 Start running
Skipping Demultiplexing step...
Please make sure the ` /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq.gz `` is the the demultiplexing output from previous FLAEMS call.
Running FLAMES pipeline...
#### Input parameters:
{
  "pipeline_parameters": {
    "seed": [2022],
    "threads": [12],
    "do_barcode_demultiplex": [false],
    "do_gene_quantification": [true],
    "do_genome_alignment": [true],
    "do_isoform_identification": [true],
    "bambu_isoform_identification": [false],
    "do_read_realignment": [true],
    "do_transcript_quantification": [true]
  },
  "barcode_parameters": {
    "max_bc_editdistance": [2],
    "max_flank_editdistance": [8],
    "pattern": {
      "primer": ["CTACACGACGCTCTTCCGATCT"],
      "BC": ["NNNNNNNNNNNNNNNN"],
      "UMI": ["NNNNNNNNNNNN"],
      "polyT": ["TTTTTTTTT"]
    },
    "TSO_seq": ["CCCATGTACTCTGCGTTGATACCACTGCTT"],
    "TSO_prime": [3],
    "full_length_only": [false]
  },
  "isoform_parameters": {
    "generate_raw_isoform": [false],
    "max_dist": [10],
    "max_ts_dist": [100],
    "max_splice_match_dist": [10],
    "min_fl_exon_len": [40],
    "max_site_per_splice": [3],
    "min_sup_cnt": [5],
    "min_cnt_pct": [0.001],
    "min_sup_pct": [0.2],
    "bambu_trust_reference": [true],
    "strand_specific": [0],
    "remove_incomp_reads": [4],
    "downsample_ratio": [1]
  },
  "alignment_parameters": {
    "use_junctions": [true],
    "no_flank": [false]
  },
  "realign_parameters": {
    "use_annotation": [true]
  },
  "transcript_counting": {
    "min_tr_coverage": [0.4],
    "min_read_coverage": [0.4]
  }
} 
gene annotation: /home/rstudio/workspace//data/references/human/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz 
genome fasta: /home/rstudio/workspace//data/references/human/GCF_000001405.40_GRCh38.p14_genomic.fna.gz 
input fastq: /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq.gz 
output directory: /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//flames 
directory containing minimap2: /home/rstudio/miniconda3/bin 
#### Aligning reads to genome using minimap2
04:47:14 PM Mon Apr 08 2024 minimap2_align
/usr/bin/samtools
[M::mm_idx_gen::52.282*2.13] collected minimizers
[M::mm_idx_gen::58.856*3.07] sorted minimizers
[M::main::58.856*3.07] loaded/built the index for 705 target sequence(s)
[M::mm_mapopt_update::60.124*3.03] mid_occ = 2177
[M::mm_idx_stat] kmer size: 14; skip: 5; is_hpc: 0; #seq: 705
[M::mm_idx_stat::60.761*3.01] distinct minimizers: 65648619 (22.00% are singletons); average occurrences: 16.258; average spacing: 3.090; total length: 3298430636
[M::worker_pipeline::294.690*9.93] mapped 482699 sequences
[M::worker_pipeline::516.151*10.82] mapped 485246 sequences
[M::worker_pipeline::734.365*11.18] mapped 494113 sequences
[M::worker_pipeline::952.066*11.37] mapped 492944 sequences
[M::worker_pipeline::1180.514*11.49] mapped 487522 sequences
[M::worker_pipeline::1425.665*11.58] mapped 468694 sequences
[M::worker_pipeline::1670.820*11.65] mapped 468982 sequences
[M::worker_pipeline::1915.218*11.69] mapped 468921 sequences
[M::worker_pipeline::2161.791*11.73] mapped 468068 sequences
[M::worker_pipeline::2405.559*11.76] mapped 469711 sequences
[M::worker_pipeline::2636.732*11.78] mapped 476070 sequences
[M::worker_pipeline::2854.468*11.80] mapped 482292 sequences
[M::worker_pipeline::3072.530*11.82] mapped 482075 sequences
[M::worker_pipeline::3289.883*11.83] mapped 482633 sequences
[M::worker_pipeline::3507.992*11.84] mapped 482293 sequences
[M::worker_pipeline::3667.939*11.83] mapped 377902 sequences
[M::main] Version: 2.28-r1209
[M::main] CMD: /home/rstudio/miniconda3/bin/minimap2 -ax splice -t 12 -k14 --secondary=no --seed 2022 --junc-bed /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//flames/tmp_splice_anno.bed12 --junc-bonus 1 /home/rstudio/workspace//data/references/human/GCF_000001405.40_GRCh38.p14_genomic.fna.gz /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq.gz
[M::main] Real time: 3668.571 sec; CPU: 43391.580 sec; Peak RSS: 21.167 GB
[bam_sort_core] merging from 17 files and 1 in-memory blocks...
05:56:31 PM Mon Apr 08 2024 Start gene quantification and UMI deduplication
05:56:31 PM Mon Apr 08 2024 quantify genes 
Found genome alignment file(s):     align2genome.bam
Assigning reads to genes...
Processed: 100%|██████████| 40375/40375 [02:41<00:00, 249.27gene_group/s]
Finalising the gene count matrix ...
Plotting the saturation curve ...
Generating deduplicated fastq file ...

I'm using FLAMES 1.8.0.

ChangqingW commented 7 months ago

I believe matched_reads.fastq is supposed to be created after the console output Generating deduplicated fastq file ..., maybe @youyupei can comment on this. In the mean time if you just want to get things done, it is perfectly fine to get around this by soft-linking ln -s input.fq matched_reads.fastq. Also, can you confirm the fastq file have the read IDs in the format of @[cell barcode]_[UMI]#[other stuff]?

youyupei commented 7 months ago

Hi @nick-youngblut , the matched_reads.fastq is expected to be in the output directory when the demultiplexing was done using the integrated demultiplexing step (either using BLAZE or flexiplex) in FLAMES . @ChangqingW is correct, if you have run BLAZE or flexible separately and have output into a different filename, you could simply make a symbolic link as @ChangqingW suggested.

As the genome alignment has been done, you don't have to rerun it. (you could set "do_genome_alignment": [false]).

nick-youngblut commented 7 months ago

Thanks @youyupei and @ChangqingW for the information.

BLAZE wrote a compressed fastq: matched_reads.fastq.gz. I used this file as input to the pipeline:

sc_long_pipeline(
    fastq = 'path/to/matches_reads.fastq.gz'
)

This is in the log that I provided:

input fastq: /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq.gz 

The first line from the fastq: @GGAGCAACAAGTGGCA_GGGTGAACTCGA#c3b08a02-a1e2-4cf5-a4ea-8474f5dd9789_+. So, it appears that the format is @[cell barcode]_[UMI]#[other stuff.

More generally, sc_long_pipeline() took a long time to fail due to providing gzip'd fastq as input. It would be helpful if all input was checked at the start of the pipeline, so that the software can fail fast.

ChangqingW commented 7 months ago

I see. This is a bug. When skipping the pipeline's demultiplexing step, the gene quantification parts did not respect the fastq input name and was searching for matched_reads.fastq instead.

nick-youngblut commented 6 months ago

It's also coded to look for an uncompressed fastq, but BLAZE outputs a compressed fastq. I was hoping to just symlink the matched_reads.fastq.gz into the FLAMES working directory that I'm using, but I'm going to have gunzip first.

nick-youngblut commented 5 months ago

I see. This is a bug. When skipping the pipeline's demultiplexing step, the gene quantification parts did not respect the fastq input name and was searching for matched_reads.fastq instead.

Just wanted to check: any updates on this bug?

ChangqingW commented 4 months ago

Hi Nick, Sorry, not yet. I am a bit inundated with some other tasks at the moment

nick-youngblut commented 4 months ago

Thanks @ChangqingW for the update