google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

OSError: error -3 while reading file #71

Closed: daaaaande closed this issue 11 months ago

daaaaande commented 1 year ago

Dear DeepConsensus team, I'm using DeepConsensus 1.2 with chunking of the input through a Snakemake pipeline (https://github.com/WestGermanGenomeCenter/deep_snake) on a local HPC cluster.

With chunking into 20 parts, I get the same error on one or more of the chunks while running DeepConsensus:

I1107 14:08:00.658040 48006032286144 quick_inference.py:931] Processed 114000 ZMWs in 95962.634 seconds
Traceback (most recent call last):
  File "/software/DeepConsensus/1.2.0/skylake/bin/deepconsensus", line 8, in <module>
    sys.exit(run())
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/deepconsensus/cli.py", line 118, in run
    app.run(main, flags_parser=parse_flags)
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/deepconsensus/cli.py", line 103, in main
    app.run(quick_inference.main, argv=passed)
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/deepconsensus/inference/quick_inference.py", line 977, in main
    outcome_counter = run()
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/deepconsensus/inference/quick_inference.py", line 912, in run
    for zmw, subreads, dc_config, window_widths in input_file_generator:
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/deepconsensus/inference/quick_inference.py", line 480, in stream_bam
    for input_data in proc_feeder():
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/deepconsensus/preprocess/pre_lib.py", line 1309, in proc_feeder
    for read_set in subread_grouper:
  File "/software/DeepConsensus/1.2.0/skylake/lib/python3.9/site-packages/deepconsensus/preprocess/pre_lib.py", line 73, in __next__
    read = next(self.bam_reader)
  File "pysam/libcalignmentfile.pyx", line 1876, in pysam.libcalignmentfile.AlignmentFile.__next__
OSError: error -3 while reading file

The .bam alignment file seems to be OK:

samtools quickcheck m54274Ue_230924_230713.subreads_15-of-20_ccs_lq_mapped_to_subreads.bam
ls -lth m54274Ue_230924_230713.subreads_15-of-20_ccs_lq_mapped_to_subreads.bam
-rwxrwx--- 1 user userss 40G Nov 2 19:44 m54274Ue_230924_230713.subreads_15-of-20_ccs_lq_mapped_to_subreads.bam

I did not find good information on what exactly error -3 is, but I'm sure the compute node is fine, since this error occurs for the same chunk on multiple nodes. Disk quota is also not the problem. The same error also occurs on a local non-HPC server with (uname): 6.2.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct 6 10:23:26 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Is this related to the pysam version? Here it is linux-64/pysam-0.22.0-py310h41dec4a_0.tar.bz2 on the local machine and 0.19 on the HPC; both throw the same error, even on the same part.
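
For reference, this is how I compare the pysam versions on the two machines (a minimal check, nothing DeepConsensus-specific):

# print the pysam version of the active environment; run on both
# the HPC node and the local machine to compare
python -c 'import pysam; print(pysam.__version__)'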

danielecook commented 1 year ago

@daaaaande I don't think the pysam version is an issue.

The error is happening here:

https://github.com/google/deepconsensus/blob/979108ddbc0a950ea493071690507ac32c8ee9cb/deepconsensus/preprocess/pre_lib.py#L68-L78

The error suggests an issue with the bam file - but I wonder if it's a memory issue?

  1. Try reducing --batch_zmws from 1000 to a smaller number in this line. We typically use 100 (see the sketch after this list).
  2. Do any of the chunks succeed? Can you examine the file sizes of the chunks that succeed vs. the ones that fail?
  3. Try to run with a larger number of chunks (e.g. 40) and see if the problem persists.
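
For suggestion 1, a sketch of the flag change (substitute your own chunk paths and checkpoint; the other flags stay as in your run):

deepconsensus run \
  --subreads_to_ccs=chunk_ccs_lq_mapped_to_subreads.bam \
  --ccs_bam=chunk_ccs_lq.bam \
  --checkpoint=checkpoint \
  --output=chunk_deepconsensus.fastq \
  --batch_zmws=100
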
daaaaande commented 1 year ago

Thanks for the swift reply! :) To address your suggestions:

  1. Made the parameter changeable with a default of 100. Currently testing this - will report back if it changes anything!
  2. Most chunks do succeed; no difference in file sizes between the chunks so far.
  3. I already tried chunking into 100 parts; sadly that resulted in the same error, just on more chunks. The resources given to the job are not the problem as far as I can tell, but just to make sure I increased the time and memory again. Thanks for the feedback, and I will report back!

daaaaande commented 12 months ago

Update:

I'm now trying even smaller chunks (100) and a ZMW batch size of 50.

daaaaande commented 11 months ago

Update II:

danielecook commented 11 months ago

Thanks for the update @daaaaande - it might be helpful if you provide the command you are running.

daaaaande commented 11 months ago

@danielecook thanks for the hint; here is more information:

the logfile of deepconsensus:

I1120 19:59:39.677627 47130953445824 quick_inference.py:693] Example summary: ran model=3797 (75.83%; 7.357s) skip=1210 (24.17%; 0.148s) total=5007.
I1120 19:59:39.783081 47130953445824 quick_inference.py:770] Processed a batch of 20 ZMWs in 8.323 seconds
I1120 19:59:39.787414 47130953445824 quick_inference.py:931] Processed 15700 ZMWs in 8150.246 seconds
Traceback (most recent call last):
  File "/software/DeepConsensus/1.2.0/skylake/bin/deepconsensus", line 8, in <module>
    sys.exit(run())
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/deepconsensus/cli.py", line 118, in run
    app.run(main, flags_parser=parse_flags)
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/deepconsensus/cli.py", line 103, in main
    app.run(quick_inference.main, argv=passed)
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/deepconsensus/inference/quick_inference.py", line 977, in main
    outcome_counter = run()
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/deepconsensus/inference/quick_inference.py", line 912, in run
    for zmw, subreads, dc_config, window_widths in input_file_generator:
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/deepconsensus/inference/quick_inference.py", line 480, in stream_bam
    for input_data in proc_feeder():
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/deepconsensus/preprocess/pre_lib.py", line 1309, in proc_feeder
    for read_set in subread_grouper:
  File "/software/DeepConsensus/1.2.0/icelake/lib/python3.9/site-packages/deepconsensus/preprocess/pre_lib.py", line 73, in __next__
    read = next(self.bam_reader)
  File "pysam/libcalignmentfile.pyx", line 1876, in pysam.libcalignmentfile.AlignmentFile.__next__
OSError: error -3 while reading file

Snakemake error message:

Error in rule run_deep:
    jobid: 12
    input: m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam, m54274Ue_230926_083040.subreads_4-of-120_ccs_lq.bam
    output: m54274Ue_230926_083040.subreads_4-of-120_deepconsensus.fastq
    conda-env: .snakemake/conda/94e59aa9aceab381a4abc6c7f412b7a1_
    shell:

       module load DeepConsensus &>m54274Ue_230926_083040.subreads_4-of-120_deepconsensus_logfile.log
       /software/DeepConsensus/1.2.0/skylake/bin/deepconsensus run --subreads_to_ccs=m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam --ccs_bam=m54274Ue_230926_083040.subreads_4-of-120_ccs_lq.bam --checkpoint=checkpoint --output=m54274Ue_230926_083040.subreads_4-of-120_deepconsensus.fastq --batch_zmws 20 --batch_size=512 --cpus=16 &>m54274Ue_230926_083040.subreads_4-of-120_deepconsensus_logfile.log

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 8691257.hpc-batch

Error executing rule run_deep on cluster (jobid: 12, external: 8691257.hpc-batch, jobscript: .snakemake/tmp.8cb82736/snakejob.run_deep.12.sh). For error details see the cluster log and the log files of the involved rule(s).
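
To take the scheduler out of the equation, the failing chunk can also be re-run interactively with the same command Snakemake issues (a sketch based on the rule above, with a throwaway output name):

/software/DeepConsensus/1.2.0/skylake/bin/deepconsensus run \
  --subreads_to_ccs=m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam \
  --ccs_bam=m54274Ue_230926_083040.subreads_4-of-120_ccs_lq.bam \
  --checkpoint=checkpoint \
  --output=rerun_test_deepconsensus.fastq \
  --batch_zmws 20 --batch_size=512 --cpus=16

(As noted above, the same error also shows up on a local non-HPC server, so the cluster setup itself does not seem to be the cause.)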

ls of the affected chunk:

ls -lth 4-of-120

-rwxrwx--- 1 daric102 users 282K Nov 20 19:59 m54274Ue_230926_083040.subreads_4-of-120_deepconsensus_logfile.log
-rwxrwx--- 1 daric102 users 5.8G Nov 20 16:23 m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam
-rwxrwx--- 1 daric102 users  19M Nov 20 16:23 m54274Ue_230926_083040.subreads_4-of-120_actc_.log
-rwxrwx--- 1 daric102 users 284M Nov 20 16:19 m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.fasta
-rwxrwx--- 1 daric102 users 344M Nov 19 13:17 m54274Ue_230926_083040.subreads_4-of-120_ccs_lq.bam
-rwxrwx--- 1 daric102 users 1.5K Nov 19 13:17 m54274Ue_230926_083040.subreads_4-of-120_ccs_lq.ccs_report.txt
-rwxrwx--- 1 daric102 users 2.5M Nov 19 13:17 m54274Ue_230926_083040.subreads_4-of-120_ccs_lq.zmw_metrics.json.gz
-rwxrwx--- 1 daric102 users 219K Nov 19 13:17 m54274Ue_230926_083040.subreads_4-of-120_ccs_lq.bam.pbi

ls of a typical, non-affected chunk:

ls -lth 14-of-120

-rwxrwx--- 1 daric102 users 540M Nov 20 19:55 m54274Ue_230926_083040.subreads_14-of-120_deepconsensus.fastq
-rwxrwx--- 1 daric102 users   88 Nov 20 19:54 m54274Ue_230926_083040.subreads_14-of-120_deepconsensus.fastq.inference.json
-rwxrwx--- 1 daric102 users 111K Nov 20 19:54 m54274Ue_230926_083040.subreads_14-of-120_deepconsensus.fastq.runtime.csv
-rwxrwx--- 1 daric102 users 272K Nov 20 19:54 m54274Ue_230926_083040.subreads_14-of-120_deepconsensus_logfile.log
-rwxrwx--- 1 daric102 users 5.5G Nov 20 16:23 m54274Ue_230926_083040.subreads_14-of-120_ccs_lq_mapped_to_subreads.bam
-rwxrwx--- 1 daric102 users  19M Nov 20 16:22 m54274Ue_230926_083040.subreads_14-of-120_actc_.log
-rwxrwx--- 1 daric102 users 276M Nov 20 16:20 m54274Ue_230926_083040.subreads_14-of-120_ccs_lq_mapped_to_subreads.fasta
-rwxrwx--- 1 daric102 users 332M Nov 19 12:30 m54274Ue_230926_083040.subreads_14-of-120_ccs_lq.bam
-rwxrwx--- 1 daric102 users 1.5K Nov 19 12:30 m54274Ue_230926_083040.subreads_14-of-120_ccs_lq.ccs_report.txt
-rwxrwx--- 1 daric102 users 2.4M Nov 19 12:30 m54274Ue_230926_083040.subreads_14-of-120_ccs_lq.zmw_metrics.json.gz
-rwxrwx--- 1 daric102 users 210K Nov 19 12:30 m54274Ue_230926_083040.subreads_14-of-120_ccs_lq.bam.pbi

actc used:

(test_2) [daric102@hilbert213 deepconsensus_snake]$ conda activate .snakemake/conda/94e59aa9aceab381a4abc6c7f412b7a1
(.snakemake/conda/94e59aa9aceab381a4abc6c7f412b7a1_) [daric102@hilbert213 deepconsensus_snake]$ actc --version
actc 0.6.0

Using:
  actc     : 0.6.0 (commit v0.6.0)
  pbbam    : 2.4.99 (commit v2.4.0-23-g59248fe)
  pbcopper : 2.3.99 (commit v2.3.0-28-ga9b1ffa)
  boost    : 1.81
  htslib   : 1.17
  zlib     : 1.2.13

checking the stats of the failed chunk:

samtools stats m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam >m54274Ue_230926_083040.subreads_4-of-120_mapped_stats_.out
Failure while decoding file

loading the deepconsensus module:

module load DeepConsensus

  CUDA Toolkit 11.4.3
  |--------------------------------------------------------------~o
    binaries      : /software/CUDA/11.4.3/bin
    libraries     : /software/CUDA/11.4.3/lib64
    includes      : /software/CUDA/11.4.3/include
    documentation : /software/CUDA/11.4.3/doc

  Intel oneAPI-2022.2
  |--------------------------------------------------------------~o
    C compiler         : icc
    C++ compiler       : icpc
    FORTRAN77 compiler : ifort
    FORTRAN90 compiler : ifort
    FORTRAN95 compiler : ifort

    runtime libraries  : /software/intel/oneAPI-2022.2/compiler/lib/intel64

  Intel Math Kernel Library (2022.2.1.0)
  |--------------------------------------------------------------~o
    libraries          : /software/intel/oneAPI-2022.2/mkl/2022.1.0/lib/intel64
    interfaces         : /software/intel/oneAPI-2022.2/mkl/2022.1.0/interfaces

  IntelMPI (2021.6.0)
  |--------------------------------------------------------------~o
    C compiler         : mpiicc
    C++ compiler       : mpiicpc
    FORTRAN77 compiler : mpiifort
    FORTRAN90 compiler : mpiifort
    FORTRAN95 compiler : mpiifort

    runtime libraries  : /software/intel/oneAPI-2022.2/mpi/2021.6.0/intel64/lib

  Intel Python Version 3.9.10
  |--------------------------------------------------------------~o
    binaries  : /software/intel/oneAPI-2022.2/intelpython/latest/bin
    libs       : /software/intel/oneAPI-2022.2/intelpython/latest/lib
    includes  : /software/intel/oneAPI-2022.2/intelpython/latest/include

  DeepConsensus 1.2.0

|--------------------------------------------------------------~o
    binaries : /software/DeepConsensus/1.2.0/skylake/bin

resources requested per DeepConsensus chunk:

 "nodes" : 1,
      "time" : "8:59:59",
      "mem" : "72GB",
      "ncpus" : "16",
      "ngpus" : "0",
      "arch" : "skylake",

actc log:

...
| 20231120 15:23:01.419 | DEBUG    | CLR parser     : m54274Ue_230926_083040/6751778/241410_263248
| 20231120 15:23:01.419 | DEBUG    | CLR parser     : m54274Ue_230926_083040/6751778/263277_269659
| 20231120 15:23:01.419 | FATAL    | CCS reader     : CCS ZMW 6751779 has multiple records. Ignoring ZMW!
...
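
Given the FATAL line, a quick way to check whether the chunk's CCS BAM really contains multiple records for a single ZMW (a sketch, assuming samtools is on the PATH and read names follow the movie/ZMW/ccs scheme):

# list CCS read names that occur more than once in the chunk's CCS BAM
samtools view m54274Ue_230926_083040.subreads_4-of-120_ccs_lq.bam \
  | cut -f1 | sort | uniq -d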

ccs report:

                         Double-Strand Reads  Single-Strand Reads
Inputs                   :      41714 (94.03%)        5292 (5.965%)

Passed                   :      15604 (37.41%)        4553 (86.04%)
Failed                   :      26110 (62.59%)         739 (13.96%)

Tandem repeats           :        335 (1.283%)         128 (17.32%)

Exclusive failed counts
Below SNR threshold      :        715 (2.738%)           0 (0.000%)
Median length filter     :          0 (0.000%)           0 (0.000%)
Shortcut filters         :          0 (0.000%)           0 (0.000%)
Lacking full passes      :      24841 (95.14%)         638 (86.33%)
Coverage drops           :         69 (0.264%)          14 (1.894%)
Insufficient draft cov   :        101 (0.387%)          16 (2.165%)
Draft too different      :          0 (0.000%)           0 (0.000%)
Draft generation error   :        374 (1.432%)          68 (9.202%)
Draft above --max-length :          1 (0.004%)           0 (0.000%)
Draft below --min-length :          0 (0.000%)           0 (0.000%)
Reads failed polishing   :          0 (0.000%)           0 (0.000%)
Empty coverage windows   :          7 (0.027%)           1 (0.135%)
CCS did not converge     :          1 (0.004%)           2 (0.271%)
CCS below minimum RQ     :          1 (0.004%)           0 (0.000%)
Unknown error            :          0 (0.000%)           0 (0.000%)

Additional passing metrics
Missing adapters         :        408 (2.615%)         169 (3.712%)

samtools stats of the ccs_lq.bam:

# This file was produced by samtools stats (1.18+htslib-1.18) and can be plotted using plot-bamstats
# This file contains statistics for all reads.
# The command line was:  stats m54274Ue_230926_083040.subreads_4-of-120_ccs_lq.bam
# CHK, Checksum [2]Read Names   [3]Sequences    [4]Qualities
# CHK, CRC32 of reads which passed filtering followed by addition (32bit overflow)
CHK     31679cf3        96079b4e        df23c499
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN      raw total sequences:    20157   # excluding supplementary and secondary reads
SN      filtered sequences:     0
SN      sequences:      20157
SN      is sorted:      1
SN      1st fragments:  20157
SN      last fragments: 0
SN      reads mapped:   0
SN      reads mapped and paired:        0       # paired-end technology bit set + both mates mapped
SN      reads unmapped: 20157
SN      reads properly paired:  0       # proper-pair bit set
SN      reads paired:   0       # paired-end technology bit set
SN      reads duplicated:       0       # PCR or optical duplicate bit set
SN      reads MQ0:      0       # mapped and MQ=0
SN      reads QC failed:        0
SN      non-primary alignments: 0
SN      supplementary alignments:       0
SN      total length:   385773399       # ignores clipping
SN      total first fragment length:    385773399       # ignores clipping
SN      total last fragment length:     0       # ignores clipping
SN      bases mapped:   0       # ignores clipping
SN      bases mapped (cigar):   0       # more accurate
SN      bases trimmed:  0
SN      bases duplicated:       0
SN      mismatches:     0       # from NM fields
SN      error rate:     0.000000e+00    # mismatches / bases mapped (cigar)
SN      average length: 19138
SN      average first fragment length:  19138
SN      average last fragment length:   0
SN      maximum length: 46259
SN      maximum first fragment length:  46259
SN      maximum last fragment length:   0

checking the ccs_lq_mapped_to_subreads.bam again with samtools quickcheck:

samtools quickcheck m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam -vvvvv
verbosity set to 5
checking m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam
opened m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam
m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam is sequence data
m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam has 15706 targets in header.
m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam has good EOF block.

If you want, I can share loads of logs - just let me know what you think would be the most useful! Also, the fact that the ccs_lq_mapped_to_subreads.bam cannot be parsed by samtools stats gives me the idea that this .bam file is corrupt / not according to spec / confusing DeepConsensus - but quickcheck does not complain. Any idea there?
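
(From what I read, samtools quickcheck only verifies the header and the presence of the BGZF EOF block; it does not decompress the record data, so a BAM with a corrupt block in the middle can still pass while a full read fails. A sketch to force a full decode and see roughly how far the reader gets before hitting the bad block:)

# decode every record; unlike quickcheck this reads all BGZF blocks;
# the error lands in decode_error.log, the count shows how far it got
samtools view m54274Ue_230926_083040.subreads_4-of-120_ccs_lq_mapped_to_subreads.bam 2>decode_error.log | wc -l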

daaaaande commented 11 months ago

mini-update: it seems like I fixed the issue by downgrading actc to 0.2.0. I still need to test more, but as soon as another SMRT Cell also works with this, I will close the issue.
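
In case it helps anyone hitting the same thing, a sketch for pinning the older actc, assuming it is installed from the bioconda channel:

# pin actc to 0.2.0 in the env used for the ccs-to-subreads alignment step
conda install -c bioconda actc=0.2.0
actc --version   # should now report 0.2.0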

daaaaande commented 11 months ago

The actc downgrade seems to fix the issue. A second SMRT Cell also ran without hiccups.