pipeline crashed : scaffold #1

Closed lucabianco78 closed 2 years ago

lucabianco78 commented 2 years ago


I am trying to use gapless on a genome with ONT data. Unfortunately, I get the error below. When it crashes I get this message to std out: "pipeline crashed : scaffold"

Can you please give any advice? Thanks

Traceback (most recent call last):
  File "/usr/local/bin/gapless//", line 13263, in <module>
    main(sys.argv[1:])
  File "/usr/local/bin/gapless//", line 13092, in main
    GaplessScaffold(args[0], args[1], args[2], min_mapq, min_mapping_length, min_length_contig_break, prefix, stats)
  File "/usr/local/bin/gapless//", line 9039, in GaplessScaffold
    scaffold_paths, trim_repeats = ScaffoldContigs(contig_parts, bridges, mappings, cov_probs, repeats, prob_factor, min_mapping_length, max_dist_contig_end, prematurity_threshold, ploidy, max_loop_units)
  File "/usr/local/bin/gapless//", line 7783, in ScaffoldContigs
    scaffold_paths = TraverseScaffoldGraph(scaffolds, scaffold_graph, graph_ext, scaf_bridges, org_scaf_conns, ploidy, max_loop_units)
  File "/usr/local/bin/gapless//", line 7314, in TraverseScaffoldGraph
    CheckIfScaffoldPathsFollowsValidBridges(scaffold_paths, scaf_bridges, ploidy)
  File "/usr/local/bin/gapless//", line 4746, in CheckIfScaffoldPathsFollowsValidBridges
    raise RuntimeError("Scaffold path contains invalid bridges.")
RuntimeError: Scaffold path contains invalid bridges.

schmeing commented 2 years ago

Hi,

Thanks for letting me know. More information on what happened will be in the file passX/logs/gapless_scaffold.log within your selected output folder, where X is the number of the pass in which it crashed, so the log with the highest number is the relevant one. I will need to debug this and fix the code. The simplest way would be for you to provide me with the exact command and all the input files at . Otherwise, we will likely need several iterations of me telling you where to change the code and you telling me the results.

Best, Stephan
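For anyone who wants to locate that log programmatically, here is a minimal sketch. It assumes only the passX/logs/gapless_scaffold.log layout described above; the function name `latest_scaffold_log` and the output folder name in the usage line are hypothetical and not part of gapless.

```python
import re
from pathlib import Path


def latest_scaffold_log(output_dir):
    """Return the scaffold log of the highest-numbered pass, or None if none exists."""
    candidates = []
    for log in Path(output_dir).glob("pass*/logs/gapless_scaffold.log"):
        # log.parent is ".../passX/logs", so log.parent.parent is ".../passX"
        match = re.fullmatch(r"pass(\d+)", log.parent.parent.name)
        if match:
            candidates.append((int(match.group(1)), log))
    if not candidates:
        return None
    # Compare pass numbers as integers so that pass10 sorts after pass2
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    # "gapless_run" is a hypothetical output folder name
    print(latest_scaffold_log("gapless_run"))
```

Comparing the pass numbers as integers rather than as strings matters once a run goes beyond nine passes.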

aabaricalla commented 2 years ago

Hi there! I was testing Gapless with my data and I have the same problem with PB CLR data. Any suggestion would be great. I'll be waiting for any solution or update.


schmeing commented 2 years ago

The issue was caused by a change of behaviour between pandas versions 1.3.1 and 1.4.2. I assume it was unintentional, but I still have to create a minimal working example and check with the pandas team. Independent of that, I added a workaround and pushed it to GitHub. Please verify whether your data works with the new gapless commit.

Thanks for bringing this to my attention.

Mjaraespejo commented 2 years ago


I am also having issues during scaffolding. I am using PacBio HiFi data. I am attaching the gapless_scaffold.log file. Hope you can help me.


Thanks, Manuel

schmeing commented 2 years ago

Hello Manuel,

thank you for reporting this bug. In all my own tests I never managed to create this inconsistent state. It could be caused nearly anywhere in the pipeline. Thus, unfortunately without your data there is little I can do. I assume you ran the In that case, if you can share the gapless_split.fa, gapless_reads.paf and gapless_split_repeats.paf with me at I will go through the scaffolding and see what is causing it.

Thanks, Stephan

splaisan commented 2 years ago


I ran gapless a few days ago and today I ran into this same error (I ran git pull without improvement, so I am up to date). A bioconda dependency list would be nice, to be able to reinstall the right tools and get a functional env. Thanks in advance for your help.

The pandas version in my gapless env is v1.3.1.

cat gapless_run/pass1/logs/gapless_scaffold.log 
0:00:07.512620 Reading in original assembly
0:00:07.540919 Loading repeats
0:00:07.545660 Filtering mappings
Traceback (most recent call last):
  File "/opt/biotools/bin/", line 13325, in <module>
  File "/opt/biotools/bin/", line 13154, in main
    GaplessScaffold(args[0], args[1], args[2], min_mapq, min_mapping_length, min_length_contig_break, prefix, stats)
  File "/opt/biotools/bin/", line 9068, in GaplessScaffold
    mappings, cov_counts, cov_probs, read_names, read_len = ReadMappings(mapping_file, contig_ids, min_mapq, min_mapping_length, keep_all_subreads, alignment_precision, num_read_len_groups, pdf)
  File "/opt/biotools/bin/", line 415, in ReadMappings
    mappings = ReadPaf(mapping_file)
  File "/opt/biotools/bin/", line 217, in ReadPaf
    return pd.read_csv(file_name, sep='\t', header=None, usecols=range(12), names=['q_name','q_len','q_start','q_end','strand','t_name','t_len','t_start','t_end','matches','alignment_length','mapq'], dtype={'q_name':object, 'q_len':np.int32, 'q_start':np.int32, 'q_end':np.int32, 'strand':str, 't_name':object, 't_len':np.int32, 't_start':np.int32, 't_end':np.int32, 'matches':np.int32, 'alignment_length':np.int32, 'mapq':np.int16})
  File "/opt/miniconda3/envs/gapless/lib/python3.9/site-packages/pandas/util/", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/miniconda3/envs/gapless/lib/python3.9/site-packages/pandas/io/parsers/", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/miniconda3/envs/gapless/lib/python3.9/site-packages/pandas/io/parsers/", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/miniconda3/envs/gapless/lib/python3.9/site-packages/pandas/io/parsers/", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/opt/miniconda3/envs/gapless/lib/python3.9/site-packages/pandas/io/parsers/", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/opt/miniconda3/envs/gapless/lib/python3.9/site-packages/pandas/io/parsers/", line 51, in __init__
    self._open_handles(src, kwds)
  File "/opt/miniconda3/envs/gapless/lib/python3.9/site-packages/pandas/io/parsers/", line 222, in _open_handles
    self.handles = get_handle(
  File "/opt/miniconda3/envs/gapless/lib/python3.9/site-packages/pandas/io/", line 701, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'pass1/gapless_reads.paf'
splaisan commented 2 years ago

Hi again Stephan,

I rediscovered the initial cause of my crash: my fastq reads were compressed!

I had fixed that before thanks to some ticket but had since forgotten. If confirmed, could you please write this clearly in the and maybe include the conda env I attached below (and maybe the link to your bioRxiv manuscript)?

Now, with my plain fastq reads the command crashed later in the process: pipeline crashed: finish


Note: I use conda 4.12.0 as the next version breaks some of my existing envs

RUN and die \
  -i flye_pilon.fa \
  -j 24 \
  -n 3 \
  -o gl_out \
  -t pb_hifi \

I attach the environment file for testing as well as the archive of my failed run

Thanks a lot for your tool!



Note: I just sent you a Filesender link to the reads (8.6 GB) to your uzh email .

schmeing commented 2 years ago

Hi, regarding your first issue: Gzipped reads should work. However, the crash does not seem to come from gapless, but from minimap2. To see what the problems with the compressed reads are you need to look at: logs/minimap2_reads.log
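A small sketch of how that log could be scanned for the relevant lines. It assumes that minimap2 prefixes fatal messages with "ERROR", as in the minimap2 output posted further down in this thread; the helper name `minimap2_errors` and the sample path are hypothetical.

```python
def minimap2_errors(log_text):
    """Return the lines of a minimap2 log that report an error."""
    return [line.strip() for line in log_text.splitlines()
            if line.lstrip().startswith("ERROR")]


# Shortened sample in the style of a minimap2 log; the reads path is made up
sample_log = """\
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 32
ERROR: failed to open file 'reads/example.fq.gz': No such file or directory
ERROR: failed to map the query file
"""

for line in minimap2_errors(sample_log):
    print(line)
```

In practice the text would come from reading logs/minimap2_reads.log inside the pass folder. An empty result would suggest that the mapping step itself is not the culprit.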

Regarding the conda environment: I will create a bioconda package once the errors trickling in are fixed. Recently, I created a new conda environment for gapless with the following command:

conda create -c conda-forge --name gapless python=3.10.2 pandas=1.4.2 numpy=1.22.3 scipy=1.8.0 seaborn matplotlib pillow biopython

However, this does not include the external requirements minimap2, racon and seqtk, which are used in the bash script. Conda packages for those exist and can be added if desired:

conda install -c bioconda minimap2 seqtk racon

The versions should also not be of great importance, so if you get errors with other (recent) versions (lowest I tried was python 3.6 and pandas 1.1.0) please let me know.

schmeing commented 2 years ago

An important comment: If you send something to my uzh email please do so as Stephane did and state the problem here and announce that you sent something. This email account has much more spam than relevant emails these days. Without his announcement here I would have missed the email! So if somebody has not received an answer by now, I simply have missed your email.

schmeing commented 2 years ago

Hi Stephane,

Your second issue is now resolved. I pushed the fix to GitHub a minute ago. Thank you for providing all the data for a quick reproduction and fix of the bug.

splaisan commented 2 years ago

Hi Stephan,

Thanks for your feedback and the new version. I pulled the current git and created a whole new env as detailed above. The env was created without issues; I only noticed that when adding minimap2, three initial packages were changed because of bioconda. No idea if this will be relevant at a later stage, just mentioning it.

The following packages will be SUPERSEDED by a higher-priority channel:

  ca-certificates    conda-forge::ca-certificates-2022.6.1~ --> pkgs/main::ca-certificates-2022.4.26-h06a4308_0
  certifi            conda-forge::certifi-2022.6.15-py310h~ --> pkgs/main::certifi-2022.5.18.1-py310h06a4308_0
  openssl            conda-forge::openssl-1.1.1o-h166bdaf_0 --> pkgs/main::openssl-1.1.1o-h7f8727e_0

When running my code again, I immediately got the scaffolding error. I looked into the logs and it seems that "${org_path}/${reads}" at line 138 of your bash wrapper duplicated the read path (the reads are in a separate folder and given as a full path), leading to an early crash.

I copied the reads.gz locally and it now runs and finishes. (btw, this is why I thought gzip was a problem when I copied the decompressed reads locally before and it worked better)

Thanks for your help! Stephane :switzerland:


[M::mm_idx_gen::0.359*1.01] collected minimizers
[M::mm_idx_gen::0.411*4.79] sorted minimizers
[M::main::0.411*4.79] loaded/built the index for 32 target sequence(s)
[M::mm_mapopt_update::0.447*4.49] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 32
[M::mm_idx_stat::0.470*4.32] distinct minimizers: 2084403 (95.23% are singletons); average occurrences: 1.075; average spacing: 5.495; total length: 12317436
ERROR: failed to open file '/data/analyses/SRR18210286_Scerevisiae_HiFi/gapless_assemblies//data/analyses/SRR18210286_Scerevisiae_HiFi/reads/SRR18210286.fq.gz': No such file or directory
ERROR: failed to map the query file

my command was:


source /etc/profile.d/
# reads=SRR18210286.fq.gz
cp ../pilon_assemblies/*_pilon.fa .

# conda create -c conda-forge --name gapless python=3.10.2 pandas=1.4.2 \
#  numpy=1.22.3 scipy=1.8.0 seaborn matplotlib pillow biopython
# then within the new env:
# conda install -c bioconda minimap2 seqtk racon

# use a brace group rather than a subshell so that exit ends the whole script
conda activate ${myenv} || {
  echo "# the conda environment ${myenv} was not found on this machine"
  echo "# please read the top part of the script!"
  exit 1
}

for asm in flye_pilon.fa; do
#  put aside for debugging: hicanu_pilon.fa hifiasm_pilon.fa ipa_pilon.fa nd_pilon.fa; do


echo "# gapless scaffolding for ${pfx}" \
  -i ${asm} \
  -j ${thr} \
  -n 3 \
  -o gapless_${pfx} \
  -t pb_hifi \

# copy final asm to local folder
cp gapless_${pfx}/gapless.fa ${pfx%_pilon}_gapless.fa


conda deactivate

exit 0
schmeing commented 2 years ago

Sorry for that. I did not support absolute paths. That is fixed now.
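To illustrate the failure mode, here is a hedged sketch in Python; the actual wrapper is a bash script and its fix is not shown in this thread. The helper name `resolve_reads` is hypothetical. The example relies on the documented behaviour of `posixpath.join`, which discards earlier components when a later one is absolute, whereas plain string concatenation reproduces the doubled path from the minimap2 error above.

```python
import posixpath


def resolve_reads(org_path, reads):
    """Join reads onto org_path unless reads is already an absolute path."""
    # posixpath.join drops org_path when reads starts with "/"
    return posixpath.normpath(posixpath.join(org_path, reads))


org_path = "/data/analyses/SRR18210286_Scerevisiae_HiFi/gapless_assemblies"
absolute_reads = "/data/analyses/SRR18210286_Scerevisiae_HiFi/reads/SRR18210286.fq.gz"

# Naive concatenation, equivalent to "${org_path}/${reads}" in the wrapper
naive = f"{org_path}/{absolute_reads}"
print(naive)

# Absolute paths are kept as given, relative ones are anchored at org_path
print(resolve_reads(org_path, absolute_reads))
print(resolve_reads(org_path, "SRR18210286.fq.gz"))
```

In bash, a comparable check would test whether the reads argument starts with a slash before prepending the original working directory.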

splaisan commented 2 years ago

Hi Stephan,

Today I have a weird issue: the pipeline, which worked for two assemblies (made with IPA and flye on the same reads), has now failed twice with the assembly from hicanu.

When I redo ipa after the failed hicanu, it works

Another assembly fails (hifiasm) with the same error.

Could it be that hicanu and hifiasm issue haplotigs (pairs of contigs) rather than consensus contigs and somehow this is bothering your pipeline?

I echo the scaffold log and attach the assembly input file, the reads are the same as before.

# hicanu log
$ cat gapless_scaffold.log
/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/seaborn/ FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/seaborn/ FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/biotools/bin/ UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set(xticklabels=np.where(locs.astype(int) == locs, (10 ** locs).astype(str), ""))
/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/seaborn/ FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/biotools/bin/ UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set(xticklabels=np.where(locs.astype(int) == locs, (10 ** locs).astype(str), ""))
/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/seaborn/ FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/biotools/bin/ UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set(xticklabels=np.where(locs.astype(int) == locs, (10 ** locs).astype(str), ""))
0:00:07.969616 Reading in original assembly
0:00:08.011044 Loading repeats
0:00:08.043497 Filtering mappings
0:00:18.317045 Search for possible break points
0:00:38.241233 Search for possible bridges
0:00:38.436451 Scaffold the contigs
Traceback (most recent call last):
  File "/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/numpy/core/", line 57, in _wrapfunc
    return bound(*args, **kwds)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/biotools/bin/", line 13326, in <module>
  File "/opt/biotools/bin/", line 13155, in main
    GaplessScaffold(args[0], args[1], args[2], min_mapq, min_mapping_length, min_length_contig_break, prefix, stats)
  File "/opt/biotools/bin/", line 9100, in GaplessScaffold
    scaffold_paths, trim_repeats = ScaffoldContigs(contig_parts, bridges, mappings, cov_probs, repeats, prob_factor, min_mapping_length, max_dist_contig_end, prematurity_threshold, ploidy, max_loop_units)
  File "/opt/biotools/bin/", line 7847, in ScaffoldContigs
    scaffold_paths = ExpandScaffoldsWithContigs(scaffold_paths, scaffolds, scaffold_parts, ploidy)
  File "/opt/biotools/bin/", line 7697, in ExpandScaffoldsWithContigs
    scaffold_paths = scaffold_paths.loc[np.repeat(scaffold_paths.index.values, scaffold_paths[[f'size{h}' for h in range(ploidy)]].max(axis=1).values)]
  File "<__array_function__ internals>", line 180, in repeat
  File "/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/numpy/core/", line 479, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/numpy/core/", line 66, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/numpy/core/", line 43, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

# hifiasm log
$ cat gapless_scaffold.log
/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/seaborn/ FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/seaborn/ FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/biotools/bin/ UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set(xticklabels=np.where(locs.astype(int) == locs, (10 ** locs).astype(str), ""))
/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/seaborn/ FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/biotools/bin/ UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set(xticklabels=np.where(locs.astype(int) == locs, (10 ** locs).astype(str), ""))
/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/seaborn/ FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/opt/biotools/bin/ UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set(xticklabels=np.where(locs.astype(int) == locs, (10 ** locs).astype(str), ""))
0:00:07.740485 Reading in original assembly
0:00:07.776979 Loading repeats
0:00:07.814693 Filtering mappings
0:00:18.338776 Search for possible break points
0:00:38.278662 Search for possible bridges
0:00:40.703660 Scaffold the contigs
Traceback (most recent call last):
  File "/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/numpy/core/", line 57, in _wrapfunc
    return bound(*args, **kwds)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/biotools/bin/", line 13326, in <module>
  File "/opt/biotools/bin/", line 13155, in main
    GaplessScaffold(args[0], args[1], args[2], min_mapq, min_mapping_length, min_length_contig_break, prefix, stats)
  File "/opt/biotools/bin/", line 9100, in GaplessScaffold
    scaffold_paths, trim_repeats = ScaffoldContigs(contig_parts, bridges, mappings, cov_probs, repeats, prob_factor, min_mapping_length, max_dist_contig_end, prematurity_threshold, ploidy, max_loop_units)
  File "/opt/biotools/bin/", line 7847, in ScaffoldContigs
    scaffold_paths = ExpandScaffoldsWithContigs(scaffold_paths, scaffolds, scaffold_parts, ploidy)
  File "/opt/biotools/bin/", line 7697, in ExpandScaffoldsWithContigs
    scaffold_paths = scaffold_paths.loc[np.repeat(scaffold_paths.index.values, scaffold_paths[[f'size{h}' for h in range(ploidy)]].max(axis=1).values)]
  File "<__array_function__ internals>", line 180, in repeat
  File "/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/numpy/core/", line 479, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/numpy/core/", line 66, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/opt/miniconda3/envs/gapless/lib/python3.10/site-packages/numpy/core/", line 43, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
schmeing commented 2 years ago

I pushed the fix. Thanks for finding the bug.
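For readers who hit the same TypeError elsewhere, the following is a minimal reproduction with one possible remedy. It is not the fix that was pushed, which is not shown in this thread. It assumes the float64 counts come from a missing value in one of the size columns, which is only one plausible cause; the function name `expand_rows` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def expand_rows(paths, size_cols):
    """Repeat each row max(size columns) times, casting the counts explicitly."""
    counts = paths[size_cols].max(axis=1)
    # A NaN in any size column makes the row-wise maximum float64,
    # and np.repeat refuses to cast float64 counts to integers.
    counts = counts.fillna(0).astype(np.int64)
    return paths.loc[np.repeat(paths.index.values, counts.values)]


# Toy data: size1 contains a missing value, so it is stored as float64
paths = pd.DataFrame({"size0": [2, 1, 3], "size1": [1.0, np.nan, 2.0]})

try:
    np.repeat(paths.index.values, paths[["size0", "size1"]].max(axis=1).values)
    raised = False
except TypeError as err:
    raised = True
    print(err)

expanded = expand_rows(paths, ["size0", "size1"])
print(list(expanded.index))
```

Whether replacing missing sizes with 0 is appropriate depends on what a missing size means in the pipeline, so the `fillna(0)` here is purely illustrative.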

splaisan commented 2 years ago

Stephan, your last edits did the magic: I could now run the two missing gapless processes. Thank you very much for your nice support and great tool.

xhu556 commented 2 years ago


I am testing gapless. When it crashes I get this message to std out: "pipeline crashed : scaffold"

Here is the log file:

$ cat logs/gapless_scaffold.log
0:00:01.674690 Reading in original assembly
0:00:02.363788 Loading repeats
0:00:02.444654 Filtering mappings
Traceback (most recent call last):
  File "/ppq/data1/software/gapless//", line 13326, in <module>
    main(sys.argv[1:])
  File "/ppq/data1/software/gapless//", line 13155, in main
    GaplessScaffold(args[0], args[1], args[2], min_mapq, min_mapping_length, min_length_contig_break, prefix, stats)
  File "/ppq/data1/software/gapless//", line 9068, in GaplessScaffold
    mappings, cov_counts, cov_probs, read_names, read_len = ReadMappings(mapping_file, contig_ids, min_mapq, min_mapping_length, keep_all_subreads, alignment_precision, num_read_len_groups, pdf)
  File "/ppq/data1/software/gapless//", line 432, in ReadMappings
    PlotHist(pdf, "Mapping quality", "# Mappings", mappings['mapq'], threshold=min_mapq, logy=True)
  File "/ppq/data1/software/gapless//", line 262, in PlotHist
    ax.set_yscale('log', nonpositive='clip')
  File "/ppq/data1/software/anaconda3/lib/python3.7/site-packages/matplotlib/axes/", line 3531, in set_yscale
    ax.yaxis._set_scale(value, kwargs)
  File "/ppq/data1/software/anaconda3/lib/python3.7/site-packages/matplotlib/", line 771, in _set_scale
    self._scale = mscale.scale_factory(value, self, kwargs)
  File "/ppq/data1/software/anaconda3/lib/python3.7/site-packages/matplotlib/", line 573, in scale_factory
    return _scale_mapping[scale](axis, **kwargs)
  File "/ppq/data1/software/anaconda3/lib/python3.7/site-packages/matplotlib/", line 253, in __init__
    "{!r}".format(kwargs))
ValueError: provided too many kwargs, can only pass {'basex', 'subsx', nonposx'} or {'basey', 'subsy', nonposy'}. You passed {'nonpositive': 'clip'}

schmeing commented 2 years ago

Hi, thx for reporting it. What version of matplotlib are you using? It appears to have different options for setting axis to log. If it is newer than 3.4.2 I will update the code to get it to work. Otherwise, please update your matplotlib package.
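As a sketch of what a version-tolerant call could look like: to my knowledge, matplotlib renamed the `nonposx`/`nonposy` keywords of the log scale to `nonpositive` in version 3.3, which would match the keywords listed in the error message above. The helper `log_scale_kwargs` is hypothetical and not part of gapless; it only selects the keyword and does not import matplotlib itself.

```python
def log_scale_kwargs(matplotlib_version, mode="clip"):
    """Pick the keyword for handling non-positive values on a log y-axis."""
    # Assumes a plain "major.minor.patch" version string such as "3.5.2"
    major, minor = (int(part) for part in matplotlib_version.split(".")[:2])
    if (major, minor) >= (3, 3):
        return {"nonpositive": mode}
    # Older releases expect the axis-specific keyword instead
    return {"nonposy": mode}


print(log_scale_kwargs("3.5.2"))
print(log_scale_kwargs("3.1.0"))
```

With matplotlib imported, the call would then read `ax.set_yscale("log", **log_scale_kwargs(matplotlib.__version__))`. Updating matplotlib, as suggested above, is the simpler route.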

splaisan commented 2 years ago

Hi, my matplotlib packages are v3.5.2 Thanks for your help

xhu556 commented 2 years ago

my matplotlib packages are v3.5.2 too. Thanks for your help

schmeing commented 2 years ago

I tried to reproduce this issue, but unfortunately I could not. I created a new conda environment using:

conda create -c conda-forge --name gapless python pandas numpy scipy seaborn matplotlib=3.5.2 pillow biopython

However, it runs through perfectly for my data and has no issues in the plotting. Furthermore, I checked the recent documentation of matplotlib and this is still the recommended way of setting the log scale:

Can you provide me with a command to create a conda environment that gives this crash? I tried a few packages, but I did not manage to change versions in a way that reproduces the crash. Thank you.
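One thing that may help when comparing environments is to report which interpreter and which package files are actually loaded at run time. The traceback above shows matplotlib being imported from an anaconda3 python3.7 site-packages directory, which may or may not be the environment the reported version number came from. The helper below is a hypothetical sketch and not part of gapless.

```python
import importlib
import sys


def report_environment(packages=("numpy", "pandas", "matplotlib")):
    """Collect the interpreter path plus version and location of each package."""
    info = {"python": sys.version.split()[0], "executable": sys.executable}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            info[name] = None  # package is not installed for this interpreter
            continue
        info[name] = (getattr(module, "__version__", "unknown"),
                      getattr(module, "__file__", "unknown"))
    return info


if __name__ == "__main__":
    for key, value in report_environment().items():
        print(f"{key}: {value}")
```

Running this with the same interpreter that executes the gapless Python script shows whether the packages come from the intended conda environment or from another installation earlier on the path.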

schmeing commented 2 years ago

Or is this something specific to a single dataset and something goes wrong in a way that is different from the error message that it outputs?

schmeing commented 2 years ago

Since I did not get a reply for a month, I hope this issue is fixed. In case it is not, please reopen.