schmeing / gapless

Gapless provides combined scaffolding, gap-closing and assembly correction with long reads
MIT License

IndexError: Boolean index has wrong length: 115672 instead of 115669 #9

Open grpiccoli opened 1 year ago

grpiccoli commented 1 year ago

Hello @schmeing

I switched the Hi-C scaffolder from YAHS to EndHiC, and the earlier issue with pure deletions went away. Now I get a different error: in `RemoveDuplicates`, the boolean index built from the merge does not have the same length as the `duplications` dataframe.

```
0:00:18.075300 Reading in original assembly
0:00:19.136065 Loading repeats
0:00:37.415395 Filtering mappings
0:01:28.951904 Search for possible break points
1:26:15.808132 Search for possible bridges
1:28:21.136944 Scaffold the contigs
Start 81914
Iteration 1
0:00:05.637656 52211
0:01:57.747884 37318
Iteration 2
0:00:01.071638 36312
0:00:50.240074 34525
Iteration 3
0:00:00.818671 34450
0:00:38.636074 34185
Iteration 4
0:00:00.681951 34171
0:00:33.416415 34125
Iteration 5
0:00:00.592918 34123
0:00:33.210053 34110
Iteration 6
0:00:00.534959 34110
0:00:32.124315 34109
Iteration 7
0:00:00.527677 34109
0:00:32.156766 34108
Iteration 8
0:00:00.526948 34108
0:00:32.149360 34108
RemoveDuplicates 33697
Iteration 1
0:00:00.504654 33697
0:00:35.549221 33442
Iteration 2
0:00:00.667288 33424
0:00:36.403955 33369
Iteration 3
0:00:00.619044 33363
0:00:32.608814 33354
Iteration 4
0:00:00.514491 33353
0:00:32.797004 33349
Iteration 5
0:00:00.504719 33349
0:00:32.317836 33349
PlaceUnambigouslyPlaceables 32781
Iteration 1
0:00:00.500207 32781
0:00:39.784164 32438
Iteration 2
0:00:00.729437 32413
0:00:37.434524 32353
Iteration 3
0:00:00.574755 32351
0:00:36.378086 32343
Iteration 4
0:00:00.494534 32343
0:00:35.192668 32342
Iteration 5
0:00:00.492627 32342
0:00:35.138941 32342
CombineOnMatchingExtensions 29295
TrimAmbiguousOverlap
Traceback (most recent call last):
  File "/nesi/nobackup/vuw03529/bin/test/gapless/gapless.py", line 13362, in <module>
    main(sys.argv[1:])
  File "/nesi/nobackup/vuw03529/bin/test/gapless/gapless.py", line 13189, in main
    GaplessScaffold(args[0], args[1], args[2], min_mapq, min_mapping_length, min_length_contig_break, large_reads, large_contigs, prefix, stats)
  File "/nesi/nobackup/vuw03529/bin/test/gapless/gapless.py", line 9121, in GaplessScaffold
    scaffold_paths, trim_repeats = ScaffoldContigs(contig_parts, bridges, mappings, cov_probs, repeats, prob_factor, min_mapping_length, max_dist_contig_end, prematurity_threshold, ploidy, max_loop_units)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/test/gapless/gapless.py", line 7864, in ScaffoldContigs
    scaffold_paths = TraverseScaffoldGraph(scaffolds, scaffold_graph, graph_ext, scaf_bridges, org_scaf_conns, ploidy, max_loop_units)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/test/gapless/gapless.py", line 7437, in TraverseScaffoldGraph
    scaffold_paths = TrimAmbiguousOverlap(scaffold_paths, scaffold_graph, ploidy)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/test/gapless/gapless.py", line 7254, in TrimAmbiguousOverlap
    scaffold_paths = RemoveDuplicates(scaffold_paths, True, ploidy)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/test/gapless/gapless.py", line 6774, in RemoveDuplicates
    duplications = duplications.loc[duplications.merge(rem_paths, on=['apid','ahap','bpid','bhap'], how='left', indicator=True)['_merge'].values == "left_only", ['apid','ahap']].copy() # Paths that are part of a larger part
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/conda/gapless_mamba/lib/python3.11/site-packages/pandas/core/indexing.py", line 1067, in __getitem__
    return self._getitem_tuple(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/conda/gapless_mamba/lib/python3.11/site-packages/pandas/core/indexing.py", line 1256, in _getitem_tuple
    return self._getitem_tuple_same_dim(tup)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/conda/gapless_mamba/lib/python3.11/site-packages/pandas/core/indexing.py", line 924, in _getitem_tuple_same_dim
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/conda/gapless_mamba/lib/python3.11/site-packages/pandas/core/indexing.py", line 1292, in _getitem_axis
    return self._getbool_axis(key, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/conda/gapless_mamba/lib/python3.11/site-packages/pandas/core/indexing.py", line 1091, in _getbool_axis
    key = check_bool_indexer(labels, key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/conda/gapless_mamba/lib/python3.11/site-packages/pandas/core/indexing.py", line 2571, in check_bool_indexer
    return check_array_indexer(index, result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nesi/nobackup/vuw03529/bin/conda/gapless_mamba/lib/python3.11/site-packages/pandas/core/indexers/utils.py", line 552, in check_array_indexer
    raise IndexError(
IndexError: Boolean index has wrong length: 115672 instead of 115669
```
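
A left merge only returns more rows than its left frame when some key combination matches more than one row in the right frame, so the three extra mask entries suggest that `rem_paths` holds a few repeated `apid`/`ahap`/`bpid`/`bhap` combinations. The sketch below reproduces the same kind of mismatch on toy data; the values are invented for illustration and are not taken from the run above.

```python
import pandas as pd

keys = ['apid', 'ahap', 'bpid', 'bhap']

# Toy stand-ins for the gapless dataframes (invented values).
duplications = pd.DataFrame(
    [[1, 0, 4, 0], [2, 0, 5, 0], [3, 0, 6, 0]], columns=keys)
# The key (2, 0, 5, 0) appears twice on the right-hand side.
rem_paths = pd.DataFrame(
    [[2, 0, 5, 0], [2, 0, 5, 0]], columns=keys)

merged = duplications.merge(rem_paths, on=keys, how='left', indicator=True)
mask = merged['_merge'].values == 'left_only'

# The repeated right-hand key duplicates one left row, so the mask is
# one element longer than the frame it is supposed to index.
print(len(duplications), len(mask))

try:
    duplications.loc[mask, ['apid', 'ahap']]
    error_message = None
except IndexError as err:
    error_message = str(err)
print(error_message)
```

Running something like `rem_paths.duplicated(subset=keys).sum()` on the real data should show whether this is what happens here.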

I can bypass this error by filtering the merged dataframe itself, so the boolean index always has the matching length:

```python
# Merge the dataframes and keep track of where the data came from
merged_df = duplications.merge(rem_paths, on=['apid','ahap','bpid','bhap'], how='left', indicator=True)
# Filter based on the '_merge' column
duplications = merged_df.loc[merged_df['_merge'] == 'left_only', ['apid','ahap']].copy()
```
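
One way to check the bypass is to compare it against a variant that first makes the right-hand keys unique, so that the merge returns exactly one row per row of `duplications` and the original `.loc` selection fits again. The sketch below does that on invented toy data. The frames are stand-ins, not real gapless output, and `unique_rem`, `bypass` and `original_style` are names made up for this example.

```python
import pandas as pd

keys = ['apid', 'ahap', 'bpid', 'bhap']

# Invented toy data; the right-hand key is repeated on purpose.
duplications = pd.DataFrame(
    [[1, 0, 4, 0], [2, 0, 5, 0], [3, 0, 6, 0]], columns=keys)
rem_paths = pd.DataFrame(
    [[2, 0, 5, 0], [2, 0, 5, 0]], columns=keys)

# Bypass: filter the merged frame with its own '_merge' column.
merged_df = duplications.merge(rem_paths, on=keys, how='left', indicator=True)
bypass = merged_df.loc[merged_df['_merge'] == 'left_only', ['apid', 'ahap']].copy()

# Variant: deduplicate the right-hand keys, then index `duplications`
# with the mask as the original line does.
unique_rem = rem_paths[keys].drop_duplicates()
mask = duplications.merge(
    unique_rem, on=keys, how='left', indicator=True)['_merge'].values == 'left_only'
original_style = duplications.loc[mask, ['apid', 'ahap']].copy()

# Compare the selected rows while ignoring the index labels.
same_values = bypass.reset_index(drop=True).equals(
    original_style.reset_index(drop=True))
print(same_values)
print(list(bypass.index), list(original_style.index))
```

On this toy input both approaches select the same rows, because only matched rows can be duplicated by the merge and those are never `left_only`. The index labels differ, though: the bypass carries the fresh index of `merged_df`, while the original line keeps the labels of `duplications`. Whether later steps in gapless rely on those labels is something this sketch cannot tell.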

Could that produce any other issues down the pipeline?