kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License
92 stars 11 forks source link

Bug in re-mapping? #48

Closed johannesgeibel closed 2 years ago

johannesgeibel commented 2 years ago

Hi, I am currently testing dysgu on some chicken samples. For some samples, sequenced with 2*151bp paired-end Illumina reads, I encounter an error if --remap=True:

...
2022-08-09 13:41:43,692 [INFO   ]  Number of matching SVs from --sites 101804
Traceback (most recent call last):
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/dysgu/main.py", line 444, in call_events
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1337, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 1192, in dysgu.cluster.pipe1
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/dysgu/re_map.py", line 427, in remap_soft_clips
    gstart, ref_seq_big, idx)
  File "/home/uni08/geibel/chicken/chicken_sv/test/testDysgou/.snakemake/conda/023852c9410a500a0b2051e5e324c328/lib/python3.7/site-packages/dysgu/re_map.py", line 279, in process_contig
    e.ref_seq = ref_seq_clipped[500 - 1]
IndexError: string index out of range

The chicken reference genome actually has some small contigs < 500bp, but I'm not sure whether ref_seq_clipped holds the reference contig. Further, I would expect the error then for all samples when force-calling, but it appears only in 4 out of 6 test samples. Do you have any clue whether this could cause the problem? Thanks, Johannes

kcleal commented 2 years ago

Hi @johannesgeibel, I think you are probably correct, the error looks like it is caused by of those short reference contigs. I have modified the function so it skips the re-mapping step for short reference contigs < 1kb::

        if ref_genome.get_reference_length(e.chrA) < 1000 or ref_genome.get_reference_length(e.chrB) < 1000:
            ....
            continue

Is this a satisfactory fix for you? I guess the remapping could also be tried for short reference contigs also, but its unclear how robust it would be for very short reference contigs anyway. I have uploaded the current fix, but you will have to build from source whilst I get the patch uploaded to pypi.

johannesgeibel commented 2 years ago

Hi @kcleal, thanks. The fix seems to solve the issue. Excluding the very small contigs should not cause any larger trouble, as they are anyway disregarded in most cases. It's just common practice to keep them in the mapping process to reduce mismappings.