waveygang / wfmash

base-accurate DNA sequence alignments using WFA and mashmap3
MIT License

Incorrect CIGAR string generation in versions 0.16 through 0.21 #270

Open GRGong opened 1 month ago

GRGong commented 1 month ago

Dear wfmash developers,

I've identified an issue with CIGAR strings in PAF files generated by wfmash versions 0.16 and later. This problem appears to be related to the inversion patching feature introduced in v0.16.

Key points:

Example error (using rustybam break-paf -m 5000): toy.zip

thread 'main' panicked at src/paf.rs:71:43: called Result::unwrap() on an Err value: PafParseCigar { msg: "query bases 4000 from cigar does not equal 59000-55354=3646\nCM055321.1\t82983525\t55354\t59000\t-\tscaffold_1\t182733053\t172879498\t172883499\t3627\t4126\t9\tid:Z:\tcg:Z:3X1=2I1=3X1=1X2=1X2=1X1=1I1=1X1=1X1=1X1=2X2=1X3=2X1=1X1=1X1=1D2=1X3=1I1X1=1X2=1X1=59I3=2X1=1X2=2I2=1X1=1X1=1X2=1X3=3X1=2X1=2X1=1X2=3I1X1=2X3=1X4=1X3=1X1=1X1=1I1X2=4X1=1X2=1X1=1X7D2=1X1=2X4=1X1=1X3=1X3=1X3=4X1=1X4=1X1=1X1=1X1=5X8=1X2=2D1=2X1=1X1=5X1=1X2=1D1X5=1X1=2X1=1X2=1X1=1X3=1X1=78D38=20I110=3D2=2X18=1X67=1X419=1X20=3I367=1X113=1X63=1X82=1X332=1I17=1D84=1X32=1X161=1X25=1X123=1X225=1X157=3D21=1X24=1X282=2I278=1X214=1X46=1X3=5D3=1X2=1X1=2I1X3=1X1=1X2=4D1=2X1=3X1=1X4=1X1=4D1X1=2X3=1X2=2D1=1X1=1X2=1X1=1X2=1X2=2I1=1X2=2X2=1X5=1X4=3D1=1X1=2X1=1X3=1X2=6D3=3X4=3X3=3X1=1X1=2X1=3X1=1X1=1X1=2X1=3X2=2X1=1X1=2X2=1X1=2X3=1X2=1X1=4X1=1X1=2I3=1I1=1X3=2X2=1X1=2D1X2=3X2=3X1=2X1=3X3=1X1=1X3=2X2=3X4=1X2=1D3=3D1X4=2X1=1X1=2X3=1X3=3X1=2X1=4X1=1X4=2I1=1X2=3X1=1X2=2I1X1=1X4=1I2X1=1X1=1X1=2X3=1X2=1X1=1X2=11I1X4=3X3=1X2=1X1=2X2=1X1=1X1=7I1X3=1X\n" }
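For anyone who wants to reproduce the check without rustybam, the consistency test behind the message above can be sketched in a few lines of Python. This is a minimal illustration, not rustybam's actual implementation: the function names `cigar_spans` and `check_paf_line` are hypothetical, and it assumes the `cg:Z:` tag only uses the `M`, `=`, `X`, `I` and `D` operations (no clipping), with standard PAF column order.

```python
import re

# One CIGAR operation: a length followed by an op code.
# Assumption: only M, =, X, I and D appear in the cg:Z: tag.
CIGAR_OP = re.compile(r"(\d+)([MID=X])")
CIGAR_FULL = re.compile(r"(?:\d+[MID=X])+")

CONSUMES_QUERY = set("M=XI")   # ops that advance the query
CONSUMES_TARGET = set("M=XD")  # ops that advance the target


def cigar_spans(cigar):
    """Return (query_bases, target_bases) consumed by a CIGAR string."""
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar[:40]!r}")
    query = target = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in CONSUMES_QUERY:
            query += n
        if op in CONSUMES_TARGET:
            target += n
    return query, target


def check_paf_line(line):
    """Return a list of mismatches between PAF coordinates and the cg:Z: tag."""
    fields = line.rstrip("\n").split("\t")
    q_start, q_end = int(fields[2]), int(fields[3])
    t_start, t_end = int(fields[7]), int(fields[8])
    # Optional tags start at column 13; pick the CIGAR tag if present.
    cigar = next((f[5:] for f in fields[12:] if f.startswith("cg:Z:")), None)
    if cigar is None:
        return ["no cg:Z: tag"]
    q_bases, t_bases = cigar_spans(cigar)
    problems = []
    if q_bases != q_end - q_start:
        problems.append(
            f"query bases {q_bases} != {q_end}-{q_start}={q_end - q_start}")
    if t_bases != t_end - t_start:
        problems.append(
            f"target bases {t_bases} != {t_end}-{t_start}={t_end - t_start}")
    return problems


# Toy records (made up for illustration, not taken from toy.zip).
good = "q1\t100\t10\t17\t+\tt1\t200\t50\t57\t5\t8\t60\tcg:Z:3=1X1I2=1D"
bad = good.replace("\t17\t", "\t16\t")
print(check_paf_line(good))  # []
print(check_paf_line(bad))   # ['query bases 7 != 16-10=6']
```

Running `check_paf_line` over every line of a PAF file gives a quick way to count affected records before and after switching wfmash versions.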

Steps to reproduce:

This issue does not occur with wfmash v0.15.

Could you please investigate this CIGAR string inconsistency? It would be helpful to understand if this is a bug or if there have been changes in the CIGAR string format that need to be addressed in downstream tools.

Thank you for your attention to this matter.

Best regards, Gaorui

ekg commented 1 month ago

The next release will resolve this. Thanks!

ekg commented 1 month ago

Does the current main HEAD resolve this issue? I've now added integration tests of PAF correctness using https://github.com/ekg/pafcheck, which should be equivalent to the existing SAM correctness checks.

GRGong commented 1 month ago

Thank you for the quick response! Unfortunately, I am working on a cluster that lacks some necessary libraries, and I am unable to compile wfmash from source. Would it be possible for you to provide a precompiled binary of wfmash?

baozg commented 1 month ago

@GRGong You could change the Dockerfile to build wfmash from HEAD and create a Docker image from it. If you don't have root access, the Singularity remote builder might help (https://cloud.sylabs.io/builder).

ekg commented 1 month ago

@GRGong here's a wfmash binary. I should probably make a release, but I prefer to do that once you've confirmed that this resolves the issues you're seeing. If not, we should resolve it and add some automated tests to prevent future problems. Right now I'm testing the SAM, PAF, and MAF conversion steps using GitHub Actions.

Just gunzip and make sure it's executable: wfmash-v0.21.0-38-gb731e41.gz

GRGong commented 4 weeks ago

@ekg Thanks for the binary. I tested the provided binary using my own genomes, but it still has the CIGAR problem. For your reference, I’ve uploaded the query and target FASTA files, along with the command I used and the error log.

Here is the link: https://drive.google.com/file/d/18MzFalZhVnKt-hTfTxmxI2KsdZxh6Zsf/view?usp=sharing

Note: The two genomes belong to divergent insect species, but they are still in the same subfamily. The previous version, wfmash v0.15, worked without issues.