mummer4 / mummer

Mummer alignment tool
Artistic License 2.0

defective out.sam format? #210

Closed yaximik closed 6 months ago

yaximik commented 7 months ago

I successfully installed mummer-4.0.0rc1 in a conda environment on a Scientific Linux 7.9 box, then ran an alignment of hg38 against the gorGor6 reference assembly:

$ nucmer -t 16 --sam-short=./Gor6-hg38_short.sam gorGor6.fa hg38.fa

After ~8 hours and ~59 GB of memory, I got a 761.8 MB SAM file. However, further steps with samtools 1.20 failed:

$ samtools view -bT gorGor6.fa Gor6-hg38_short.sam
[main_samview] fail to read the header from "Gor6-hg38_short.sam"

$ samtools view -H Gor6-hg38_short.sam
[main_samview] fail to read the header from "Gor6-hg38_short.sam"

What could be the problem?

yaximik commented 6 months ago

It turns out that this problem is a long-standing issue:

This broken output is BTW a long-standing mummer bug: see [mummer4/mummer#24](https://github.com/mummer4/mummer/issues/24) and https://github.com/mummer4/mummer/blob/master/src/umd/nucmer_main.cc#L107-L108

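Many thanks to John Marshall for helping to find the compliance issue in the header of the out.sam produced by nucmer: the header fields are separated by spaces instead of tabs, and the @HD version field is written as VN1.0 instead of VN:1.0. This turned out to be an easy fix: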

In mummer-4.0.0rc1/src/umd/nucmer_main.cc, change

106    if(args.sam_short_given || args.sam_long_given) {
107      os << "@HD VN1.0 SO:unsorted\n"
108         << "@PG ID:nucmer PN:nucmer VN:4.0 CL:\"" << cmdline << "\"\n";
109    } else {

to

106    if(args.sam_short_given || args.sam_long_given) {
107      os << "@HD\tVN:1.0\tSO:unsorted\n"
108         << "@PG\tID:nucmer\tPN:nucmer\tVN:4.0\tCL:" << cmdline << "\n";
109    } else {

Then recompile as recommended. Now out.sam is compatible with all samtools versions.
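
For anyone who wants to check a header without samtools, here is a minimal Python sketch. It assumes the header-line pattern given in the SAM specification (an `@` record type, then tab-separated `TAG:VALUE` fields, with `@CO` comment lines treated separately). The function names `bad_header_lines` and `check_sam` are made up for this example and are not part of mummer or samtools.

```python
import re

# Header-line patterns as I understand them from the SAM specification:
# a record type followed by one or more TAB-separated TAG:VALUE fields.
HEADER_RE = re.compile(r"^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$")
COMMENT_RE = re.compile(r"^@CO\t.*$")


def bad_header_lines(lines):
    """Return (line_number, text) for every header line that fails the pattern."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.startswith("@"):
            break  # the header ends at the first alignment record
        if not (HEADER_RE.match(text) or COMMENT_RE.match(text)):
            bad.append((number, text))
    return bad


def check_sam(path):
    """Hypothetical helper: run the check on a SAM file on disk."""
    with open(path) as handle:
        return bad_header_lines(handle)
```

With the unpatched nucmer header, `bad_header_lines` should flag both the `@HD` and the `@PG` line, because neither contains a tab after the record type. It stops reading at the first alignment record, so it is cheap even on a large file.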

$ nucmer --maxmatch --threads=4 --sam-short=./Sgor-hum.sam gor-mt.fa rCRS.fa
$ samtools view -H Sgor-hum.sam
@HD VN:1.0  SO:unsorted
@PG ID:nucmer   PN:nucmer   VN:4.0  CL:"nucmer --maxmatch --threads=4 --sam-short=./Sgor-hum.sam gor-mt.fa rCRS.fa"
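
For SAM files already written by an unpatched build (like the 8-hour run above), rerunning the alignment should not be necessary, since only the header lines are affected. Below is a rough Python sketch that rewrites the space-separated `@HD` and `@PG` lines with tabs and copies everything else unchanged. It assumes the broken header looks exactly like the two lines in the original nucmer_main.cc code. `fix_header_line` and `repair_sam` are names invented for this example, and any quotes around the command line are left as they are.

```python
import re


def fix_header_line(line):
    """Turn a space-delimited nucmer @HD/@PG line into a tab-delimited one."""
    if not line.startswith(("@HD ", "@PG ")):
        return line  # already tab-delimited, or not a line this sketch handles
    ending = "\n" if line.endswith("\n") else ""
    body = line.rstrip("\n")
    # The CL: value is the command line and contains spaces, so split it off first.
    head, sep, cmdline = body.partition(" CL:")
    fields = head.split()
    # Add the colon missing from the version field: "VN1.0" -> "VN:1.0".
    fields = [re.sub(r"^VN(?=\d)", "VN:", field) for field in fields]
    fixed = "\t".join(fields)
    if sep:
        fixed += "\tCL:" + cmdline
    return fixed + ending


def repair_sam(src_path, dst_path):
    """Copy src_path to dst_path, repairing only the leading header lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        in_header = True
        for line in src:
            if in_header and line.startswith("@"):
                dst.write(fix_header_line(line))
            else:
                in_header = False  # alignment records are copied untouched
                dst.write(line)
```

Something like `repair_sam("Gor6-hg38_short.sam", "Gor6-hg38_short.fixed.sam")` would then give a file whose header can be checked with `samtools view -H`. The sketch writes to a new file rather than editing in place, so the original output stays available if the result looks wrong.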