haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License
184 stars 19 forks source link

SAM output broken #6

Open dawe opened 3 years ago

dawe commented 3 years ago

Hi all, I understand SAM output is not optimized and it is discouraged, however I had to test it and I run into some issues. First of all, output is uncompressed and I couldn't pipe into samtools to obtain a bam file (to save space). However, once SAM is generated I have

$ samtools flagstat test.sam
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped
1509410 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1509401 + 0 mapped (100.00% : N/A)
1509410 + 0 paired in sequencing
754705 + 0 read1
754705 + 0 read2
0 + 0 properly paired (0.00% : N/A)
1509401 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
1509401 + 0 with mate mapped to a different chr
1509401 + 0 with mate mapped to a different chr (mapQ>=5)

Strangely, all pairs are mapping on different chromosomes. More interestingly, when I convert to BAM:

$ samtools view -bo test.bam test.sam
[E::bam_write1] Positional data is too large for BAM format
samtools view: writing to "test.bam" failed

but then

$ samtools quickcheck test.bam 
$ echo $?
0

which makes it "compliant". But then, again,

$ samtools flagstat test.bam
717631 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
717631 + 0 mapped (100.00% : N/A)
717631 + 0 paired in sequencing
358801 + 0 read1
358830 + 0 read2
0 + 0 properly paired (0.00% : N/A)
717631 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
717631 + 0 with mate mapped to a different chr
717631 + 0 with mate mapped to a different chr (mapQ>=5)

Half of the reads disappeared from the alignment

haowenz commented 3 years ago

Thanks for testing. For us to reproduce the problem, can you share the test data and the command lines you used to run Chromap?

dawe commented 3 years ago

Hello, you can find example sequences here

I aligned to hg38 genome like this

chromap -r hg38.fa -x hg38.index -1 Sample_6_BC_AGGCTCCG_READ1.fq.gz -2 Sample_6_BC_AGGCTCCG_READ2.fq.gz --SAM -o test2.sam
mourisl commented 3 years ago

We just made an fix for the SAM format wich should resolve your issue of "Positional data too large". To pipe the output to samtools, you could specify the output file as /dev/stdout, and then pipe that to samtools. For the samtools flagstat issue, I think it is because Chromap does not report the mate chromosome and coordinate in the SAM format, so flagstat would regard them on different chromosomes. Does your analysis depend on this? Thank you.

dawe commented 3 years ago

We just made an fix for the SAM format wich should resolve your issue of "Positional data too large".

Excellent, I tried and everything seems to be solved, thanks

To pipe the output to samtools, you could specify the output file as /dev/stdout, and then pipe that to samtools.

Yep, named pipes work as well. I naively expected "no option" or -o - to work

For the samtools flagstat issue, I think it is because Chromap does not report the mate chromosome and coordinate in the SAM format, so flagstat would regard them on different chromosomes. Does your analysis depend on this? Thank you.

Good question. In the future I may be interested in calling some kind of variants (including structural ones) from these alignments, not only quantify the fragment abundance. I understand that chromap has been specifically designed for epigenomics, hence quantitative signals along the genome. A fully compliant SAM file would help, but I guess bwa-mem2 will be more than enough for now. Or maybe samtools fixmate

Maarten-vd-Sande commented 2 years ago

Just want to reiterate that compliant SAM would be nice :innocent:

NMaziak commented 2 years ago

Hello, I was wondering if there has been any solution to resolve chromap not reporting the mate chromosome and causing "mapping on different chromosome"? For the analysis I want to run a proper bam is preferred, and some packages then don't recognise paired-end data as paired.

haowenz commented 2 years ago

I know SAM is perhaps still the format that is used most of the times. But SAM file is large and would be slow to generate. That's why Chromap tries to skip generating SAM and get to the final results in BED or pairs.

We will improve SAM output gradually. It will take some time.

akundaje commented 1 year ago

Hi there. Is there an ETA on fixing the SAM output issue? We are considering using Chromap in a large NIH consortium pipeline but cannot use it if it does not support a valid SAM/BAM output format. We would really appreciate a fix.

mourisl commented 1 year ago

Hi @akundaje , we have resolved the mate pair issue of the SAM output in the version v0.2.4, and were tested in cases like #124. Are there any other issues with the SAM output you have in mind? Thank you for supporting Chromap.

akundaje commented 1 year ago

Oh ok that is great news. We will test it out and get back to you with any issues.