sc932 / ALE

Assembly Likelihood Estimator
Other
32 stars 7 forks source link

PacBio reads #12

Closed JohnUrban closed 6 years ago

JohnUrban commented 7 years ago

Hi,

Thanks for ALE -- I know you have moved on to other stuff, so feel free to ignore this -- but it will be quick to answer I think.

How do you recommend PacBio reads be mapped for use with ALE? I have done:

bwa mem -t $MTHREADS -M -x pacbio $BWAIDX $FASTQ

Thanks,

John

JohnUrban commented 7 years ago

In case you or someone ever does develop ALE more, I found this error with a set of long noisy reads mapped as above:

ALE ${BAM} ${REF} ${BASE}.ALE.txt >> ${BASE}.ale.err
.
.
.

ALE: ALEhelpers.c:113: getQtoLogPMiss: Assertion `idx >= 0 && idx < 63' failed.
/var/spool/slurmd/job15141159/slurm_script: line 7: 20985 Aborted                 ALE ${BAM} ${REF} ${BASE}.ALE.txt >> ${BASE}.ale.err

If it helps, the MAPQ range in that BAM file is 0-60.

robegan21 commented 7 years ago

So ALE with PacBio reads does work fine with any mapper that makes a proper BAM file. BWA should be a fine choice. However, this assertion that you uncovered would indicate that your BAM file is improper. I would not expect PacBio reads to have a Q score <0 or > 63, especially for PacBio with its relatively higher error rate which should really cap out at well below 30, even for consensus sequences. A valid Q score is very important for the execution of ALE.

JohnUrban commented 7 years ago

Thank you. I had some nanopore reads I used as well. Though nanopore reads are also noisy, there are small stretches that have base qualities up into the 90s. There are 2 approaches I tried. (1) Re-writing the fastqs with any Q>62 converted to 62, then mapping with BWA and running the normal ALE (I suppose I could have modified the BAM instead). (2) Modifying ALEhelper.c and ALEhelper.h to allow Q up to 99 to run on those BWA-mapped BAM files. I have not yet done a thorough comparison -- but at first glance it seems like both approaches give similar results.

What do you think?

JohnUrban commented 7 years ago

btw @robegan21 -- is there any chance we met at one of the ONT conferences?

JohnUrban commented 7 years ago

I somewhat have it working for long reads. If I run it on the nanopore reads with the defaults for realign, all goes well -- but when I use the suggested parameters for realign for pacbio reads, it raises this error:

Alignment score and position are not consensus. score: 15212 max: 15194, band_width: 5336, width_d: 10673, readLen: 36549, s2: -2147483648
/var/spool/slurmd/job15317951/slurm_script: line 9: 13457 
Segmentation fault      ALE --realign=1,5,2,1,20 ${BAM} ${REF} ${BASE}.ALE.txt >> ${BASE}.ale.err

update: actually I got the same error with PacBio reads:

Alignment score and position are not consensus. score: 2767 max: 2758, band_width: 9910, width_d: 19821, readLen: 18764, s2: -2147483648
/var/spool/slurmd/job15147081/slurm_script: line 9: 16041 Segmentation fault      ALE --realign=1,5,2,1,20 ${BAM} ${REF} ${BASE}.ALE.txt >> ${BASE}.ale.err

Any insight welcome. Thanks again.

Update -- using print statements, it appears that after the if statement at line 584 in ssw.c (https://github.com/sc932/ALE/blob/master/src/ssw.c#L584) is executed:

if (s2 < 0 || s1 < 0) {
            fprintf(stderr, "Alignment score and position are not consensus. score: %d max: %d, band_width: %d, width_d: %d, readLen: %d, s2: %d\n", score, max, band_width, width_d, readLen, s2);
            break;
        }

the program continues up until the while loop at line 650 (ssw.c: https://github.com/sc932/ALE/blob/master/src/ssw.c#L650). It enters the while loop and stays in it for a while before the segmentation fault occurs. Here are some values of relevant variables if it is of any help:

score: 15212 max: 0, band_width: 2668, width_d: 10673, readLen: 36549, s2: -2147483648, i: 18273, j: 2669, e: 1, l: 21200, f: 0, temp2: 2

It does not appear to get past this while loop as a print statement inserted between lines 698 and 699 (https://github.com/sc932/ALE/blob/master/src/ssw.c#L698) is never executed.

robegan21 commented 7 years ago

Hi John,

Can you send me a snippet from your bam file and reference, so I can try to reproduce? In the past every time I found one of these assertions violated, the aligner used had a bug of some sort, but this assertion is in the SmithWaterman library that we use and looks like an int32 overflow in that code -- I don't want to dive into there at the moment.

The only reason why ALE would use the SmithWaterman is if you called with the (non-default) --realign option. By default ALE takes the alignment from the CIGAR string from the original aligner... So can you try with out --realign?

Also I still maintain that any Q score over 30 for ONT is an error in the aligning application.. it is no where near that accurate... Are you sure that there is no Qscore base mixup? I.e. fastq is encoded in Illumina's standard breaking 64 but the aligner is interpreting it to be is the standard 33 encoding? ALE very much relies on the Q scores to closely correlate with the observed error rate.

-Rob (and yes, I remember you from one of the ONT conferences. I'll be at LC2017 too)

JohnUrban commented 7 years ago

Hi Rob,

Thanks for getting back to me. I wasn't sure if the re-align step was always performed or not since it was described as having default parameters (I assumed that meant it was a default step). Since you say it is only performed if it is called -- then I can confirm that ALE works fine when I do not specify --realign. That is good news to me. I will just use ALE on the long reads without specifying --realign. If you still want some data to reproduce what I was seeing, then let me know.

When you say, "any Q score over 30 for ONT is an error in the aligning application", do you mean that you think the aligner is changing the quality score string? These quality scores are also in the fastq files. I have data from the very early days of MAP: back then the quality scores assigned to bases were mostly low (as you'd expect) with these extreme outliers. I believe this was just an artifact of how they computed qualities at the time. I can share some fastqs with this if you'd like. Maybe you will be able to correct me -- which would be welcome of course.

Best,

John p.s. Enjoy LC2017. I have missed the last few conferences and likely will miss this one as well as I am in a transitioning period (defended thesis recently).

robegan21 commented 6 years ago

closing this old ticket. My comment is that I have never seen an ONT run where bases with a Q score that high without some 3rd party correction / polishing tool, and even then would not trust a Q score that high to be the correct base 99.9% of the time...

StarburstCLA commented 2 years ago

To add to the above discussion. PacBio reads in their raw form don't have Q scores as far as I understand (not checked a file personally may well be wrong) However PacBio HiFi reads, which are heavily self-aligned in the normal pipeline to eliminate errors now have a Q score of 0 - 93 often reaching the top limits. https://en.wikipedia.org/wiki/FASTQ_format

To get them to play with ALE I had to put in a limit on the fastq Q Scores to be in a 0-62 range also. Before performing the alignment used for ALE.

I used the following which is a script in bbmap but others are available.

reformat.sh -Xmx8g qin=33 maxcalledquality=62 in=reads.fasta.gz out=reads.fastq.gz