raja-appuswamy / accel-align-release

A fast seed-embed-extend based sequence mapper and aligner
MIT License

MAPQ calculation question #11

Open holtjma opened 3 years ago

holtjma commented 3 years ago

Hello,

I was testing accelalign performance, both in terms of speed and downstream accuracy. It seems to perform quite fast, but I'm finding issues with downstream analysis. Some of the tools I'm using (e.g. manta) are simply throwing errors that seem to be related to the outputs from accelalign:

[2021-07-24T01:06:12.111652Z] [hpc0019] [27174_1] [WorkflowRunner] [ERROR] [2021-07-24T01:06:04.247057Z] [hpc0019] [27174_1] [getAlignmentStats_generateStats_000] std::exception::what: Too few high-confidence read pairs (0) to determine pair orientation for read group '' in bam file '<path>/accelalign-20210723/HALB3002753.bam'
[2021-07-24T01:06:12.111652Z] [hpc0019] [27174_1] [WorkflowRunner] [ERROR] [2021-07-24T01:06:04.247993Z] [hpc0019] [27174_1] [getAlignmentStats_generateStats_000]  At least 100 high-confidence read pairs are required to determine pair orientation.
[2021-07-24T01:06:12.111652Z] [hpc0019] [27174_1] [WorkflowRunner] [ERROR] [2021-07-24T01:06:04.249032Z] [hpc0019] [27174_1] [getAlignmentStats_generateStats_000]  Total sampled reads: 1168049539
[2021-07-24T01:06:12.111652Z] [hpc0019] [27174_1] [WorkflowRunner] [ERROR] [2021-07-24T01:06:04.249868Z] [hpc0019] [27174_1] [getAlignmentStats_generateStats_000]  Total sampled paired reads: 1168049539
[2021-07-24T01:06:12.111652Z] [hpc0019] [27174_1] [WorkflowRunner] [ERROR] [2021-07-24T01:06:04.251055Z] [hpc0019] [27174_1] [getAlignmentStats_generateStats_000]  Total sampled paired reads passing MAPQ filter: 1159260736
[2021-07-24T01:06:12.111652Z] [hpc0019] [27174_1] [WorkflowRunner] [ERROR] [2021-07-24T01:06:04.252130Z] [hpc0019] [27174_1] [getAlignmentStats_generateStats_000]  Total sampled high-confidence read pairs passing all filters: 0

Other tools will run, but produce results that are not great, so I'm wondering if there's something missing in the outputs.

I noticed the MAPQ scores seem to be lower than those from other aligners I've tested. Is there some difference in the score calculation? Or is there something incomplete in the SAM read pair outputs that might trigger this?

raja-appuswamy commented 3 years ago

Hello, and thank you for trying out the aligner and for the feedback.

As you noticed, AccelAlign does produce lower MAPQ values; there is indeed a difference in the calculation (more precisely, the metric used to score the best and second-best alignments differs between AccelAlign and other aligners). This makes AccelAlign's MAPQ estimation more conservative. We have tested variant calling with HaplotypeCaller without any trouble, but not with Manta. As you suggested, it is possible that Manta's strict quality filtering removes reads mapped by AccelAlign. This seems to be a "limitation"/design aspect of Manta described here (https://github.com/Illumina/manta/issues/104).
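For context, best-vs-second-best MAPQ schemes generally shrink the score as the runner-up alignment approaches the best one. The sketch below is illustrative only: the function name, the normalization by the maximum possible score, and the cap of 60 are my assumptions, not AccelAlign's actual formula.

```python
def mapq_from_scores(best, second_best, max_score, cap=60):
    """Toy best-vs-second-best MAPQ estimate (NOT AccelAlign's formula).

    The closer the second-best candidate scores to the best one, the
    less confident the mapping, so MAPQ shrinks toward 0. A unique hit
    (no second-best candidate) gets the cap.
    """
    if second_best is None:
        return cap  # unique hit: maximal confidence
    # Normalized margin between best and second-best alignment scores.
    gap = max(best - second_best, 0) / max_score
    return min(cap, int(round(cap * gap)))
```

Different choices of scoring metric (e.g. embedding distance vs. alignment score) change the `gap` term, which is enough to make one aligner's MAPQ systematically more conservative than another's.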

Could you perhaps share some details about the alignment? In particular,

  1. Did you compare AccelAlign with other aligners in terms of mapping? Does it map significantly fewer reads? Did you check whether the mapping locations are concordant?
  2. How long are your reads?
  3. We can provide a version of AccelAlign that performs more traditional MAPQ estimation based on alignment scores. Would you be interested in testing it?
holtjma commented 3 years ago

Yea, happy to share a bit more info.

  1. I'm looking at end-to-end performance. We benchmark human datasets using the GIAB samples, so I'm looking at aligner/caller pairings (including SV/CNV callers when a benchmark is available). accel-align really only performed "well" (i.e. >99% recall/precision) when paired with the octopus variant caller. When paired with strelka2, the results were not great (~98%/97% recall/precision). When paired with clair3, it just threw an error and didn't report anything. For context, all three of these callers have historically worked quite well with sentieon (i.e. bwa-mem). As noted, manta also threw an error.
  2. These are 151bp PCR-free, paired-end Illumina reads.
  3. Sure, I can do that if it's available. I'll have to review the install again, as I remember it being a bit non-trivial, but it should be testable.
raja-appuswamy commented 3 years ago

Thank you, Matt. We would like to test this on our side before giving you an update. We are looking into testing other variant callers with the NA12878 whole-genome sequencing data (accession ERR194147, https://www.ebi.ac.uk/ena/browser/view/PRJEB3381?show=reads). Would this be similar to your dataset? It would help us a lot if you could point us at the exact data you used, if possible and if it is publicly available, so we can reproduce the result.

Also please let us know if you did any post processing between alignment and variant calling.

holtjma commented 3 years ago

In general, it should be similar. We're using NA12878 (HG001) along with HG002-HG005 and the corresponding GIAB benchmark truth sets. Currently, these are samples we've sequenced internally for clinical validations. However, if you have trouble replicating the issue, I can ask about sharing them with you for testing.

Here are the steps we're currently doing; I can elaborate with exact commands if necessary:

  1. For each read pair, align with accelalign
  2. Pipe into samtools addreplacerg to get read group information in the output
  3. Pipe into bwa-kit postalt processing
  4. Pipe into sentieon sort to get a sorted BAM file
  5. Merge all the BAM files with sentieon LocusCollector and Dedup steps (we would normally do BQSR recalibration at this point, but I didn't add that step in yet)
  6. Provide that BAM to the various variant calling softwares
  7. Provide the output VCF to RTG vcfeval to get benchmark results
yanlolo commented 3 years ago

hey Matt,

Thank you very much for your feedback. We tested our aligner with several of the variant callers you suggested, and we have added some new features to our code to fix certain aspects:

  1. Read name cleaning: long read names broke clair3, so we now extract and report only a part of the read name.
  2. We found that variant callers other than the GATK HaplotypeCaller (particularly Manta) had issues because we did not support soft clipping. We now support v1.0 of soft clipping; it is not enabled by default, but you can enable it with the -s flag.
  3. As Raja mentioned earlier, MAPQ is by default based on embedded hamming distance, which led to conservative MAPQ values. We have now added an option to base the MAPQ calculation on alignment score instead, enabled with the -a flag.

We also ran some quick tests on the NA12878 dataset (the 76bp dataset from the paper), using the default parameters of each variant caller. Here's a high-level summary of what we saw:

For HaplotypeCaller: BWA-MEM: SNP 0.959, indel 0.959; Accel-align: SNP 0.920, indel 0.912

For Octopus: the VCF file generated by octopus is VCF v4.3, which is not yet supported by vcftools, so we cannot split SNPs/indels and instead compute the f-score over all variants: BWA-MEM: 0.944; Accel-align: 0.940

For strelka2: BWA-MEM: SNP 0.772, indel 0.272; Accel-align: SNP 0.778, indel 0.256

For manta: manta focuses on SVs and indels, not SNPs, so we only report the f-score for small indels here: BWA-MEM: 0.966; Accel-align: 0.970. Note: please enable soft clipping with -s, otherwise manta may not work.

For Clair3: first, clair had some issues with long read names; we have updated our code to solve this, so it should now work. However, clair seems to be very sensitive to alignment errors. Unlike HaplotypeCaller, clair seems to crash if a misaligned read leads to a conflicting variant call. We found that clair works if we filter aligned reads based on alignment score. With that filter, the results are: BWA-MEM: SNP 0.820, indel 0.879; Accel-align: SNP 0.824, indel 0.791

As you can see, Accel-Align is pretty close to BWA-MEM in f-score for SNPs with most callers. It's worse for indels under strelka2 and Clair3, but not under Manta or HaplotypeCaller. Of course, the actual f-scores vary across variant callers, but the relative results seem consistent.
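For reference, the f-scores quoted above are the harmonic mean of precision and recall (the F1 score reported by tools like RTG vcfeval); a minimal helper, with the function name being my own:

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, a small drop in either precision or recall pulls the f-score down noticeably, which is why the per-caller differences above are worth tracking separately for SNPs and indels.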

Could you perhaps try rerunning AA with your dataset now with the following recommendations:

  1. You can run the AA container directly (docker run -it rajaappuswamy/accel-align). We have updated it.
  2. Manta: enable soft clipping with -s. If you still get the low-MAPQ error, you could also try the -a flag to use AS-based MAPQ.
  3. Clair3: filter out low-quality mappings before calling with clair.
  4. Octopus: seems to work for our dataset.

We are also happy to run it on our side if you can provide us access to the datasets.

Regards,

holtjma commented 3 years ago

Sure, give me a few days to get some time to re-try this on our internal datasets. I'll post back here with success/failure and if I encounter any issues.

I think last time I did a local install (we are currently experiencing some issues with docker/singularity), so does the master branch also include these changes?

raja-appuswamy commented 3 years ago

Yes, the master branch should include all these changes, Matt.

holtjma commented 3 years ago

Let me caveat everything here because some of my tests (specifically deepvariant) are still running, but here is some preliminary feedback/observations:

Issues:

  1. -s: this option led to crashes in my tests due to an error check in either samtools addreplacerg or sentieon util sort, because it detected a mismatch between the sequence length and the CIGAR string. I ended up just removing the option and haven't tested further with it. I can dig into this further if you want more details.
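For what it's worth, the consistency check those tools apply is that the read-consuming CIGAR operations (M, I, S, =, X) must sum to the length of SEQ; a record where they differ is rejected. A sketch of that check, with `cigar_query_length` being my own name:

```python
import re

def cigar_query_length(cigar):
    """Number of read bases a CIGAR string consumes.

    Only M, I, S, = and X operations consume query bases; D, N, H and P
    do not. Validators reject records where this total differs from
    len(SEQ), which matches the crash described above.
    """
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MIS=X")
```

A soft-clipping bug would typically show up here as an S operation whose length doesn't account for the clipped bases, making the total disagree with the stored sequence length.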

Observations:

  1. It's quite fast, which is good! I'll need to go look to do a better comparison, but no complaints on that front. 👍
  2. I tested with two references: 1) hg38+decoy+alt contigs (hg38_asm5_alt), which we currently use for clinical data and which is a fairly standard reference genome, and 2) a recently released hg38 with masked falsely duplicated regions (hg38_GIAB_masked), available through this paper or more directly here. I also tested with multiple variant callers, and I'll point out observations on those as I go below.
  3. For some reason, all of the accel-align combinations with hg38_asm5_alt underperformed, with the highest recall at only 0.9828. I'm not sure why, but the rest of the results will focus on the hg38_GIAB_masked reference since it had some decent results.
  4. octopus variant caller: every sample on hg38_GIAB_masked crashed with a core dump. This may be more of an issue for octopus (it shouldn't crash outright like that), but I also haven't had that issue with other aligners.
  5. Most of the downstream callers (clair3, strelka2, DNAscope) have low precision, in the 0.97-0.98 range; deepvariant is the one exception with >0.99. For context, all five of those (including octopus), when run with sentieon (e.g. bwa-mem), have precision >0.9925.
  6. Sensitivity is also lower for some of the callers: strelka2 0.9926, deepvariant 0.9920, dnascope 0.9977, and clair3 0.9829. Compared to the bwa-mem processed BAMs: strelka2 0.9980, deepvariant 0.9986, dnascope 0.9988, and clair3 0.9980.
  7. The challenging medically relevant autosomal genes (CMRG) benchmark shows similar results, but more exaggerated because those regions are harder.
  8. I don't have any CNV/SV results because manta crashed (presumably this is because of the -s issue from earlier).

Okay, that was a lot of text, but I hope some of it is helpful.

raja-appuswamy commented 3 years ago

Thank you very much, Matt, for this thorough testing. This is very interesting. It would be of great benefit to us if you could share the fastq files and scripts so that we can reproduce the same results on our side and dig further into the problems. Do you think this would be possible?

Thanks again, Raja

holtjma commented 3 years ago

I'll check on the data side; we've shared some of those before since they're non-patient benchmark samples. Is there an email address (or similar way) I can pass this information along to, assuming it gets approved?

The scripts are a bit more complicated: they're part of a benchmarking pipeline we have that tests a variety of aligner/caller combinations using a snakemake process, and it's also fairly baked into our metadata querying system. However, if there are specific commands you want to know, I should be able to share those. For example, this is what I have for the accelalign rule:

rule accelalign:
    input:
        fq1=ancient(atlas_tools.getFastq1),
        fq2=ancient(fq2_wrapper),
        index=getAccelAlignIndex
    output:
        bam=temp("{pipeline_dir}/single_alignments/{reference}/accelalign-{version}/{hafs_id}_{sample}_sorted.bam"),
        bai=temp("{pipeline_dir}/single_alignments/{reference}/accelalign-{version}/{hafs_id}_{sample}_sorted.bam.bai")
    params:
        accalign=f"{SOFTWARE_DOWNLOAD}/accel-align-{{version}}/accalign",
        tbb_lib=f"{SOFTWARE_DOWNLOAD}/oneTBB-2019_U5/build/linux_intel64_gcc_cc4.9.4_libc2.17_kernel3.10.0_release",
        sentieon=f"{SOFTWARE_PUBLIC}/sentieon-genomics-{SENTIEON_VERSION}/bin/sentieon",
        samtools=f"{JHOLT_HOME}/install-ES/bin/samtools",
        bwakit=f"{SOFTWARE_DOWNLOAD}/bwa.kit-0.7.15",
        rgoptions=lambda wildcards: getRGOptions(wildcards),
        tempParams=("--temp_dir {pipeline_dir}/tmp/sentieon_alignments" if USE_LOCAL_TEMP else ""),
        reference=lambda wildcards: getReferenceFasta(wildcards),
    log: "{pipeline_dir}/logs/single_alignments/{reference}/accelalign-{version}/{hafs_id}_{sample}_sorted.log"
    benchmark: "{pipeline_dir}/benchmark/single_alignments/{reference}/accelalign-{version}/{hafs_id}_{sample}_sorted.tsv"
    threads: THREADS_PER_PROC
    resources:
        mem_mb=24000
    shell:
        '''
        export LD_LIBRARY_PATH=${{LD_LIBRARY_PATH}}:{params.tbb_lib}
        {params.accalign} \
            -w \
            -t {threads} \
            -a \
            {params.reference} \
            {input.fq1} {input.fq2} | \
        {params.samtools} addreplacerg \
            -r "{params.rgoptions}" \
            - | \
        {params.bwakit}/k8 \
            {params.bwakit}/bwa-postalt.js \
            {params.reference}.alt | \
        {params.sentieon} util sort {params.tempParams} \
            --bam_compression 1 \
            -r {params.reference} \
            -o {output.bam} \
            -t {threads} \
            --sam2bam \
            -i -
        '''
raja-appuswamy commented 3 years ago

You can send it to me Matt at raja.appuswamy@eurecom.fr. Really appreciate this.

Raja

holtjma commented 3 years ago

There should be a couple emails in your inbox now.