holtjma opened this issue 3 years ago
Hello, and thank you for trying out the aligner and for the feedback.
As you noticed, Accel-Align does produce lower MAPQ values because there is indeed a difference in the calculation; to be more precise, the metric used to score the best and second-best alignments differs between Accel-Align and other aligners. This makes Accel-Align's MAPQ estimation more conservative. We have tested variant calling with HaplotypeCaller without any trouble, but not with Manta. As you suggested, it is possible that Manta's strict quality filtering removes reads mapped by Accel-Align. This seems to be a limitation/design aspect of Manta, described here: https://github.com/Illumina/manta/issues/104.
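For context, many short-read aligners report a Phred-scaled estimate of the probability that the reported position is wrong, driven largely by the gap between the best and second-best alignment scores. A generic illustration only, not Accel-Align's exact formula:

$$\mathrm{MAPQ} = -10 \log_{10} P(\text{position is wrong}) \approx c \cdot (S_{\text{best}} - S_{\text{second best}})$$

With that kind of estimate, an aligner that scores or weights the second-best candidate differently will report systematically lower MAPQ for the same reads, which matches what you observed.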
Could you perhaps share some details about the alignment? In particular:
Yea, happy to share a bit more info.
Thank you, Matt. We would like to test this on our side before giving you an update. We are looking into testing other variant callers with the NA12878 whole-genome sequencing data (accession ERR194147, https://www.ebi.ac.uk/ena/browser/view/PRJEB3381?show=reads). Would this be similar to your dataset? It would help us a lot if you could point us to the exact data you used so we can reproduce the result, i.e., if possible and if it is publicly available.
Also, please let us know if you did any post-processing between alignment and variant calling.
In general, it should be similar. We're using NA12878 (HG001) along with HG002-HG005 and the corresponding GIAB benchmark truth sets. Currently, these are samples we've sequenced internally for clinical validations. However, if you have trouble reproducing the issue, I can ask about sharing the data with you for testing.
Here are the steps we're currently doing; I can elaborate with exact commands if necessary:

- `samtools addreplacerg` to get read group information in the output
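Roughly, that step looks like this (a minimal sketch with placeholder read-group fields, paths, and thread counts, not our exact command):

```bash
# Tag reads with a read group as the SAM comes off the aligner, then sort to BAM.
# RG fields (ID/SM/PL) and file names are placeholders.
accalign -t 16 ref.fa sample_R1.fq sample_R2.fq \
  | samtools addreplacerg -r 'ID:sample1' -r 'SM:sample1' -r 'PL:ILLUMINA' - \
  | samtools sort -@ 8 -o sample1_sorted.bam -
samtools index sample1_sorted.bam
```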
Hey Matt,
Thank you very much for your feedback. We tested our aligner with several of the variant callers you suggested, and we have added some new features to the code to fix certain aspects:
We also ran some quick tests on the NA12878 dataset (the 76 bp dataset from the paper), using the default parameters of the variant callers. Here's a high-level summary of what we saw:
For HaplotypeCaller:
- BWA-MEM: SNP 0.959, indel 0.959
- Accel-Align: SNP 0.920, indel 0.912

For Octopus: the VCF generated by Octopus is v4.3, which is not yet supported by vcftools for splitting SNPs/indels, so we compute the f-score over all variants:
- BWA-MEM: 0.944
- Accel-Align: 0.940

For Strelka2:
- BWA-MEM: SNP 0.772, indel 0.272
- Accel-Align: SNP 0.778, indel 0.256

For Manta: since Manta focuses on SVs and indels rather than SNPs, we only report the f-score for small indels here:
- BWA-MEM: 0.966
- Accel-Align: 0.970

Note: please enable soft clipping with the `-s` option, otherwise Manta may not work.

For Clair3: first, Clair had some issues with long read names; we have updated our code to fix this, so it should work now. However, Clair seems to be very sensitive to errors in alignment. Unlike HaplotypeCaller, Clair seems to crash if a misaligned read leads to a conflicting variant call. We saw that Clair works if we filter aligned reads based on alignment score (see the sketch after this list). With that filter in place, here are the results:
- BWA-MEM: SNP 0.820, indel 0.879
- Accel-Align: SNP 0.824, indel 0.791
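One way to apply such a filter, as a minimal sketch (the cutoff of 100 is an illustrative assumption, not a tuned recommendation, and the `-e` filter expression requires samtools >= 1.12):

```bash
# Keep only reads whose alignment score (AS tag) is at least 100,
# then index the filtered BAM before running the variant caller.
samtools view -b -e '[AS] >= 100' sample_sorted.bam -o sample_filtered.bam
samtools index sample_filtered.bam
```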
As you can see, Accel-Align is pretty close to BWA-MEM in terms of f-score for SNPs with most callers. It's worse for indels under Strelka2 and Clair3, but not under Manta or HaplotypeCaller. Of course, the actual f-score varies across variant callers, but the relative results seem consistent.
Could you perhaps try rerunning Accel-Align on your dataset now, with the following recommendations:
We are also happy to run it on our side if it is possible for you to give us access to the datasets.
Regards,
Sure, give me a few days to get some time to re-try this on our internal datasets. I'll post back here with success/failure and any issues I encounter.
I think last time I did a local install (we are currently experiencing some issues with docker/singularity), so does the master branch also include these changes?
Yes, the master branch should include all these changes, Matt.
Let me caveat everything here because some of my tests (specifically deepvariant) are still running, but here are some preliminary observations and feedback:
Issues:

- `-s`: this option led to crashes in my tests due to an error check in either `samtools addreplacerg` or `sentieon util sort`, because it detected a mismatch between the sequence length and the CIGAR string. I ended up just removing the option and haven't tested further with it. I can dig into this further if you want some more details; there's a quick way to spot the offending reads sketched right below.
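In case it's useful for debugging, a minimal sketch for flagging those records, i.e. reads whose CIGAR-implied query length disagrees with the SEQ length (the BAM name is a placeholder):

```bash
# Print name, CIGAR-implied query length, and SEQ length for mismatching reads;
# these are the records the downstream length/CIGAR consistency check trips on.
samtools view aln.bam | awk '{
  cigar = $6; seq = $10;
  if (cigar == "*" || seq == "*") next;         # skip unmapped / SEQ-less records
  qlen = 0;
  while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {  # walk the CIGAR operations
    op  = substr(cigar, RLENGTH, 1);
    len = substr(cigar, 1, RLENGTH - 1) + 0;
    if (op ~ /[MIS=X]/) qlen += len;            # ops that consume the query
    cigar = substr(cigar, RLENGTH + 1);
  }
  if (qlen != length(seq)) print $1, qlen, length(seq);
}'
```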
Observations:

- … (the `-s` issue from earlier).

Okay, that was a lot of text, but I hope some of it is helpful.
Thank you very much, Matt, for this thorough testing. This is very interesting. It would be of great benefit to us if you could share the FASTQ files and scripts so that we can reproduce the same results on our side and dig further into the problems. Do you think this would be possible?
Thanks again, Raja
I'll check on the data side; we've shared some of those before since they're non-patient benchmark samples. Is there an email address (or a similar channel) I can pass this information along to, assuming it gets approved?
The scripts are a bit more complicated: they're part of a benchmarking pipeline we have that tests a variety of aligner/caller combinations using a snakemake process, and it's also fairly baked into our metadata querying system. However, if there are specific commands you want to know, I should be able to share those. For example, this is what I have for the accelalign rule:
```python
rule accelalign:
    input:
        fq1=ancient(atlas_tools.getFastq1),
        fq2=ancient(fq2_wrapper),
        index=getAccelAlignIndex
    output:
        bam=temp("{pipeline_dir}/single_alignments/{reference}/accelalign-{version}/{hafs_id}_{sample}_sorted.bam"),
        bai=temp("{pipeline_dir}/single_alignments/{reference}/accelalign-{version}/{hafs_id}_{sample}_sorted.bam.bai")
    params:
        accalign=f"{SOFTWARE_DOWNLOAD}/accel-align-{{version}}/accalign",
        tbb_lib=f"{SOFTWARE_DOWNLOAD}/oneTBB-2019_U5/build/linux_intel64_gcc_cc4.9.4_libc2.17_kernel3.10.0_release",
        sentieon=f"{SOFTWARE_PUBLIC}/sentieon-genomics-{SENTIEON_VERSION}/bin/sentieon",
        samtools=f"{JHOLT_HOME}/install-ES/bin/samtools",
        bwakit=f"{SOFTWARE_DOWNLOAD}/bwa.kit-0.7.15",
        rgoptions=lambda wildcards: getRGOptions(wildcards),
        tempParams=("--temp_dir {pipeline_dir}/tmp/sentieon_alignments" if USE_LOCAL_TEMP else ""),
        reference=lambda wildcards: getReferenceFasta(wildcards),
    log: "{pipeline_dir}/logs/single_alignments/{reference}/accelalign-{version}/{hafs_id}_{sample}_sorted.log"
    benchmark: "{pipeline_dir}/benchmark/single_alignments/{reference}/accelalign-{version}/{hafs_id}_{sample}_sorted.tsv"
    threads: THREADS_PER_PROC
    resources:
        mem_mb=24000
    shell:
        '''
        export LD_LIBRARY_PATH=${{LD_LIBRARY_PATH}}:{params.tbb_lib}
        {params.accalign} \
            -w \
            -t {threads} \
            -a \
            {params.reference} \
            {input.fq1} {input.fq2} | \
        {params.samtools} addreplacerg \
            -r "{params.rgoptions}" \
            - | \
        {params.bwakit}/k8 \
            {params.bwakit}/bwa-postalt.js \
            {params.reference}.alt | \
        {params.sentieon} util sort {params.tempParams} \
            --bam_compression 1 \
            -r {params.reference} \
            -o {output.bam} \
            -t {threads} \
            --sam2bam \
            -i -
        '''
```
You can send it to me, Matt, at raja.appuswamy@eurecom.fr. Really appreciate this.
Raja
There should be a couple emails in your inbox now.
Hello,
I was testing accelalign performance, both in terms of speed and downstream accuracy. It seems to perform quite fast, but I'm finding issues with downstream analysis. Some of the tools I'm using (e.g. manta) are simply throwing errors that seem to be related to the outputs from accelalign:
Other tools will run, but produce results that are not great, so I'm wondering if there's something missing in the outputs.
I noticed the MAPQ scores seem to be lower than those from other aligners I've tested with. Is there some difference in the score calculations? Or is there something incomplete with the SAM read-pair outputs that might trigger this?