jpuritz / dDocent

a bash pipeline for RAD sequencing
ddocent.com
MIT License
53 stars 41 forks source link

Suggested parameters for bam filtering #76

Open atfields opened 3 years ago

atfields commented 3 years ago

Hey,

I have been experimenting with some of my datasets and I have found that filtering the bam files prior to SNP calling not only reduces the number of SNPs which come out of freebayes, but also reduces the filtering needed to remove artifacts from my data. Using bamtools filter I remove

I think if you added this as a module before freebayes, then people could add the filters they wanted in a filter script if they wanted to use it.

Something like:

if [[ $FILTER = "yes" ]]; then
    if [ -z filter.txt ]; then
        echo "Please provide the bam filters"
        exit 1
    fi
    bamtools filter -script filter.txt -in cat-RRG.bam -out cat-RRG_filter.bam;
    mv cat-RRG_filter.bam cat-RRG.bam
fi

Then filter.txt scripts could be something like this:

{
"isPrimaryAlignment" : "true",
"isProperPair" : "true",
"mapQuality" : ">=40"
}

according to the bamtools filter manual.

pdimens commented 3 years ago

We're experimenting with an alternative way of generating the mapped.bed, and if this was to be implemented, I think a parsimonious way to implement it would be to generate the initial .bam files for the individuals by adding extra filters to samtools view (to avoid extra dependencies, fewer output files). For example, we can ask "filter alignment files?" in the prompt and default to asking about the three parameters you listed, or just a yes/no to execute all three at once without the nuance.

As an example, we can add samtools view -F 260 to pull out only primary alignments. Mapping quality and proper pairing would be flags -q 40 -f 0x2 respectively.

Altogether, that would look like (line 382):

... | samtools view -@$SAMProc -q 40  -F 260 -f 0x2 -SbT reference.fasta - > $i.bam 2>$i.bam.log
atfields commented 3 years ago

These can definitely be done in samtools view with flags, though by running them through bamtools, you do give the option of making it customizable without going directly into the dDocent script. Of course it can be done like the mapping values, with multiple questions, though if someone wishes to filter on a different parameters then what is provided they must either edit the script or do dDocent piecemeal with their filtering in the middle of it.

Another alternative would be to ask if they would like the defaults and if so, then the script could use samtools with those parameters, but if someone would like to provide their own then run it through bamtools. There are many ways to Rome.

pdimens commented 3 years ago

The best way to implement what you're describing is through Hollenbeck's secret snakemake-based dDocent and a readily-configurable config.yaml file.

jpuritz commented 3 years ago

I'd be ok with adding the primary alignments filter to the cat-RRG.bam file (I don't think freebayes uses them anyway), but I don't think reads should be filtered on "properly paired" status. "Properly paired" depends a lot on the estimated mapping interval (which is poorly estimated outside of shotgun sequencing), and if it doesn't match, even if the reads are on the same contig and the correct orientation, they will not be labeled as properly paired. I also think mapping quality of 40 is very high.

Outside of specific filtering concerns, I like the practice of each individual .bam file being inclusive (less filtering) and the cat-RRG.bam file being filtered. I have this practice in a newer WGS dDocent pipeline that I want to port over to here. It's also going to switch mostly to bcftools instead of VCFtools which is slower and now a pain with their inaccurate and stupid formatting error messages.

I would like to know more about snake-make based dDocent, @chollenbeck .

pdimens commented 3 years ago

@jpuritz Looking through dDocent, the only occurrence of vcftools is filtering TotalRawSNPs.vcf for 90% missingness. Tbh, I (and others I know who work with dDocent) begin their filtering on TRS instead of Final.recode.vcf.

So, regarding swapping vcftools -> bcftools, it can be done largely painlessly and we'd just need to change a handful of lines, but is the inclusion of that filtering step necessary, or is it a convenience feature?

Idk about performance difference, but vcfcombine from vcflib can be replaced with bcftools concat