msk-access / cwl-commandlinetools

Central location for CWL CommandLineTools
https://cmo-ci.gitbook.io/command-line-tools-cwl/
GNU Affero General Public License v3.0
5 stars 0 forks source link

Write CWL and Docker for Fgbio FilterConsensusReads #58

Closed rhshah closed 4 years ago

rhshah commented 4 years ago

Tool: http://fulcrumgenomics.github.io/fgbio/tools/latest/FilterConsensusReads.html Version: https://github.com/fulcrumgenomics/fgbio/releases/tag/1.2.0

Example Command:

Duplex BAM

/opt/common/CentOS_7/java/jdk1.8.0_202/bin/java -Djava.io.tmpdir=/scratch/ -Xmx4g -jar /home/arorak/repositories/fgbio/target/scala-2.13/fgbio-1.2.0-c8e1c41-SNAPSHOT.jar FilterConsensusReads --input /juno/work/bergerm1/bergerlab/arorak/Project_benchmark_fgbio/FqToBamFgbio_2020-06-10/NovaSeq_Plasma_Healthy_Donor/DONOR11-N/DONOR11-N.fqtobam.clip.map.md.abra2.fm.group.cons.map.abra2.fm.bam --output /juno/work/bergerm1/bergerlab/arorak/Project_benchmark_fgbio/FqToBamFgbio_2020-06-10/NovaSeq_Plasma_Healthy_Donor/DONOR11-N/DONOR11-N.fqtobam.clip.map.md.abra2.fm.group.cons.map.abra2.fm.duplex.bam -r /work/access/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta -M=2 1 1 -N=30 --reverse-per-base-tags=true

Simplex + Duplex :

/opt/common/CentOS_7/java/jdk1.8.0_202/bin/java -Djava.io.tmpdir=/scratch/ -Xmx4g -jar /home/arorak/repositories/fgbio/target/scala-2.13/fgbio-1.2.0-c8e1c41-SNAPSHOT.jar FilterConsensusReads --input /juno/work/bergerm1/bergerlab/arorak/Project_benchmark_fgbio/FqToBamFgbio_2020-06-10/NovaSeq_Plasma_Healthy_Donor/DONOR11-N/DONOR11-N.fqtobam.clip.map.md.abra2.fm.group.cons.map.abra2.fm.bam --output /juno/work/bergerm1/bergerlab/arorak/Project_benchmark_fgbio/FqToBamFgbio_2020-06-10/NovaSeq_Plasma_Healthy_Donor/DONOR11-N/DONOR11-N.fqtobam.clip.map.md.abra2.fm.group.cons.map.abra2.fm.simplex.pre.bam -r /work/access/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta -M=3 3 0 -N=30 --reverse-per-base-tags=true
rhshah commented 4 years ago

@ionox0 @andurill @kanika-arora we should discuss the parameters for this tool and understand them what is logical and what is not.

rhshah commented 4 years ago

I will make the CWL and put the options here so we can discuss them here

rhshah commented 4 years ago
option option short prefix type description requirement Max Values default value
reverse-per-base-tags R Boolean Reverse [complement] per base tags on reverse strand reads. Optional 1 false
min-reads M Int The minimum number of reads supporting a consensus base/read. Required 3  
max-read-error-rate E Double The maximum raw-read error rate across the entire consensus read. Required 3 0.025
max-base-error-rate e Double The maximum error rate for a single consensus base. Required 3 0.1
min-base-quality N PhredScore Mask (make N) consensus bases with quality less than this threshold. Required 1
max-no-call-fraction n Double Maximum fraction of no-calls in the read after filtering. Optional 1 0.2
min-mean-base-quality q PhredScore The minimum mean base quality across the consensus read. Optional 1  
require-single-strand-agreement s Boolean Mask (make N) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only). Optional 1 false

@andurill @ionox0 @kanika-arora @murphycj2 we should discuss this option, from the above example command we are only using the

Please let us know your thoughts on

Obviously this will differ for Duplex and Unfiltered. But it will be good for all of us to be on the same page with regards to its usage.

ionox0 commented 4 years ago

One minor point, I believe 3 3 0 would be for "Simplex" + "Duplex" reads rather than "Unfiltered" (it wouldn't include singletons or 2-read families)

Apart from that, here are the equivalent features of Marianas:

Marianas doesn't do any sort of filtering that fgbio does for these parameters as far as I'm aware, so we shouldn't need to use them unless we are trying to make improvements:

Marianas does however include a min_mapq param to only collapse reads with adequate mapping qualities, but we've set that to 1 so it should have little effect.

As for the min-reads param we should check that we are using the correct definition of "Unfiltered" with 3 3 0 (which I believe to be simplex + duplex)

kanika-arora commented 4 years ago

I second what Ian said about 3 3 0, that file is the precursor to simplex BAM. The python script create_simplex_bam_from_consensus.py removes duplex reads from this BAM to create the simplex BAM.

@ionox0 I set min-base-quality to 30 because that is what was set in the script that Dilmi (and you?) had put together. I was hoping you could tell me why that was chosen. I am not sure if 30 is too stringent or not. Mike initially thought that it may be. It would be best to look at the quality score distribution of consensus bases to select a threshold.

Regarding require-single-strand-agreement, I believe if there is disagreement, marianas picks the reference base if that is one of the bases. And if both strands have non-reference bases, then it introduces a N. Not that we are trying to mimic Marianas, but there may not be a way to do that here. Still, I think we should set this option to true. But we should ask Mike too.

For all other options, I think what values we have right now should be good.

rhshah commented 4 years ago

thank you @ionox0 and @kanika-arora, I think the unfiltered bam will be the one before we start the filtering process and then we run FilterConsensReads.

for min-base-quality I think 20 should be good enough as that is what we use in most cases

for require-single-strand-agreement @kanika-arora based on the command above you are not using it, would that change the results?

kanika-arora commented 4 years ago

Regarding require-single-strand-agreement, yes that would change results slightly. I don't think it should affect the true positives much though. At least the cases that I had reviewed manually, there didn't seem to be strand discordance. If anything it should reduce noise. I think min-base-quality threshold of 20 seems okay if the consensus base quality distribution is similar to that of raw reads. I am not sure if that is the case though. I will plot the quality score distribution for a few samples so that we can finalize this threshold.

rhshah commented 4 years ago

@kanika-arora thank you do let us know when you have the results

kanika-arora commented 4 years ago

DONOR22-TP_qual_score_dist.pdf This is quality score distribution for one sample. I showed this to Mike too. He too suggested 20 as the cutoff. He also agreed with setting require-single-strand-agreement to true. That shouldn't have a big impact. I think it would only slightly improve specificity. I was initially thinking of testing the parameter on the samples we used for fgbio vs marianas benchmarking, but now I am thinking it will only slow us down. I think we can implement it and test it with all of the other changes. But let me know if you disagree.

rhshah commented 4 years ago

Thanks @kanika-arora #71