Write CWL and Docker for Fgbio FilterConsensusReads

rhshah commented 4 years ago

Tool: http://fulcrumgenomics.github.io/fgbio/tools/latest/FilterConsensusReads.html Version: https://github.com/fulcrumgenomics/fgbio/releases/tag/1.2.0

Example Command:

Duplex BAM

/opt/common/CentOS_7/java/jdk1.8.0_202/bin/java -Djava.io.tmpdir=/scratch/ -Xmx4g -jar /home/arorak/repositories/fgbio/target/scala-2.13/fgbio-1.2.0-c8e1c41-SNAPSHOT.jar FilterConsensusReads --input /juno/work/bergerm1/bergerlab/arorak/Project_benchmark_fgbio/FqToBamFgbio_2020-06-10/NovaSeq_Plasma_Healthy_Donor/DONOR11-N/DONOR11-N.fqtobam.clip.map.md.abra2.fm.group.cons.map.abra2.fm.bam --output /juno/work/bergerm1/bergerlab/arorak/Project_benchmark_fgbio/FqToBamFgbio_2020-06-10/NovaSeq_Plasma_Healthy_Donor/DONOR11-N/DONOR11-N.fqtobam.clip.map.md.abra2.fm.group.cons.map.abra2.fm.duplex.bam -r /work/access/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta -M=2 1 1 -N=30 --reverse-per-base-tags=true

Simplex + Duplex :

/opt/common/CentOS_7/java/jdk1.8.0_202/bin/java -Djava.io.tmpdir=/scratch/ -Xmx4g -jar /home/arorak/repositories/fgbio/target/scala-2.13/fgbio-1.2.0-c8e1c41-SNAPSHOT.jar FilterConsensusReads --input /juno/work/bergerm1/bergerlab/arorak/Project_benchmark_fgbio/FqToBamFgbio_2020-06-10/NovaSeq_Plasma_Healthy_Donor/DONOR11-N/DONOR11-N.fqtobam.clip.map.md.abra2.fm.group.cons.map.abra2.fm.bam --output /juno/work/bergerm1/bergerlab/arorak/Project_benchmark_fgbio/FqToBamFgbio_2020-06-10/NovaSeq_Plasma_Healthy_Donor/DONOR11-N/DONOR11-N.fqtobam.clip.map.md.abra2.fm.group.cons.map.abra2.fm.simplex.pre.bam -r /work/access/production/resources/reference/versions/hg19/Homo_sapiens_assembly19.fasta -M=3 3 0 -N=30 --reverse-per-base-tags=true

[x] CWL
[x] Docker
[ ] Test with cwltool/toil
[x] Example input
[x] Readme

rhshah commented 4 years ago

@ionox0 @andurill @kanika-arora we should discuss the parameters for this tool and understand them what is logical and what is not.

rhshah commented 4 years ago

I will make the CWL and put the options here so we can discuss them here

rhshah commented 4 years ago

option	option short prefix	type	description	requirement	Max Values	default value
reverse-per-base-tags	R	Boolean	Reverse [complement] per base tags on reverse strand reads.	Optional	1	false
min-reads	M	Int	The minimum number of reads supporting a consensus base/read.	Required	3
max-read-error-rate	E	Double	The maximum raw-read error rate across the entire consensus read.	Required	3	0.025
max-base-error-rate	e	Double	The maximum error rate for a single consensus base.	Required	3	0.1
min-base-quality	N	PhredScore	Mask (make N) consensus bases with quality less than this threshold.	Required	1
max-no-call-fraction	n	Double	Maximum fraction of no-calls in the read after filtering.	Optional	1	0.2
min-mean-base-quality	q	PhredScore	The minimum mean base quality across the consensus read.	Optional	1
require-single-strand-agreement	s	Boolean	Mask (make N) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only).	Optional	1	false

@andurill @ionox0 @kanika-arora @murphycj2 we should discuss this option, from the above example command we are only using the

min-reads (duplex - 2 1 1; unfiltered - 3 3 0)
min-base-quality (30)
reverse-per-base-tags (true)

Please let us know your thoughts on

max-read-error-rate
max-base-error-rate
max-no-call-fraction
min-mean-base-quality
require-single-strand-agreement

Obviously this will differ for Duplex and Unfiltered. But it will be good for all of us to be on the same page with regards to its usage.

ionox0 commented 4 years ago

One minor point, I believe 3 3 0 would be for "Simplex" + "Duplex" reads rather than "Unfiltered" (it wouldn't include singletons or 2-read families)

Apart from that, here are the equivalent features of Marianas:

max-base-error-rate - This should theoretically achieve a similar thing as minConsensusPercent in Marianas. We set this to 90, so perhaps setting it to 10 would give us an equivalent thing to Marianas.
max-no-call-fraction - Marianas will discard reads that are all "N", but this param would give us more control I believe. If we want to be consistent with Marianas we might want to set this to 1, but it also might make sense to just go with the default
require-single-strand-agreement - Marianas will essentially always have this set as true because mismatches from either strand are masked to "N"
min-base-quality - This is not the same as Marianas min_basq. In Marianas, the min base quality is used as a threshold for bases to be included in the consensus, and here it is used after consensus calling on the new consensus base qualities. So I don't think we will be able to achieve a similar base quality threshold functionality based on the existing parameters

Marianas doesn't do any sort of filtering that fgbio does for these parameters as far as I'm aware, so we shouldn't need to use them unless we are trying to make improvements:

max-read-error-rate (I assume this means errors within the consensus read, as opposed to mismatches from the reference)
min-mean-base-quality

Marianas does however include a min_mapq param to only collapse reads with adequate mapping qualities, but we've set that to 1 so it should have little effect.

As for the min-reads param we should check that we are using the correct definition of "Unfiltered" with 3 3 0 (which I believe to be simplex + duplex)

kanika-arora commented 4 years ago

I second what Ian said about 3 3 0, that file is the precursor to simplex BAM. The python script create_simplex_bam_from_consensus.py removes duplex reads from this BAM to create the simplex BAM.

@ionox0 I set min-base-quality to 30 because that is what was set in the script that Dilmi (and you?) had put together. I was hoping you could tell me why that was chosen. I am not sure if 30 is too stringent or not. Mike initially thought that it may be. It would be best to look at the quality score distribution of consensus bases to select a threshold.

Regarding require-single-strand-agreement, I believe if there is disagreement, marianas picks the reference base if that is one of the bases. And if both strands have non-reference bases, then it introduces a N. Not that we are trying to mimic Marianas, but there may not be a way to do that here. Still, I think we should set this option to true. But we should ask Mike too.

For all other options, I think what values we have right now should be good.

rhshah commented 4 years ago

thank you @ionox0 and @kanika-arora, I think the unfiltered bam will be the one before we start the filtering process and then we run FilterConsensReads.

for min-base-quality I think 20 should be good enough as that is what we use in most cases

for require-single-strand-agreement @kanika-arora based on the command above you are not using it, would that change the results?

kanika-arora commented 4 years ago

Regarding require-single-strand-agreement, yes that would change results slightly. I don't think it should affect the true positives much though. At least the cases that I had reviewed manually, there didn't seem to be strand discordance. If anything it should reduce noise. I think min-base-quality threshold of 20 seems okay if the consensus base quality distribution is similar to that of raw reads. I am not sure if that is the case though. I will plot the quality score distribution for a few samples so that we can finalize this threshold.

rhshah commented 4 years ago

@kanika-arora thank you do let us know when you have the results

kanika-arora commented 4 years ago

DONOR22-TP_qual_score_dist.pdf This is quality score distribution for one sample. I showed this to Mike too. He too suggested 20 as the cutoff. He also agreed with setting require-single-strand-agreement to true. That shouldn't have a big impact. I think it would only slightly improve specificity. I was initially thinking of testing the parameter on the samples we used for fgbio vs marianas benchmarking, but now I am thinking it will only slow us down. I think we can implement it and test it with all of the other changes. But let me know if you disagree.

rhshah commented 4 years ago

Thanks @kanika-arora #71

msk-access / cwl-commandlinetools

Write CWL and Docker for Fgbio FilterConsensusReads #58