Closed rhshah closed 4 years ago
@ionox0 @andurill @kanika-arora we should discuss the parameters for this tool and work out which settings are logical and which are not.
I will make the CWL and list the options here so we can discuss them.
Option | Short prefix | Type | Description | Requirement | Max values | Default value |
---|---|---|---|---|---|---|
reverse-per-base-tags | R | Boolean | Reverse [complement] per base tags on reverse strand reads. | Optional | 1 | false |
min-reads | M | Int | The minimum number of reads supporting a consensus base/read. | Required | 3 | |
max-read-error-rate | E | Double | The maximum raw-read error rate across the entire consensus read. | Required | 3 | 0.025 |
max-base-error-rate | e | Double | The maximum error rate for a single consensus base. | Required | 3 | 0.1 |
min-base-quality | N | PhredScore | Mask (make N) consensus bases with quality less than this threshold. | Required | 1 | |
max-no-call-fraction | n | Double | Maximum fraction of no-calls in the read after filtering. | Optional | 1 | 0.2 |
min-mean-base-quality | q | PhredScore | The minimum mean base quality across the consensus read. | Optional | 1 | |
require-single-strand-agreement | s | Boolean | Mask (make N) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only). | Optional | 1 | false |
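To make the table concrete for the CWL, here is a minimal sketch of how the options could be turned into an argument list. The long option names come from the table; the `fgbio` executable name and the `--input`, `--output` and `--ref` arguments are not in the table and are assumptions here, as is passing booleans as `true`/`false`. The helper name `build_filter_command` is hypothetical.

```python
from typing import List, Optional, Sequence


def build_filter_command(
    input_bam: str,
    output_bam: str,
    ref_fasta: str,
    min_reads: Sequence[int],
    min_base_quality: int,
    max_read_error_rate: Sequence[float] = (0.025,),
    max_base_error_rate: Sequence[float] = (0.1,),
    max_no_call_fraction: float = 0.2,
    min_mean_base_quality: Optional[int] = None,
    require_single_strand_agreement: bool = False,
    reverse_per_base_tags: bool = False,
) -> List[str]:
    """Assemble an argument list for FilterConsensusReads (sketch only)."""
    # The table allows at most 3 values for these three options.
    multi = {
        "--min-reads": min_reads,
        "--max-read-error-rate": max_read_error_rate,
        "--max-base-error-rate": max_base_error_rate,
    }
    for flag, values in multi.items():
        if not 1 <= len(values) <= 3:
            raise ValueError(f"{flag} takes between 1 and 3 values")

    # Assumption: executable name and input/output/ref arguments.
    cmd = [
        "fgbio", "FilterConsensusReads",
        "--input", input_bam,
        "--output", output_bam,
        "--ref", ref_fasta,
    ]
    for flag, values in multi.items():
        cmd += [flag, *[str(v) for v in values]]
    cmd += ["--min-base-quality", str(min_base_quality)]
    cmd += ["--max-no-call-fraction", str(max_no_call_fraction)]
    # Optional with no default in the table, so only emit it when set.
    if min_mean_base_quality is not None:
        cmd += ["--min-mean-base-quality", str(min_mean_base_quality)]
    # Assumption: boolean options are passed as explicit true/false values.
    cmd += ["--require-single-strand-agreement",
            str(require_single_strand_agreement).lower()]
    cmd += ["--reverse-per-base-tags", str(reverse_per_base_tags).lower()]
    return cmd
```

For example, `build_filter_command("in.bam", "out.bam", "ref.fa", min_reads=(3, 3, 0), min_base_quality=20)` would produce a list suitable for `subprocess.run`; the file names are placeholders.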
@andurill @ionox0 @kanika-arora @murphycj2 we should discuss this option, from the above example command we are only using the
Please let us know your thoughts on
Obviously this will differ for Duplex and Unfiltered. But it will be good for all of us to be on the same page with regards to its usage.
One minor point: I believe `3 3 0` would be for "Simplex" + "Duplex" reads rather than "Unfiltered" (it wouldn't include singletons or 2-read families).
Apart from that, here are the equivalent features of Marianas:
- `max-base-error-rate`: This should theoretically achieve a similar thing to `minConsensusPercent` in Marianas. We set that to 90, so perhaps setting this to 10% (0.1) would give us an equivalent of Marianas.
- `max-no-call-fraction`: Marianas will discard reads that are all "N", but this param would give us more control, I believe. If we want to be consistent with Marianas we might want to set this to 1, but it also might make sense to just go with the default.
- `require-single-strand-agreement`: Marianas will essentially always have this set as `true` because mismatches from either strand are masked to "N".
- `min-base-quality`: This is not the same as Marianas `min_basq`. In Marianas, the min base quality is used as a threshold for bases to be included in the consensus, and here it is used after consensus calling on the new consensus base qualities. So I don't think we will be able to achieve a similar base quality threshold functionality based on the existing parameters.

As far as I'm aware, Marianas doesn't do any of the filtering that fgbio does for these parameters, so we shouldn't need to use them unless we are trying to make improvements:
- `max-read-error-rate` (I assume this means errors within the consensus read, as opposed to mismatches from the reference)
- `min-mean-base-quality`

Marianas does however include a `min_mapq` param to only collapse reads with adequate mapping qualities, but we've set that to 1 so it should have little effect.
As for the `min-reads` param, we should check that we are using the correct definition of "Unfiltered" with `3 3 0` (which I believe to be simplex + duplex).
I second what Ian said about `3 3 0`; that file is the precursor to the simplex BAM. The Python script create_simplex_bam_from_consensus.py removes duplex reads from this BAM to create the simplex BAM.
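To check our shared understanding of `3 3 0`, here is a rough sketch of how three `min-reads` values might be applied to a family. It rests on my reading of the fgbio documentation, which should be confirmed: the first value applies to the final consensus, the second to the better-supported single strand, the third to the other strand, and a shorter list is padded by repeating its last value. Raw read counts per strand stand in for the depth that fgbio actually evaluates, and both function names are hypothetical.

```python
from typing import Sequence, Tuple


def expand_thresholds(values: Sequence[int]) -> Tuple[int, int, int]:
    """Pad 1 to 3 values to exactly 3 by repeating the last one."""
    if not 1 <= len(values) <= 3:
        raise ValueError("expected between 1 and 3 values")
    padded = list(values) + [values[-1]] * (3 - len(values))
    return padded[0], padded[1], padded[2]


def passes_min_reads(ab_reads: int, ba_reads: int,
                     min_reads: Sequence[int]) -> bool:
    """Return True if a family with these strand counts would be kept.

    Assumption: the stricter single-strand threshold is matched to the
    strand with more reads, the looser one to the strand with fewer.
    """
    total_min, high_min, low_min = expand_thresholds(min_reads)
    high, low = max(ab_reads, ba_reads), min(ab_reads, ba_reads)
    return (ab_reads + ba_reads >= total_min
            and high >= high_min
            and low >= low_min)


# Singletons and 2-read families are dropped under 3 3 0,
# while a 3-read simplex family and a 3+1 duplex family are kept.
print(passes_min_reads(1, 0, (3, 3, 0)))
print(passes_min_reads(2, 0, (3, 3, 0)))
print(passes_min_reads(3, 0, (3, 3, 0)))
print(passes_min_reads(3, 1, (3, 3, 0)))
```

Under these assumptions the output would contain both simplex and duplex families, which matches the "Simplex + Duplex" reading rather than "Unfiltered".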
@ionox0 I set `min-base-quality` to 30 because that is what was set in the script that Dilmi (and you?) had put together. I was hoping you could tell me why that was chosen. I am not sure if 30 is too stringent or not. Mike initially thought that it may be. It would be best to look at the quality score distribution of consensus bases to select a threshold.
Regarding `require-single-strand-agreement`, I believe that if there is disagreement, Marianas picks the reference base if that is one of the bases. If both strands have non-reference bases, it introduces an N. Not that we are trying to mimic Marianas, but there may not be a way to do that here. Still, I think we should set this option to true. But we should ask Mike too.
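To spell out the difference being described, here is a toy per-base sketch of the two behaviours. The first function follows the table's description of `require-single-strand-agreement`; the second follows the description of Marianas given in this comment and has not been checked against the Marianas source. Both function names are hypothetical, and real tools work on whole reads with qualities rather than single characters.

```python
def mask_on_disagreement(ab_base: str, ba_base: str) -> str:
    """Mask to N whenever the AB and BA consensus bases disagree."""
    return ab_base if ab_base == ba_base else "N"


def prefer_reference_on_disagreement(ab_base: str, ba_base: str,
                                     ref_base: str) -> str:
    """On disagreement, keep the reference base if either strand has it.

    Assumption: this mirrors the Marianas behaviour as described in
    this thread, nothing more.
    """
    if ab_base == ba_base:
        return ab_base
    if ref_base in (ab_base, ba_base):
        return ref_base
    return "N"


# AB says A, BA says G, reference is A: the two rules differ.
print(mask_on_disagreement("A", "G"))
print(prefer_reference_on_disagreement("A", "G", "A"))
```

The two rules only differ when exactly one strand carries the reference base, which is why the expected effect on true positives is small.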
For all other options, I think the values we have right now should be good.
Thank you @ionox0 and @kanika-arora. I think the unfiltered BAM will be the one from before we start the filtering process, and then we run FilterConsensusReads.
For `min-base-quality`, I think 20 should be good enough, as that is what we use in most cases.
For `require-single-strand-agreement`: @kanika-arora, based on the command above you are not using it. Would that change the results?
Regarding `require-single-strand-agreement`, yes, that would change results slightly. I don't think it should affect the true positives much though. At least in the cases that I had reviewed manually, there didn't seem to be strand discordance. If anything it should reduce noise.
I think a `min-base-quality` threshold of 20 seems okay if the consensus base quality distribution is similar to that of raw reads. I am not sure if that is the case though. I will plot the quality score distribution for a few samples so that we can finalize this threshold.
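As a rough sketch of how that distribution could be tabulated before plotting, the helpers below count consensus base qualities and report the fraction of bases that a given `min-base-quality` would mask. The function names are hypothetical. The input is any iterable of per-read quality lists; with pysam these could come from each read's `query_qualities`, which is an assumption about the tooling rather than what was actually used here.

```python
from collections import Counter
from typing import Iterable, Sequence


def quality_histogram(reads_quals: Iterable[Sequence[int]]) -> Counter:
    """Count how many bases were observed at each Phred quality."""
    hist: Counter = Counter()
    for quals in reads_quals:
        hist.update(quals)
    return hist


def fraction_below(hist: Counter, threshold: int) -> float:
    """Fraction of bases with quality strictly below the threshold."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    below = sum(count for qual, count in hist.items() if qual < threshold)
    return below / total


# Toy input: two consensus reads with made-up qualities.
hist = quality_histogram([[10, 20, 30], [30, 40]])
for cutoff in (20, 30):
    print(cutoff, fraction_below(hist, cutoff))
```

Comparing `fraction_below` at 20 and 30 across a few samples would show how much more a cutoff of 30 masks than a cutoff of 20.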
@kanika-arora thank you, do let us know when you have the results.
DONOR22-TP_qual_score_dist.pdf This is the quality score distribution for one sample. I showed this to Mike too. He also suggested 20 as the cutoff, and he agreed with setting `require-single-strand-agreement` to true. That shouldn't have a big impact; I think it would only slightly improve specificity. I was initially thinking of testing the parameter on the samples we used for fgbio vs Marianas benchmarking, but now I am thinking it will only slow us down. I think we can implement it and test it with all of the other changes. But let me know if you disagree.
Thanks @kanika-arora #71
Tool: http://fulcrumgenomics.github.io/fgbio/tools/latest/FilterConsensusReads.html
Version: https://github.com/fulcrumgenomics/fgbio/releases/tag/1.2.0
Example Command:
Duplex BAM
Simplex + Duplex :