Closed franrodalg closed 5 years ago
@franrodalg thanks for taking an interest!
I must admit that is a poorly named parameter, but it is set correctly.
This script counts the number of reads at a given position that are A,C,G, or T and if three or more of them are above maxCount it removes position from consideration before correlation clustering.
The idea here is that we are looking for paralog specific variants and MINCOV is the minimum coverage we will look for those variants at. However, if at one positions 3 or more basepairs have coverage greater than MINCOV then that positional difference is not specific to one paralog and it is better to filter it out before CC.
In the second case where it is just set to some absurdly high number, it is because it is used to generate a plot, and I do not want to filter out any of the positions, so I have them all for visualization.
Let me know if that explains it all.
Best, Mitchell
Thanks @mvollger ! That was really helpful. Sorry for the misunderstanding. Cheers, Fran
Hi @mvollger! Really nice work! Congratulations and thanks for making this available for the community.
I was reproducing the code and I spotted something that I'm not sure if it needs fixing. In https://github.com/mvollger/SDA/blob/4a24160797ac6c9a29913038423f2599aa22574f/SDA.1.snakemake.py#L231 the current version of the code sends
MINCOV
to themaxCount
parameter ofPrintHetFreq.py
. For context, this is within the following pipeline: https://github.com/mvollger/SDA/blob/4a24160797ac6c9a29913038423f2599aa22574f/SDA.1.snakemake.py#L228-L233 I suspect that was intended to beMAXCOV
instead (or at least notMINCOV
). The other time thatPrintHetFreq.py
is called, that parameter receives a quite big value: https://github.com/mvollger/SDA/blob/4a24160797ac6c9a29913038423f2599aa22574f/SDA.1.snakemake.py#L195-L200 which obviously differs from the unnamed parameter of the script in that case (0
).My apologies if the use of
MINCOV
was intentional and I did not understand the rationale behind it.