mrvollger / SDA

Segmental Duplication Assembler (SDA).
MIT License
44 stars 6 forks source link

Sending `MINCOV` instead of `MAXCOV` to `PrintHetFreq.py`'s `maxCount` parameter #3

Closed franrodalg closed 5 years ago

franrodalg commented 5 years ago

Hi @mvollger! Really nice work! Congratulations and thanks for making this available for the community.

I was reproducing the code and I spotted something that I'm not sure if it needs fixing. In https://github.com/mvollger/SDA/blob/4a24160797ac6c9a29913038423f2599aa22574f/SDA.1.snakemake.py#L231 the current version of the code sends MINCOV to the maxCount parameter of PrintHetFreq.py. For context, this is within the following pipeline: https://github.com/mvollger/SDA/blob/4a24160797ac6c9a29913038423f2599aa22574f/SDA.1.snakemake.py#L228-L233 I suspect that was intended to be MAXCOV instead (or at least not MINCOV). The other time that PrintHetFreq.py is called, that parameter receives a quite big value: https://github.com/mvollger/SDA/blob/4a24160797ac6c9a29913038423f2599aa22574f/SDA.1.snakemake.py#L195-L200 which obviously differs from the unnamed parameter of the script in that case (0).

My apologies if the use of MINCOV was intentional and I did not understand the rationale behind it.

mrvollger commented 5 years ago

@franrodalg thanks for taking an interest!

I must admit that is a poorly named parameter, but it is set correctly.

This script counts the number of reads at a given position that are A,C,G, or T and if three or more of them are above maxCount it removes position from consideration before correlation clustering.

The idea here is that we are looking for paralog specific variants and MINCOV is the minimum coverage we will look for those variants at. However, if at one positions 3 or more basepairs have coverage greater than MINCOV then that positional difference is not specific to one paralog and it is better to filter it out before CC.

In the second case where it is just set to some absurdly high number, it is because it is used to generate a plot, and I do not want to filter out any of the positions, so I have them all for visualization.

Let me know if that explains it all.

Best, Mitchell

franrodalg commented 5 years ago

Thanks @mvollger ! That was really helpful. Sorry for the misunderstanding. Cheers, Fran