Unexpected number of patterns from count_patterns.py

ktmeaton commented 1 month ago

Hi Pyseer team!

I've found that count_patterns.py is outputting an unexpected number of patterns for the bonferroni correction. Specifically, the number of patterns is higher than expected, and the adjusted p-value seems too strict. I'm wondering if this may be due to how the sorting is implemented? The following is a minimal working example.

To Reproduce

I'm running pyseer v1.3.11 from conda and count_patterns.py from commit 0d19938. I'm using a subset of 15 genomes from the S. pneumoniae GWAS tutorial. And I'm looking for genomic variants that are associated with a group of sequences I've defined as Lineage 2.

Variants Presence/Absence Rtab: variants.txt
Lineages: lineage_2.txt

(I've just used an allele frequency filter to make this run faster)

$ pyseer \
    --min-af 0.25 --max-af 0.75 \
    --pres variants.txt \
    --phenotypes lineage_2.txt \
    --no-distances \
    --output-patterns lineage_2.patterns.txt \
    --cpu 4 \
    > lineage_2.variants.tsv

Read 15 phenotypes
Detected binary phenotype
82505 loaded variants
57670 pre-filtered variants
24835 tested variants
24835 printed variants

When I use the count_patterns.py script, I get the following:

$ python count_patterns.py lineage_2.patterns.txt 
Patterns:       1926
Threshold:      2.60E-05

But if I count the patterns myself, I get different values:

$ sort lineage_2.patterns.txt | uniq -u | wc -l
954
$ echo 954 | awk '{print 0.05/$1}'
5.24e-05

I'm pretty sure there should be at most 1096 patterns in the presence absence input. And that would be if all variants passed filtering. So the output of 1926 from count_patterns.py seems especially high?

$ tail -n+2 variants.txt | cut -f 2- | sort | uniq -u | wc -l
1096

I see that the internal command in count_patterns.py is:

LC_ALL=C sort -u  -S 1014M -T /tmp lineage_2.patterns.txt | wc -l

But since the --output-patterns file is not sorted, maybe it should be more like this?

LC_ALL=C sort -S 1014M -T /tmp lineage_2.patterns.txt | uniq -u | wc -l

Apologies if I misunderstood the statistical concepts underlying the bonferroni correction you've implemented! I'm just trying to get a deeper understanding of this step in the analysis.

Thanks, Katherine

johnlees commented 1 month ago

Hi Katherine, Thank you for your question, detailed information, and reproducible example!

I think our command is right, because the use of -u is not correct with uniq, as it entirely removes patterns that are seen more than once. For example with 1,2,3,1:

Command in script

echo "1\n2\n3\n1" | sort -u 
1
2
3

Suggested command:

echo "1\n2\n3\n1" | sort | uniq -u
2
3

But we do want to count 1 as a pattern.

I think our method of doing this with unix programs isn't the clearest and we probably could have done it in pure python, it was just (from memory) that this way runs a lot more quickly/memory efficiently if you are looking at many hundreds of millions of k-mers.

But let us know if there's something wrong the the interpretation here

ktmeaton commented 1 month ago

Oh gosh, I completely forgot how the -u parameter worked for uniq 😅 thank you so much for correcting my mistake with this very clear example!

mgalardini / pyseer

Unexpected number of patterns from count_patterns.py #275

To Reproduce