Closed ktmeaton closed 1 month ago
Hi Katherine, Thank you for your question, detailed information, and reproducible example!
I think our command is right: using -u with uniq is not correct here, because it entirely removes patterns that are seen more than once. For example, with 1, 2, 3, 1:
Command in script:
printf '1\n2\n3\n1\n' | sort -u
1
2
3
Suggested command:
printf '1\n2\n3\n1\n' | sort | uniq -u
2
3
But we do want to count 1 as a pattern.
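The same distinction can be sketched in Python, purely as an illustration (this is not code from count_patterns.py). A set plays the role of sort -u, while filtering a Counter for a count of exactly one mimics sort | uniq -u:

```python
from collections import Counter

# The toy input from the example above: the pattern "1" appears twice.
patterns = ["1", "2", "3", "1"]

# Equivalent of `sort -u`: every distinct value, kept once.
distinct = sorted(set(patterns))

# Equivalent of `sort | uniq -u`: only values that occur exactly once,
# so the repeated pattern "1" is dropped entirely.
counts = Counter(patterns)
singletons = sorted(p for p, n in counts.items() if n == 1)

print(distinct)    # ['1', '2', '3']
print(singletons)  # ['2', '3']
```

The number of distinct patterns is len(distinct), which is 3 here; len(singletons) would undercount by ignoring any pattern shared between variants.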
I think our method of doing this with Unix programs isn't the clearest, and we probably could have done it in pure Python. It was just (from memory) that this way runs a lot faster and uses less memory if you are looking at many hundreds of millions of k-mers.
But let us know if there's something wrong with the interpretation here.
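For reference, a pure-Python count of distinct patterns might look like the minimal sketch below. It assumes one pattern per line and skips blank lines; the function name count_distinct_patterns is hypothetical and is not part of pyseer.

```python
import io


def count_distinct_patterns(lines):
    """Count distinct non-empty lines (hypothetical helper, not from pyseer).

    Roughly what `sort -u | wc -l` reports, ignoring blank lines.
    """
    seen = set()
    for line in lines:
        pattern = line.rstrip("\n")
        if pattern:
            seen.add(pattern)
    return len(seen)


# Toy input standing in for a patterns file: "1" is repeated.
example = io.StringIO("1\n2\n3\n1\n")
n_patterns = count_distinct_patterns(example)
print(n_patterns)  # 3
```

This keeps every distinct pattern in memory at once, which may be one reason the Unix sort approach scales better to very large inputs.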
Oh gosh, I completely forgot how the -u parameter worked for uniq! Thank you so much for correcting my mistake with this very clear example!
Hi Pyseer team!
I've found that count_patterns.py is outputting an unexpected number of patterns for the Bonferroni correction. Specifically, the number of patterns is higher than expected, and the adjusted p-value seems too strict. I'm wondering if this may be due to how the sorting is implemented? The following is a minimal working example.
To Reproduce
I'm running pyseer v1.3.11 from conda and count_patterns.py from commit 0d19938. I'm using a subset of 15 genomes from the S. pneumoniae GWAS tutorial, and I'm looking for genomic variants that are associated with a group of sequences I've defined as Lineage 2.
(I've just used an allele frequency filter to make this run faster.)
When I use the count_patterns.py script, I get the following:
But if I count the patterns myself, I get different values:
I'm pretty sure there should be at most 1096 patterns in the presence/absence input, and that would be if all variants passed filtering. So the output of 1926 from count_patterns.py seems especially high?
I see that the internal command in count_patterns.py is:
But since the --output-patterns file is not sorted, maybe it should be more like this?
Apologies if I misunderstood the statistical concepts underlying the Bonferroni correction you've implemented! I'm just trying to get a deeper understanding of this step in the analysis.
Thanks, Katherine
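To make the "too strict" concern concrete, here is a small sketch of how the pattern count feeds into a Bonferroni threshold. The significance level of 0.05 is an assumption for illustration; the two pattern counts are the ones quoted in this issue.

```python
# Bonferroni threshold: significance level divided by the number of tests,
# here taken to be the number of distinct patterns.
ALPHA = 0.05  # assumed significance level, for illustration only


def bonferroni_threshold(n_patterns, alpha=ALPHA):
    """Return the per-test p-value threshold for n_patterns tests."""
    if n_patterns <= 0:
        raise ValueError("n_patterns must be positive")
    return alpha / n_patterns


reported = bonferroni_threshold(1926)  # count reported by count_patterns.py
expected = bonferroni_threshold(1096)  # upper bound expected in the issue

print(f"1926 patterns: {reported:.3e}")
print(f"1096 patterns: {expected:.3e}")
```

A larger pattern count always gives a smaller threshold, so overcounting patterns would make the correction more conservative than intended.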