Closed durrantmm closed 2 years ago
Thanks for reaching out. I am unfortunately unable to reproduce the above results / error from this data. It seems as though these values do not correspond to this table: the mean_target_hamming_distance for the above table is 9.44, and the mean_target_levenshtein_distance is 8.04. Further, the target_entropy is 3.5. Running the table through compute_pvals.py and compute_spectral_pvals.py gives these correct values. Could you please provide more information / the files you ran this on?
I apologize, the table above was inaccurate. This is what is available in the file abundant_stratified_TTT.txt.gz
.
1 TTTAGTTCCATCGAATTACGCAACAAAAAACACCACTTTAAACTTTTGAAAAGA SRR13716853_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTGAACTTGATAATCTTCAAGCAAAT SRR14320056_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTGAACTTGATAATCTTCAAGCAATT SRR14320056_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTGTTCATCGTTTCACCGTTTTTATA SRR14320061_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTTCGAACAGGGGAAAATGCACAGTA SRR13716853_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTTTGGCCGCTTGTACGAAGAAGATT SRR13716853_1
1 TTTAGTTCCATCGAATTACGCAACAAACAAAACTGTTACTTTGTGCCATAATAC SRR13716857_1
1 TTTAGTTCCATCGAATTACGCAACAAACGAATTTTCACCACTAATTGTAGTACA SRR14320055_1
1 TTTAGTTCCATCGAATTACGCAACAAACTAAAATCTGTTTTGTTTAAATAACCA SRR13716853_1
1 TTTAGTTCCATCGAATTACGCAACAAACTTGTTCATCGTTTCACCGTTTTTATA SRR13716869_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTAGCACTTGCAAAGTCGGCTAAGGGC SRR13716857_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTATTTATCTATGTCTCTAAATGTAAC SRR13716848_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTTGGAAAAACTCTGCGTAGACCTTGT SRR14320055_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATTATTACGGTGGTT SRR14320056_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATTATTCCGGTGGTT SRR14320056_1
1 TTTAGTTCCATCGAATTACGCAACAAATAGAATTTTTAATTAATGTAGCAAAAT SRR13716867_1
1 TTTAGTTCCATCGAATTACGCAACAAATAGAATTTTTAATTAATGTAGCAAAAT SRR14320052_1
11 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATGATTCCGGTGGTT SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAAACCCCAACGGGGAATGAAACAGGCATT SRR14320056_1
2 TTTAGTTCCATCGAATTACGCAACAAAAGATATAAGTGTTTGAAGTTCGTGAAG SRR14320055_1
2 TTTAGTTCCATCGAATTACGCAACAAAAGGCGTCGGTTATTACTTTGACCAACA SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAAATATGATGTTTTTCAGTAAAAGAGTAG SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAAATTGTTCATCGTTTCACCGTTTTTATA SRR13716869_1
2 TTTAGTTCCATCGAATTACGCAACAAAATTGTTCATCGTTTCACCGTTTTTATA SRR14320051_1
2 TTTAGTTCCATCGAATTACGCAACAAAATTGTTCATCGTTTCACCGTTTTTATA SRR14320060_1
2 TTTAGTTCCATCGAATTACGCAACAAACTAACTTACAACAATATTAAATATCAT SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAAGGCTCTATTCCGAATAGATAGAATCCC SRR14320055_1
2 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATGATTCCGGTGGTT SRR14320055_1
2 TTTAGTTCCATCGAATTACGCAACAAATGAAAACCACCGGCATACGCCGTCGAT SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAATTGTAGTTGAGGTAGCCATTTCTGGGC SRR14320055_1
2 TTTAGTTCCATCGAATTACGCAACAAATTGTTAAAAAAGTTGTCCTACTAGTAA SRR14320055_1
3 TTTAGTTCCATCGAATTACGCAACAAACTAAAATCTGTTTTGTTTAAATCACCA SRR13716853_1
3 TTTAGTTCCATCGAATTACGCAACAAAGAAGCGGACGAGACGGGTGACAATGTG SRR13716853_1
3 TTTAGTTCCATCGAATTACGCAACAAAGATAACGCAATATTTAACGTCAAAGAC SRR13716853_1
3 TTTAGTTCCATCGAATTACGCAACAAAGTACAAGCGCACATAACAAACTGTTCA SRR14320055_1
3 TTTAGTTCCATCGAATTACGCAACAAATAGAAATGGTCGCCATTTCAATCTACG SRR14320055_1
3 TTTAGTTCCATCGAATTACGCAACAAATAGAATTTTTAATTAATGTAGCAAAAT SRR13716870_1
3 TTTAGTTCCATCGAATTACGCAACAAATCTAATGGAGAGAAAAAGAAGCAGTTC SRR14320055_1
3 TTTAGTTCCATCGAATTACGCAACAAATTGTTAAAAAAGTTGTCCTACTAGTAA SRR13716853_1
3 TTTAGTTCCATCGAATTACGCAACAAATTGTTAAAAAAGTTGTCCTACTAGTAA SRR13716857_1
4 TTTAGTTCCATCGAATTACGCAACAAAAGCGTATCACCGCCTCCAATATTTTTA SRR13716857_1
4 TTTAGTTCCATCGAATTACGCAACAAACTGTCACTAACACCTGAGTTCGAAACA SRR13716853_1
4 TTTAGTTCCATCGAATTACGCAACAAAGTAGCACTTGCAAAGTCGGCTAAGAGC SRR13716857_1
4 TTTAGTTCCATCGAATTACGCAACAAATAGAATTTTTAATTAATGTAGCAAAAT SRR14320051_1
5 TTTAGTTCCATCGAATTACGCAACAAACAAAACTGTTACTTTGAGCCATAATAC SRR13716857_1
5 TTTAGTTCCATCGAATTACGCAACAAAGTTTTTATCTATGTCTCTATATGTATC SRR13716848_1
5 TTTAGTTCCATCGAATTACGCAACAAATATTGTAAATCAAATACTCTACCTTAC SRR13716853_1
7 TTTAGTTCCATCGAATTACGCAACAAAGAAAGAAGCAAAACATCATGCTGCAAG SRR14320056_1
7 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATGATTCCGGTGGTT SRR14320056_1
This is the full corresponding line for this anchor from the all_anchors_pvals.tsv
file with the strange distances.
anchor pval_random effect_size_random optHash num_observations target_entropy mean_target_hamming_distance mean_target_levenshtein_distance number_nonzero_samples cj_rand_opt_SRR13716847_1 cj_rand_opt_SRR13716848_1 cj_rand_opt_SRR13716849_1 cj_rand_opt_SRR13716850_1 cj_rand_opt_SRR13716851_1 cj_rand_opt_SRR13716852_1 cj_rand_opt_SRR13716853_1 cj_rand_opt_SRR13716854_1 cj_rand_opt_SRR13716855_1 cj_rand_opt_SRR13716856_1 cj_rand_opt_SRR13716857_1 cj_rand_opt_SRR13716858_1 cj_rand_opt_SRR13716859_1 cj_rand_opt_SRR13716861_1 cj_rand_opt_SRR13716862_1 cj_rand_opt_SRR13716863_1 cj_rand_opt_SRR13716864_1 cj_rand_opt_SRR13716865_1 cj_rand_opt_SRR13716866_1 cj_rand_opt_SRR13716867_1 cj_rand_opt_SRR13716868_1 cj_rand_opt_SRR13716869_1 cj_rand_opt_SRR13716870_1 cj_rand_opt_SRR13716871_1 cj_rand_opt_SRR13716872_1 cj_rand_opt_SRR13716873_1 cj_rand_opt_SRR13716874_1 cj_rand_opt_SRR13716876_1 cj_rand_opt_SRR13716878_1 cj_rand_opt_SRR13716879_1 cj_rand_opt_SRR13716880_1 cj_rand_opt_SRR13716881_1 cj_rand_opt_SRR14320045_1 cj_rand_opt_SRR14320046_1 cj_rand_opt_SRR14320047_1 cj_rand_opt_SRR14320048_1 cj_rand_opt_SRR14320049_1 cj_rand_opt_SRR14320050_1 cj_rand_opt_SRR14320051_1 cj_rand_opt_SRR14320052_1 cj_rand_opt_SRR14320053_1 cj_rand_opt_SRR14320054_1 cj_rand_opt_SRR14320055_1 cj_rand_opt_SRR14320056_1 cj_rand_opt_SRR14320057_1 cj_rand_opt_SRR14320058_1 cj_rand_opt_SRR14320059_1 cj_rand_opt_SRR14320060_1 cj_rand_opt_SRR14320061_1 pval_random_corrected
TTTAGTTCCATCGAATTACGCAACAAA 1.0 0.2328869047619047 9.0 85.0 1.1863137138648343 15.939130434782609 0.4695652173913043 6 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0 0.0 0.0 0.0 0.0 0.0 1.0
I pulled the latest version of the repository and ran compute_pvals on the above table with the command:
python code/nomad/bin/compute_pvals.py --infile TTTAGTTCCATCGAATTACGCAACAAA.tsv.gz --outfile_scores results.tsv
And I got similar results with the hamming distance and levenshtein distances being very different:
anchor pval_random effect_size_random optHash num_observations target_entropy mean_target_hamming_distance mean_target_levenshtein_distance number_nonzero_samples cj_rand_opt_SRR13716853_1 cj_rand_opt_SRR14320055_1 cj_rand_opt_SRR14320056_1
TTTAGTTCCATCGAATTACGCAACAAA 1.0 0.23288690476190477 9.0 85.0 1.1863137138648345 16.097560975609756 0.43902439024390244 9 -1.0 1.0 -1.0
I ran the compute_spectral_pvals.py
on the same file with the command:
python code/nomad/bin/compute_spectral_pvals.py --infile tmp.tsv.gz --outfile_scores results.tsv
And got these results:
anchor pval_SVD_corrAnalysis effect_size_cts_SVD pval_rand_init_EM pval_base effect_size_base M anch_uniqTargs number_nonzero_samples target_entropy entropy_difference mean_target_levenshtein_distance mean_target_hamming_distance
TTTAGTTCCATCGAATTACGCAACAAA 0.090951377 0.337413283 0.000151066 0.001432949 0.354983203 123 39 9 4.758697007 2.501551102 13.6097561 16.09756098
You can see that the levenshtein distance more closely matches the hamming distance. So it looks like it's just an issue with compute_pvals.py
, which I assume you're going to just be replacing with compute_spectral_pvals.py
?
Thanks for bringing this up. Yes, I can now see the bug you found; this was due to some unfortunate column naming and pandas merge issues. The latter distances are correct (~13.6097561 and 16.09756098), and I have updated the compute_pvals.py to reflect this. For the settings we were looking at these discrepancies were not so egregious, and so this went unnoticed for too long. Thanks again for your careful reading!
I've been puzzled by some unusually low mean target levenshtein distances. The hamming distances seem accurate, but not the levenshtein distances. I've been analyzing a collection of vibrio isolates with the nextflow implementation of NOMAD.
Here is a count table that I've observed:
The targets are highly diverse, but here are the reported metrics:
There's a high mean hamming distance, which I expected, but the levenshtein distance is low. Any idea what's going on?
Best, Matt D