salzman-lab / SPLASH

MIT License
40 stars 4 forks source link

Possible error with Levenshtein distance calculation #43

Closed durrantmm closed 2 years ago

durrantmm commented 2 years ago

I've been puzzled by some unusually low mean target levenshtein distances. The hamming distances seem accurate, but not the levenshtein distances. I've been analyzing a collection of vibrio isolates with the nextflow implementation of NOMAD.

Here is a count table that I've observed:

target  SRR13716848 SRR13716853 SRR13716857 SRR13716867 SRR13716869 SRR13716870 SRR14320051 SRR14320052 SRR14320055 SRR14320056 SRR14320060 SRR14320061
AAACACCACTTTAAACTTTTGAAAAGA 0   1   0   0   0   0   0   0   0   0   0   0
ACCCCAACGGGGAATGAAACAGGCATT 0   0   0   0   0   0   0   0   0   2   0   0
AGATATAAGTGTTTGAAGTTCGTGAAG 0   0   0   0   0   0   0   0   2   0   0   0
AGCGTATCACCGCCTCCAATATTTTTA 0   0   4   0   0   0   0   0   0   0   0   0
AGGCGTCGGTTATTACTTTGACCAACA 0   2   0   0   0   0   0   0   0   0   0   0
ATATGATGTTTTTCAGTAAAAGAGTAG 0   2   0   0   0   0   0   0   0   0   0   0
ATTGAACTTGATAATCTTCAAGCAAAT 0   0   0   0   0   0   0   0   0   1   0   0
ATTGAACTTGATAATCTTCAAGCAATT 0   0   0   0   0   0   0   0   0   1   0   0
ATTGTTCATCGTTTCACCGTTTTTATA 0   0   0   0   2   0   2   0   0   0   2   1
ATTTCGAACAGGGGAAAATGCACAGTA 0   1   0   0   0   0   0   0   0   0   0   0
ATTTTGGCCGCTTGTACGAAGAAGATT 0   1   0   0   0   0   0   0   0   0   0   0
CAAAACTGTTACTTTGAGCCATAATAC 0   0   5   0   0   0   0   0   0   0   0   0
CAAAACTGTTACTTTGTGCCATAATAC 0   0   1   0   0   0   0   0   0   0   0   0
CGAATTTTCACCACTAATTGTAGTACA 0   0   0   0   0   0   0   0   1   0   0   0
CTAAAATCTGTTTTGTTTAAATAACCA 0   1   0   0   0   0   0   0   0   0   0   0
CTAAAATCTGTTTTGTTTAAATCACCA 0   3   0   0   0   0   0   0   0   0   0   0
CTAACTTACAACAATATTAAATATCAT 0   2   0   0   0   0   0   0   0   0   0   0
CTGTCACTAACACCTGAGTTCGAAACA 0   4   0   0   0   0   0   0   0   0   0   0
CTTGTTCATCGTTTCACCGTTTTTATA 0   0   0   0   1   0   0   0   0   0   0   0
GAAAGAAGCAAAACATCATGCTGCAAG 0   0   0   0   0   0   0   0   0   7   0   0
GAAGCGGACGAGACGGGTGACAATGTG 0   3   0   0   0   0   0   0   0   0   0   0
GATAACGCAATATTTAACGTCAAAGAC 0   3   0   0   0   0   0   0   0   0   0   0
GGCTCTATTCCGAATAGATAGAATCCC 0   0   0   0   0   0   0   0   2   0   0   0
GTACAAGCGCACATAACAAACTGTTCA 0   0   0   0   0   0   0   0   3   0   0   0
GTAGCACTTGCAAAGTCGGCTAAGAGC 0   0   4   0   0   0   0   0   0   0   0   0
GTAGCACTTGCAAAGTCGGCTAAGGGC 0   0   1   0   0   0   0   0   0   0   0   0
GTATTTATCTATGTCTCTAAATGTAAC 1   0   0   0   0   0   0   0   0   0   0   0
GTTGGAAAAACTCTGCGTAGACCTTGT 0   0   0   0   0   0   0   0   1   0   0   0
GTTTATAAGAATATGATTCCGGTGGTT 0   11  0   0   0   0   0   0   2   7   0   0
GTTTATAAGAATATTATTACGGTGGTT 0   0   0   0   0   0   0   0   0   1   0   0
GTTTATAAGAATATTATTCCGGTGGTT 0   0   0   0   0   0   0   0   0   1   0   0
GTTTTTATCTATGTCTCTATATGTATC 5   0   0   0   0   0   0   0   0   0   0   0
TAGAAATGGTCGCCATTTCAATCTACG 0   0   0   0   0   0   0   0   3   0   0   0
TAGAATTTTTAATTAATGTAGCAAAAT 0   0   0   1   0   3   4   1   0   0   0   0
TATTGTAAATCAAATACTCTACCTTAC 0   5   0   0   0   0   0   0   0   0   0   0
TCTAATGGAGAGAAAAAGAAGCAGTTC 0   0   0   0   0   0   0   0   3   0   0   0
TGAAAACCACCGGCATACGCCGTCGAT 0   2   0   0   0   0   0   0   0   0   0   0
TTGTAGTTGAGGTAGCCATTTCTGGGC 0   0   0   0   0   0   0   0   2   0   0   0
TTGTTAAAAAAGTTGTCCTACTAGTAA 0   3   3   0   0   0   0   0   2   0   0   0
CATGCCTAAACCTCGGTACAAAACAAC 0   0   1   0   0   0   0   0   0   0   0   0
CATGCCTAAACCTCGGTACAAATCAAC 0   0   0   0   0   0   0   0   1   0   0   0
CATGCCTAAACCTCGTTACAAAACAAC 0   54  5   2   2   3   5   9   28  19  6   5
CATGCCTAAACCTCGTTACAAAACCAC 0   0   1   0   0   0   0   0   0   0   0   0
CATGCCTAAACCTCGTTACGAAACAAC 0   0   0   0   0   0   0   1   0   0   0   0
CATGCCTAGACCTCGTTACAAAACAAC 0   0   1   0   0   0   0   0   0   0   0   0
CCTGCCTAAACCTCGTTACAAAACAAC 0   0   1   0   0   0   0   0   0   0   0   0
CCTGCCTAAACCTCGTTAGAAAACAAC 0   0   1   0   0   0   0   0   0   0   0   0
TATGCCTGAAACTCAGTATATTGTATA 1   0   0   0   0   0   0   0   0   0   0   0
TCGCCGATTCAGCGACTTAGCCATTAC 0   0   5   0   0   0   0   0   0   0   0   0

The targets are highly diverse, but here are the reported metrics:

anchor  target_entropy  mean_target_hamming_distance    mean_target_levenshtein_distance
TTTAGTTCCATCGAATTACGCAACAAA 1.1863137138648343  15.939130434782609  0.4695652173913043

There's a high mean hamming distance, which I expected, but the levenshtein distance is low. Any idea what's going on?

Best, Matt D

TavorB commented 2 years ago

Thanks for reaching out. I am unfortunately unable to reproduce the above results / error from this data. It seems as though these values do not correspond to this table: the mean_target_hamming_distance for the above table is 9.44, and the mean_target_levenshtein_distance is 8.04. Further, the target_entropy is 3.5. Running the table through compute_pvals.py and compute_spectral_pvals.py gives these correct values. Could you please provide more information / the files you ran this on?

durrantmm commented 2 years ago

I apologize, the table above was inaccurate. This is what is available in the file abundant_stratified_TTT.txt.gz.

1 TTTAGTTCCATCGAATTACGCAACAAAAAACACCACTTTAAACTTTTGAAAAGA SRR13716853_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTGAACTTGATAATCTTCAAGCAAAT SRR14320056_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTGAACTTGATAATCTTCAAGCAATT SRR14320056_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTGTTCATCGTTTCACCGTTTTTATA SRR14320061_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTTCGAACAGGGGAAAATGCACAGTA SRR13716853_1
1 TTTAGTTCCATCGAATTACGCAACAAAATTTTGGCCGCTTGTACGAAGAAGATT SRR13716853_1
1 TTTAGTTCCATCGAATTACGCAACAAACAAAACTGTTACTTTGTGCCATAATAC SRR13716857_1
1 TTTAGTTCCATCGAATTACGCAACAAACGAATTTTCACCACTAATTGTAGTACA SRR14320055_1
1 TTTAGTTCCATCGAATTACGCAACAAACTAAAATCTGTTTTGTTTAAATAACCA SRR13716853_1
1 TTTAGTTCCATCGAATTACGCAACAAACTTGTTCATCGTTTCACCGTTTTTATA SRR13716869_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTAGCACTTGCAAAGTCGGCTAAGGGC SRR13716857_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTATTTATCTATGTCTCTAAATGTAAC SRR13716848_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTTGGAAAAACTCTGCGTAGACCTTGT SRR14320055_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATTATTACGGTGGTT SRR14320056_1
1 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATTATTCCGGTGGTT SRR14320056_1
1 TTTAGTTCCATCGAATTACGCAACAAATAGAATTTTTAATTAATGTAGCAAAAT SRR13716867_1
1 TTTAGTTCCATCGAATTACGCAACAAATAGAATTTTTAATTAATGTAGCAAAAT SRR14320052_1
11 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATGATTCCGGTGGTT SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAAACCCCAACGGGGAATGAAACAGGCATT SRR14320056_1
2 TTTAGTTCCATCGAATTACGCAACAAAAGATATAAGTGTTTGAAGTTCGTGAAG SRR14320055_1
2 TTTAGTTCCATCGAATTACGCAACAAAAGGCGTCGGTTATTACTTTGACCAACA SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAAATATGATGTTTTTCAGTAAAAGAGTAG SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAAATTGTTCATCGTTTCACCGTTTTTATA SRR13716869_1
2 TTTAGTTCCATCGAATTACGCAACAAAATTGTTCATCGTTTCACCGTTTTTATA SRR14320051_1
2 TTTAGTTCCATCGAATTACGCAACAAAATTGTTCATCGTTTCACCGTTTTTATA SRR14320060_1
2 TTTAGTTCCATCGAATTACGCAACAAACTAACTTACAACAATATTAAATATCAT SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAAGGCTCTATTCCGAATAGATAGAATCCC SRR14320055_1
2 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATGATTCCGGTGGTT SRR14320055_1
2 TTTAGTTCCATCGAATTACGCAACAAATGAAAACCACCGGCATACGCCGTCGAT SRR13716853_1
2 TTTAGTTCCATCGAATTACGCAACAAATTGTAGTTGAGGTAGCCATTTCTGGGC SRR14320055_1
2 TTTAGTTCCATCGAATTACGCAACAAATTGTTAAAAAAGTTGTCCTACTAGTAA SRR14320055_1
3 TTTAGTTCCATCGAATTACGCAACAAACTAAAATCTGTTTTGTTTAAATCACCA SRR13716853_1
3 TTTAGTTCCATCGAATTACGCAACAAAGAAGCGGACGAGACGGGTGACAATGTG SRR13716853_1
3 TTTAGTTCCATCGAATTACGCAACAAAGATAACGCAATATTTAACGTCAAAGAC SRR13716853_1
3 TTTAGTTCCATCGAATTACGCAACAAAGTACAAGCGCACATAACAAACTGTTCA SRR14320055_1
3 TTTAGTTCCATCGAATTACGCAACAAATAGAAATGGTCGCCATTTCAATCTACG SRR14320055_1
3 TTTAGTTCCATCGAATTACGCAACAAATAGAATTTTTAATTAATGTAGCAAAAT SRR13716870_1
3 TTTAGTTCCATCGAATTACGCAACAAATCTAATGGAGAGAAAAAGAAGCAGTTC SRR14320055_1
3 TTTAGTTCCATCGAATTACGCAACAAATTGTTAAAAAAGTTGTCCTACTAGTAA SRR13716853_1
3 TTTAGTTCCATCGAATTACGCAACAAATTGTTAAAAAAGTTGTCCTACTAGTAA SRR13716857_1
4 TTTAGTTCCATCGAATTACGCAACAAAAGCGTATCACCGCCTCCAATATTTTTA SRR13716857_1
4 TTTAGTTCCATCGAATTACGCAACAAACTGTCACTAACACCTGAGTTCGAAACA SRR13716853_1
4 TTTAGTTCCATCGAATTACGCAACAAAGTAGCACTTGCAAAGTCGGCTAAGAGC SRR13716857_1
4 TTTAGTTCCATCGAATTACGCAACAAATAGAATTTTTAATTAATGTAGCAAAAT SRR14320051_1
5 TTTAGTTCCATCGAATTACGCAACAAACAAAACTGTTACTTTGAGCCATAATAC SRR13716857_1
5 TTTAGTTCCATCGAATTACGCAACAAAGTTTTTATCTATGTCTCTATATGTATC SRR13716848_1
5 TTTAGTTCCATCGAATTACGCAACAAATATTGTAAATCAAATACTCTACCTTAC SRR13716853_1
7 TTTAGTTCCATCGAATTACGCAACAAAGAAAGAAGCAAAACATCATGCTGCAAG SRR14320056_1
7 TTTAGTTCCATCGAATTACGCAACAAAGTTTATAAGAATATGATTCCGGTGGTT SRR14320056_1

This is the full corresponding line for this anchor from the all_anchors_pvals.tsv file with the strange distances.

anchor  pval_random effect_size_random  optHash num_observations    target_entropy  mean_target_hamming_distance    mean_target_levenshtein_distance    number_nonzero_samples  cj_rand_opt_SRR13716847_1   cj_rand_opt_SRR13716848_1   cj_rand_opt_SRR13716849_1   cj_rand_opt_SRR13716850_1   cj_rand_opt_SRR13716851_1   cj_rand_opt_SRR13716852_1   cj_rand_opt_SRR13716853_1   cj_rand_opt_SRR13716854_1   cj_rand_opt_SRR13716855_1   cj_rand_opt_SRR13716856_1   cj_rand_opt_SRR13716857_1   cj_rand_opt_SRR13716858_1   cj_rand_opt_SRR13716859_1   cj_rand_opt_SRR13716861_1   cj_rand_opt_SRR13716862_1   cj_rand_opt_SRR13716863_1   cj_rand_opt_SRR13716864_1   cj_rand_opt_SRR13716865_1   cj_rand_opt_SRR13716866_1   cj_rand_opt_SRR13716867_1   cj_rand_opt_SRR13716868_1   cj_rand_opt_SRR13716869_1   cj_rand_opt_SRR13716870_1   cj_rand_opt_SRR13716871_1   cj_rand_opt_SRR13716872_1   cj_rand_opt_SRR13716873_1   cj_rand_opt_SRR13716874_1   cj_rand_opt_SRR13716876_1   cj_rand_opt_SRR13716878_1   cj_rand_opt_SRR13716879_1   cj_rand_opt_SRR13716880_1   cj_rand_opt_SRR13716881_1   cj_rand_opt_SRR14320045_1   cj_rand_opt_SRR14320046_1   cj_rand_opt_SRR14320047_1   cj_rand_opt_SRR14320048_1   cj_rand_opt_SRR14320049_1   cj_rand_opt_SRR14320050_1   cj_rand_opt_SRR14320051_1   cj_rand_opt_SRR14320052_1   cj_rand_opt_SRR14320053_1   cj_rand_opt_SRR14320054_1   cj_rand_opt_SRR14320055_1   cj_rand_opt_SRR14320056_1   cj_rand_opt_SRR14320057_1   cj_rand_opt_SRR14320058_1   cj_rand_opt_SRR14320059_1   cj_rand_opt_SRR14320060_1   cj_rand_opt_SRR14320061_1   pval_random_corrected
TTTAGTTCCATCGAATTACGCAACAAA 1.0 0.2328869047619047  9.0 85.0    1.1863137138648343  15.939130434782609  0.4695652173913043  6   0.0 0.0 0.0 0.0 0.0 0.0 -1.0    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 -1.0    0.0 0.0 0.0 0.0 0.0 1.0

I pulled the latest version of the repository and ran compute_pvals on the above table with the command:

python code/nomad/bin/compute_pvals.py --infile TTTAGTTCCATCGAATTACGCAACAAA.tsv.gz --outfile_scores results.tsv

And I got similar results with the hamming distance and levenshtein distances being very different:

anchor  pval_random effect_size_random  optHash num_observations    target_entropy  mean_target_hamming_distance    mean_target_levenshtein_distance    number_nonzero_samples  cj_rand_opt_SRR13716853_1   cj_rand_opt_SRR14320055_1   cj_rand_opt_SRR14320056_1
TTTAGTTCCATCGAATTACGCAACAAA 1.0 0.23288690476190477 9.0 85.0    1.1863137138648345  16.097560975609756  0.43902439024390244 9   -1.0    1.0 -1.0

I ran the compute_spectral_pvals.py on the same file with the command:

python code/nomad/bin/compute_spectral_pvals.py --infile tmp.tsv.gz --outfile_scores results.tsv

And got these results:

anchor  pval_SVD_corrAnalysis   effect_size_cts_SVD pval_rand_init_EM   pval_base   effect_size_base    M   anch_uniqTargs  number_nonzero_samples  target_entropy  entropy_difference  mean_target_levenshtein_distance    mean_target_hamming_distance
TTTAGTTCCATCGAATTACGCAACAAA 0.090951377 0.337413283 0.000151066 0.001432949 0.354983203 123 39  9   4.758697007 2.501551102 13.6097561  16.09756098

You can see that the levenshtein distance more closely matches the hamming distance. So it looks like it's just an issue with compute_pvals.py, which I assume you're going to just be replacing with compute_spectral_pvals.py?

TavorB commented 2 years ago

Thanks for bringing this up. Yes, I can now see the bug you found; this was due to some unfortunate column naming and pandas merge issues. The latter distances are correct (~13.6097561 and 16.09756098), and I have updated the compute_pvals.py to reflect this. For the settings we were looking at these discrepancies were not so egregious, and so this went unnoticed for too long. Thanks again for your careful reading!