stenglein-lab / TreeTangler

MIT License

What will happen if # tips on the two trees are unequal? #6

Open stenglein-lab opened 6 years ago

meekrob commented 6 years ago

The leaf names of the trees Orthobunyavirus_M and Orthobunyavirus_L appear to disagree. Running cophy-treetools/bin/test_leaf_lookups.js on M versus L gives:

Comparing Orthobunyavirus_M and Orthobunyavirus_L
6: Patios_sample -1 NOT_FOUND 91 Patois_sample
13: Itimirim_sample -1 NOT_FOUND 60 Mirim_sample
15: Acara_sample -1 NOT_FOUND 66 Acra_sample
22: Ananindeua_sample_real -1 NOT_FOUND 73 Ananindeua_sample
31: Gumbo_Limbo_sample -1 NOT_FOUND 80 Gumboimbo_sample
32: Nepuyo_sample -1 NOT_FOUND 4 Aino_sample
47: Faceys_Paddock_reference -1 NOT_FOUND 14 Faceys_Paddock_refrence
56: Boracein_batch3_sample -1 NOT_FOUND 48 Boraceia_batch3_sample
60: Lasaloyas_sample -1 NOT_FOUND 50 Las_Maloyas_sample
62: Okala_sample -1 NOT_FOUND 54 Okola_sample
87: Bunyamwera_reference -1 NOT_FOUND 19 Bunyamwera_refrence
95: Watermelon_silverottle_reference -1 NOT_FOUND 95 Watermelon_silver_mottle_reference

... whereas L versus M gives:

Comparing Orthobunyavirus_L and Orthobunyavirus_M
14: Faceys_Paddock_refrence -1 NOT_FOUND 47 Faceys_Paddock_reference
19: Bunyamwera_refrence -1 NOT_FOUND 87 Bunyamwera_reference
48: Boraceia_batch3_sample -1 NOT_FOUND 56 Boracein_batch3_sample
50: Las_Maloyas_sample -1 NOT_FOUND 60 Lasaloyas_sample
54: Okola_sample -1 NOT_FOUND 67 Nola_sample
64: Itimirim_Kappa_sample -1 NOT_FOUND 13 Itimirim_sample
66: Acra_sample -1 NOT_FOUND 15 Acara_sample
73: Ananindeua_sample -1 NOT_FOUND 22 Ananindeua_sample_real
80: Gumboimbo_sample -1 NOT_FOUND 31 Gumbo_Limbo_sample
81: NEPV -1 NOT_FOUND 32 Nepuyo_sample
91: Patois_sample -1 NOT_FOUND 6 Patios_sample
95: Watermelon_silver_mottle_reference -1 NOT_FOUND 95 Watermelon_silverottle_reference

Most mismatches are misspellings or variations, but in some cases the best match isn't the same in both directions. Comparing the M tree to the L tree, Nepuyo_sample matches closest to Aino_sample. The reciprocal comparison does not report Aino_sample at all, which suggests that Aino_sample exists in both the M and L trees and that Nepuyo_sample is simply being matched, erroneously, to the closest available string.

Likewise, there are discrepancies in the L to M direction. Okola_sample should match Okala_sample in M, since they differ by one character, but it is matched to Nola_sample instead. Nola_sample --> Okola_sample is edit distance 2, right? Check the outcome of stringsimilarity.bestMatch.
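A possible explanation, sketched under an assumption: the string-similarity README describes its score as Dice's coefficient over adjacent character pairs, not an edit distance. The `bigramCounts` and `dice` helpers below are hypothetical re-implementations for illustration, not the library's own code. Under that kind of scoring, Okola_sample shares nine bigrams with each candidate, but Nola_sample has one bigram fewer in total, so it scores higher than Okala_sample.

```javascript
// Count the adjacent character pairs (bigrams) in a string.
function bigramCounts(s) {
  const counts = new Map();
  for (let i = 0; i < s.length - 1; i++) {
    const pair = s.slice(i, i + 2);
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * (shared bigrams) / (bigrams in a + bigrams in b).
// Assumption: this approximates how string-similarity scores a pair.
function dice(a, b) {
  if (a === b) return 1;
  if (a.length < 2 || b.length < 2) return 0;
  const counts = bigramCounts(a);
  let shared = 0;
  for (let i = 0; i < b.length - 1; i++) {
    const pair = b.slice(i, i + 2);
    const remaining = counts.get(pair) || 0;
    if (remaining > 0) {
      counts.set(pair, remaining - 1); // each bigram in a is used at most once
      shared++;
    }
  }
  return (2 * shared) / (a.length - 1 + b.length - 1);
}

const toNola = dice("Okola_sample", "Nola_sample");   // 18 / 21
const toOkala = dice("Okola_sample", "Okala_sample"); // 18 / 22
console.log(toNola.toFixed(3), toOkala.toFixed(3));   // prints "0.857 0.818"
```

If that assumption holds, the Nola_sample result is expected behavior for a bigram score rather than a bug in the lookup.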

Also, NEPV in L is matched to Nepuyo_sample in M.

The current behavior forces any leaf without an appropriate match to attach to the most similarly named leaf, even if that leaf is already matched. Requiring reciprocal best matches will eliminate this case.
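A minimal sketch of the reciprocal-best idea, assuming plain Levenshtein distance as the score and a small hand-picked subset of the leaf names above. `levenshtein`, `bestIndex`, and `reciprocalBest` are illustrative names, not functions from cophy-treetools or either string-matching module.

```javascript
// Classic dynamic-programming edit distance (insert, delete, substitute).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Index of the candidate with the smallest distance to `name` (first wins ties).
function bestIndex(name, candidates) {
  let best = -1;
  let bestDist = Infinity;
  candidates.forEach((cand, j) => {
    const d = levenshtein(name, cand);
    if (d < bestDist) {
      bestDist = d;
      best = j;
    }
  });
  return best;
}

// Keep a pair only when each leaf is the other's best match.
function reciprocalBest(a, b) {
  const aToB = a.map((name) => bestIndex(name, b));
  const bToA = b.map((name) => bestIndex(name, a));
  const pairs = [];
  aToB.forEach((j, i) => {
    if (j !== -1 && bToA[j] === i) pairs.push([a[i], b[j]]);
  });
  return pairs;
}

// Hypothetical subset of the two trees' leaf names.
const treeM = ["Aino_sample", "Nepuyo_sample", "Okala_sample"];
const treeL = ["Aino_sample", "Okola_sample", "NEPV"];
const pairs = reciprocalBest(treeM, treeL);
console.log(pairs);
```

In this subset, Nepuyo_sample's nearest name in L is Aino_sample, but Aino_sample's own best match is Aino_sample, so the pair is dropped and Nepuyo_sample stays unmatched instead of being attached to an already-matched leaf. The same happens to NEPV under this distance, so unmatched leaves would still need to be reported or resolved some other way.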

meekrob commented 6 years ago

Replacing string-similarity with a different module, fast-levenshtein, gives the right answer for Okola_sample <=> Okala_sample, but loses the connection between NEPV and Nepuyo_sample. fast-levenshtein also appears to be slightly faster than string-similarity. Considering switching.