If the semantic distance cutoff is not sufficient to rein in unrelatedness, explicitly add the semantic distance to the ranking function as a (weak) additional weight
Also include the number of phonemes prior to the overlap (the fewer the better). This is related to some notion of the percentage of the words' "phonetic content" contained in the overlap
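
A rough sketch of how these ranking ideas could fit together. Everything named here is hypothetical and not taken from existing code: the `Candidate` fields, the `score` function, and the weight values are placeholders. It also assumes the overlap sits at the end of the first word and the start of the second, and that semantic distance is already normalized to [0, 1].

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical word pair joined on a shared run of phonemes."""
    phonemes_a: list[str]     # phonemes of the first word
    phonemes_b: list[str]     # phonemes of the second word
    overlap_len: int          # number of phonemes shared at the join
    semantic_distance: float  # assumed in [0, 1]; 0 means closely related


def overlap_fraction(c: Candidate) -> float:
    """Share of both words' phonemes that lies inside the overlap."""
    total = len(c.phonemes_a) + len(c.phonemes_b)
    return 2 * c.overlap_len / total if total else 0.0


def phonemes_before_overlap(c: Candidate) -> int:
    """Phonemes of the first word that come before the overlap starts."""
    return len(c.phonemes_a) - c.overlap_len


def score(c: Candidate, w_prefix: float = 0.1, w_semantic: float = 0.05) -> float:
    """Higher is better. Both weights are illustrative guesses, not tuned values."""
    return (
        overlap_fraction(c)
        - w_prefix * phonemes_before_overlap(c)   # fewer leading phonemes is better
        - w_semantic * c.semantic_distance        # weak penalty for unrelated pairs
    )


# Usage: same second word and overlap length, different lead-in lengths.
short_lead = Candidate(["K", "AE", "T"], ["AE", "T", "AH", "M"], 2, 0.2)
long_lead = Candidate(["S", "K", "R", "AE", "T"], ["AE", "T", "AH", "M"], 2, 0.2)
ranked = sorted([long_lead, short_lead], key=score, reverse=True)
```

Keeping `w_semantic` small means the semantic term mostly breaks ties between candidates with similar phonetic scores, which matches the intent of a weak additional weight applied on top of the cutoff.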