Find common words, sub-phrases in list of texts?

foo123 commented 9 years ago

Hello there,

I have an application with a list of texts (definitions), and I need a way to find common occurrences of sub-phrases (sometimes just common words) across them.

I would rather not use a simplistic approach that just matches common words, since the type of matching (e.g. common sub-sequence or n-gram) is important.

I came across lunr.js and am thinking about using it. Any recommendations?

Note that the texts are usually in Greek, but I think I can add custom stemmers and stop words for Greek (per related issues/questions and the references therein).

Thank you

olivernn commented 9 years ago

Let me check I understand what you're trying to implement:

You have a set of documents and you want to find all the documents which contain certain phrases? If this is correct then you can certainly use lunr for this, though you might want to customise it slightly for your use case.

lunr is really about ranking a set of documents by how similar they are to a given query. It sounds like you don't really care about ranking how similar they are, just whether a document contains some words.

You might want to look at adding custom processing. There is an old pull request with code for an ngram tokeniser that might be useful to you.
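For illustration only, here is a minimal sketch of what a word-level n-gram step could look like in plain JavaScript. It does not use lunr's API and is not the code from that pull request. `wordNgrams` is a hypothetical helper, and splitting on anything that is not a letter or digit is an assumption about how the texts should be tokenised.

```javascript
// Hypothetical helper, not part of lunr: split text into word n-grams.
function wordNgrams(text, n) {
  // Lowercase, then split on anything that is not a letter or digit.
  // The "u" flag lets \p{L} and \p{N} match Greek as well as Latin text.
  const words = text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  const grams = [];
  // Slide a window of n words over the word list.
  for (let i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join(" "));
  }
  return grams;
}

console.log(wordNgrams("The quick brown fox", 2));
// [ 'the quick', 'quick brown', 'brown fox' ]
```

A text with fewer than `n` words yields an empty list. Accents are not normalised here, so that would need handling separately if it matters for the Greek texts.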

Greek shouldn't be an issue for lunr. You will just have to set up the right stemmers and stop words; you can see this done for other languages in the lunr-languages repo. If you do put together a Greek language plugin, please do add it to that repository.

Please re-open if you have any further questions or if anything is not clear.

foo123 commented 9 years ago

Thanks for the answer.

> You have a set of documents and you want to find all the documents which contain certain phrases? If this is correct then you can certainly use lunr for this, though you might want to customise it slightly for your use case.

Not exactly, the converse is needed: find documents (actually textual definitions in the list of definitions) that contain similar phrases, for example "the quick brown fox" and "the quick red herring". The context is an application for creating crossword puzzles. While adding definitions to a compiled puzzle, the creator wants to make sure the definitions are more or less unique, meaning no two of them are similar in the way the example above is, without having to check them manually one by one.
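As a rough sketch of the kind of check described above (one possible approach, not something lunr provides out of the box): compare every pair of definitions by the word n-grams they share and score each pair with the Jaccard index (shared n-grams divided by all distinct n-grams in the pair). The names `ngramSet` and `similarPairs` are hypothetical, and the choice of bigrams is an assumption.

```javascript
// Hypothetical helpers for illustration; none of this is lunr's API.
function ngramSet(text, n) {
  // Split on non-letter, non-digit characters (Unicode-aware, so Greek works).
  const words = text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  const grams = new Set();
  for (let i = 0; i + n <= words.length; i++) {
    grams.add(words.slice(i, i + n).join(" "));
  }
  return grams;
}

// Return every pair of definitions sharing at least one n-gram,
// with the shared n-grams and a Jaccard score, most similar first.
function similarPairs(definitions, n) {
  const sets = definitions.map((d) => ngramSet(d, n));
  const pairs = [];
  for (let i = 0; i < sets.length; i++) {
    for (let j = i + 1; j < sets.length; j++) {
      const shared = [...sets[i]].filter((g) => sets[j].has(g));
      if (shared.length === 0) continue;
      const union = sets[i].size + sets[j].size - shared.length;
      pairs.push({ a: i, b: j, shared, score: shared.length / union });
    }
  }
  return pairs.sort((x, y) => y.score - x.score);
}

const definitions = [
  "the quick brown fox",
  "the quick red herring",
  "a lazy dog",
];
console.log(similarPairs(definitions, 2));
// [ { a: 0, b: 1, shared: [ 'the quick' ], score: 0.2 } ]
```

The pairwise loop is quadratic in the number of definitions, which should be fine for the definitions of a single puzzle. A threshold on `score` could then decide which pairs are shown to the puzzle creator.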

Hope this clarifies the question. Any help (or pointers) is appreciated.

olivernn / lunr.js

Find common words, sub-phrases in list of texts? #148