lambdaofgod / findkit

A Python library for content-based information retrieval

Example of how to use with text documents? #1

Open jamesmcintyre opened 6 years ago

jamesmcintyre commented 6 years ago

This library looks awesome! Is there any way to use this for indexing and retrieving text documents by similarity? I know this would likely need a different type of feature extraction but I'm wondering if your library would easily adapt for that type of data?

I'm a front-end engineer, so I'm not too familiar with the data science and machine learning involved. If it's possible to use your library for this purpose, would it be hard to add an example, or just reference some helpful links in your readme.md?

Lastly, even if your library can't do this, would you be kind enough to point me in the right direction?

Thanks! And thanks so much for this awesome library!

lambdaofgod commented 6 years ago

> I'm wondering if your library would easily adapt for that type of data?

That was the idea behind the library.

Are you familiar with the Bag of Words model? Or do you want to search using more elaborate features (word embeddings, etc.)? I can provide examples of both, but if you just want to search text using Bag of Words, you might look into information retrieval libraries like Elasticsearch or Whoosh.
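For reference, here is a minimal sketch of what Bag of Words similarity search boils down to, in plain Python with only the standard library. It does not use findkit's API; the function names (`bag_of_words`, `cosine_similarity`, `most_similar`) and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, documents, top_k=3):
    """Rank documents by cosine similarity to the query, best first."""
    query_vec = bag_of_words(query)
    scored = [(cosine_similarity(query_vec, bag_of_words(doc)), doc)
              for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical issue titles, used only to exercise the functions above.
documents = [
    "Fix crash when uploading large images",
    "Add dark mode to the settings page",
    "Image upload fails for large files",
]
results = most_similar("upload of large images crashes", documents, top_k=2)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

Note that this matches exact tokens only, so "crash" and "crashes" count as different words. That gap is what stemming or word embeddings try to close.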

jamesmcintyre commented 6 years ago

@lambdaofgod good question! Maybe you can help me answer that, because I might be fine with BoW but I'm not sure.

So imagine the data set is the text content of something like GitHub pull requests (not the code, but the title, description, comments, and user names) or an issue ticket (again, the name and description), and I wanted to take one issue or PR and find other issues or PRs similar to it (or even a wiki page similar to it). I know BoW could likely do OK with this, but would word embeddings (or some other feature extraction) do even better?

EDIT: One last thing: I'm also looking for a way to index documents and then discard the original document data, so that the index contains none of the original data, only features.

I appreciate you responding and your help. I've been trying to learn more ML/NLP over time, but there's just so much to learn!

Thanks!
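On the EDIT's idea of indexing documents and then discarding the originals, one possible approach is the hashing trick: each token is hashed to a bucket number, and only the bucket counts are stored. The sketch below is an assumption about how that could look in plain Python, not something findkit is known to provide; `FeatureIndex`, `hashed_features`, `NUM_BUCKETS`, and the document IDs are all hypothetical names.

```python
import hashlib
import math
import re

NUM_BUCKETS = 1024  # hypothetical size; more buckets means fewer collisions


def hashed_features(text, num_buckets=NUM_BUCKETS):
    """Map text to a sparse {bucket: count} vector via the hashing trick."""
    features = {}
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib is used instead of hash() because hash() of a str
        # changes between Python processes, which would break a saved index.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        features[bucket] = features.get(bucket, 0) + 1
    return features


def cosine(a, b):
    """Cosine similarity between two sparse {bucket: count} vectors."""
    dot = sum(value * b.get(bucket, 0) for bucket, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FeatureIndex:
    """Stores document IDs and hashed feature vectors, never the text."""

    def __init__(self):
        self._vectors = {}

    def add(self, doc_id, text):
        # Only the feature vector is kept; the text itself is not retained.
        self._vectors[doc_id] = hashed_features(text)

    def query(self, text, top_k=3):
        """Return up to top_k (doc_id, score) pairs, most similar first."""
        query_vec = hashed_features(text)
        scored = [(doc_id, cosine(query_vec, vec))
                  for doc_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# Hypothetical documents and IDs, for illustration only.
index = FeatureIndex()
index.add("issue-1", "Fix crash when uploading large images")
index.add("issue-2", "Add dark mode to the settings page")
index.add("pr-7", "Image upload fails for large files")

for doc_id, score in index.query("upload of large images crashes", top_k=2):
    print(f"{doc_id}  {score:.2f}")
```

One caveat: bucket numbers contain no text, but they are not a privacy guarantee. Anyone holding the index can hash a guessed word and check whether its bucket is present.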