h1alexbel / sr-detection

Identifying GitHub "sample repositories" (SR), which mostly contain educational or demonstration materials meant to be copied rather than reused as a dependency
MIT License

tfidf pipeline #37

Open h1alexbel opened 4 weeks ago

h1alexbel commented 4 weeks ago

Let's build the following pipeline on all words in the README file, so that we can compare its accuracy with the embeddings pipeline: README -> words -> reduce -> tfidf -> vector -> clustering. The embeddings pipeline currently looks like this: README -> headings -> reduce -> top -> embeddings -> vector -> clustering.
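The README -> words -> reduce -> tfidf -> vector part of this pipeline could be sketched roughly as below. This is only an illustration under assumptions, not the project's implementation: the `STOP_WORDS` list, the function names `words`, `tfidf_vectors` and `cosine`, and the smoothed IDF variant are all hypothetical choices, and the clustering step is left out.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the "reduce" step.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "this", "how"}


def words(readme: str) -> list[str]:
    """README -> words -> reduce: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", readme.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(readmes: list[str]) -> tuple[list[str], list[list[float]]]:
    """words -> tfidf -> vector: one vector per README over a shared vocabulary."""
    docs = [words(r) for r in readmes]
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: the number of READMEs that contain each term.
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF (an assumed variant): terms present in every README keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1  # avoid division by zero for an empty README
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Made-up README snippets, for illustration only.
    readmes = [
        "Example project showing how to use Spring Boot",
        "Sample Spring Boot application for a tutorial",
        "High-performance JSON parsing library",
    ]
    vocab, vectors = tfidf_vectors(readmes)
    print(len(vocab), "terms")
    print("sample vs sample: ", cosine(vectors[0], vectors[1]))
    print("sample vs library:", cosine(vectors[0], vectors[2]))
```

The resulting vectors are what a clustering algorithm would consume in the last step; the same clustering code used for the embeddings pipeline could probably be reused, which would keep the accuracy comparison fair.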

If all words turn out to be too abstract for clustering, we can try reducing the scope to headings, as we did with embeddings.
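If the fallback to headings is needed, extracting them from a Markdown README might look like the sketch below. It assumes ATX-style headings (`#` to `######`) and skips fenced code blocks so that shell comments are not mistaken for headings; the function name `headings` is hypothetical, and Setext-style (underlined) headings are not handled.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the example readable
HEADING = re.compile(r"^#{1,6}\s+(.*?)(?:\s+#+)?\s*$")


def headings(readme: str) -> list[str]:
    """Return the text of ATX-style Markdown headings, ignoring fenced code blocks."""
    found = []
    in_fence = False
    for line in readme.splitlines():
        stripped = line.strip()
        # Toggle on opening/closing fences so that "# comment" lines in code are skipped.
        if stripped.startswith(FENCE) or stripped.startswith("~~~"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(stripped)
        if match:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    # Made-up README, for illustration only.
    readme = "\n".join([
        "# Spring Boot Sample",
        "Some introduction text.",
        "## Usage ##",
        FENCE + "sh",
        "# this is a shell comment, not a heading",
        FENCE,
        "### How to contribute",
    ])
    print(headings(readme))
```

The heading strings could then go through the same reduce and tfidf steps as the full README text, which would keep the two variants of the pipeline directly comparable.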

h1alexbel commented 4 weeks ago

This source may be relevant: https://web.stanford.edu/~jurafsky/slp3/6.pdf Here is a more practical guide: https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency.