Identifying GitHub "sample repositories" (SR): repositories that mostly contain educational or demonstration material meant to be copied rather than reused as a dependency
Let's build the following pipeline over all words in the README file, so that we can compare its accuracy with the embeddings pipeline:
README -> words -> reduce -> tfidf -> vector -> clustering
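A minimal, dependency-free sketch of the word-level pipeline, to make the stages concrete. Several things here are assumptions rather than decisions from this issue: "reduce" is taken to mean lowercasing plus stop-word removal, the `STOP_WORDS` list is a hypothetical placeholder, the IDF is a smoothed variant, and plain k-means with the first `k` vectors as initial centroids stands in for whatever clustering algorithm we end up using.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; "reduce" is ASSUMED to mean
# lowercasing plus stop-word removal.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "for", "in",
              "is", "this", "how", "with"}


def words(readme):
    """README -> words -> reduce."""
    tokens = re.findall(r"[a-z][a-z0-9+#-]*", readme.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Token lists -> L2-normalised TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}  # smoothed
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vocab, vectors


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid (squared Euclidean).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_readmes(readmes, k):
    """README -> words -> reduce -> tfidf -> vector -> clustering."""
    _, vectors = tfidf_vectors([words(r) for r in readmes])
    return kmeans(vectors, k)
```

Calling `cluster_readmes(readmes, k=2)` returns one cluster label per README. Seeding from the first `k` documents keeps runs reproducible but makes the result depend on input order, so a real experiment would probably want a better initialisation.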
The embeddings pipeline currently looks like this:
README -> headings -> reduce -> top -> embeddings -> vector -> clustering
If all words turn out to be too abstract for clustering, we can narrow the scope to headings, as we did with embeddings.
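Narrowing the scope to headings could look roughly like the sketch below. It assumes the READMEs are Markdown and only handles ATX-style (`#`) headings; the `headings` helper is hypothetical and not part of the existing embeddings code. Fenced code blocks are skipped so that shell comments such as `# install deps` are not mistaken for headings.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker

# ATX heading: up to 3 spaces, 1-6 '#', whitespace, text, optional closing '#'s.
ATX_HEADING = re.compile(r"^\s{0,3}#{1,6}\s+(.*?)(?:\s+#+)?\s*$")


def headings(readme):
    """Return the text of ATX headings in a Markdown README, in order."""
    found = []
    in_fence = False
    for line in readme.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
            continue
        if in_fence:
            continue
        match = ATX_HEADING.match(line)
        if match and match.group(1):
            found.append(match.group(1))
    return found
```

The returned heading strings can then be fed into the same tokenising and TF-IDF steps in place of the full README text.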