h1alexbel / samples-filter

Command-line filter for GitHub repositories that contain "samples", instead of real project or framework or library
MIT License

Investigate learning algorithm for real/sample repo classification #75

Closed h1alexbel closed 7 months ago

h1alexbel commented 7 months ago

Let's investigate the learning algorithm and process for our classification task. This step is crucial for gaining a fundamental understanding of how learning works and how classification will happen inside the algorithm.

h1alexbel commented 7 months ago

Some of the algorithms I have found so far:

h1alexbel commented 7 months ago

Let's investigate the fundamental differences between the LR, RF, and KNN algorithms, and which is the best option for our use case.

0pdd commented 7 months ago

@h1alexbel 2 puzzles #81, #82 are still not solved.

h1alexbel commented 7 months ago

I performed a small comparison of traditional ML classification algorithms vs. HuggingFace Transformers used for text classification. The key difference, to my knowledge, is in the data format that we should provide to the model during the training stage. For instance, for RF (Random Forest) we can provide this information:

The model, with the help of the RF learning algorithm, will build its trees based on this data and the label assigned to each row. To me, it's a more flexible approach for our case, since we can use not only text data, but numeric and other data too.
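To make the flexibility point concrete, here is a minimal sketch of how one mixed-type row could be turned into a numeric feature vector for a tree-based learner. The field names (`stars`, `topics`, `readme`), the `SAMPLE_WORDS` list, and the two example repositories are hypothetical, chosen only for illustration; they are not the feature set or data used in this project.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical keywords that might hint at a "samples" repository.
SAMPLE_WORDS = ("sample", "example", "demo", "tutorial")


@dataclass
class RepoRow:
    """One training row; every field here is an illustrative assumption."""
    name: str
    stars: int
    topics: List[str] = field(default_factory=list)
    readme: Optional[str] = None  # None when the repository has no README


def to_features(row: RepoRow) -> List[float]:
    """Encode numeric, boolean, and text-derived signals as one flat vector."""
    readme = row.readme or ""
    text = " ".join([row.name, readme, *row.topics]).lower()
    return [
        float(row.stars),                                   # numeric signal
        1.0 if row.readme is not None else 0.0,             # README present?
        float(len(readme)),                                 # README length in characters
        float(len(row.topics)),                             # number of topics
        float(sum(text.count(w) for w in SAMPLE_WORDS)),    # keyword hits in all text
    ]


# Made-up rows with hand-assigned labels, one label per row.
rows = [
    RepoRow("acme/spring-samples", stars=3, topics=["examples"], readme="Sample apps"),
    RepoRow("acme/web-framework", stars=950, topics=["http", "java"]),  # no README
]
labels = ["SAMPLE", "REAL"]
X = [to_features(r) for r in rows]
```

A matrix like `X` together with `labels` could then be passed to a Random Forest implementation, for example scikit-learn's `RandomForestClassifier` via `fit(X, labels)`. Note that the repository without a README still yields a usable row.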

When it comes to text transformers like those from HuggingFace, they accept only text and can reason only about text data. In our case, that's the README. So the dataset we would provide to such a model will look like this:

The model, using deep learning and self-attention mechanisms, would learn from the README text field together with the assigned labels. This approach is quite powerful, but more resource-intensive, and to me it looks less flexible. Why? With text transformers we can classify GitHub repositories as SAMPLE or REAL using only the README. However, some repositories don't have a README, so they will be skipped or analyzed suboptimally. The solution I see is to put more generic data about the GitHub repository into this text for analysis. Let's say we scrape the GitHub repository page, for instance yegor256/takes, and gather all the information the model requires (similar to the data format used for the RF algorithm). Now we can present this data to the model as a prompt (like we do with ChatGPT and similar models), and it should give us the prediction.
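The idea of folding repository metadata into the text could be sketched roughly as follows. The function name `repo_to_text`, its parameters, the line layout, and the example values are all assumptions made for illustration; nothing here is scraped from a real repository.

```python
from typing import List, Optional


def repo_to_text(
    name: str,
    description: Optional[str],
    topics: List[str],
    stars: int,
    readme: Optional[str],
    max_readme_chars: int = 500,
) -> str:
    """Render repository metadata as one text document for a text classifier."""
    lines = [
        f"Repository: {name}",
        f"Description: {description or '(none)'}",
        f"Topics: {', '.join(topics) if topics else '(none)'}",
        f"Stars: {stars}",
        # Truncate long READMEs so the text stays within a model's input limit;
        # a missing README is stated explicitly instead of dropping the repository.
        f"README: {readme[:max_readme_chars] if readme else '(none)'}",
    ]
    return "\n".join(lines)


# Placeholder values, not data from a real repository.
text = repo_to_text(
    name="acme/spring-samples",
    description="Code samples for a workshop",
    topics=["examples"],
    stars=3,
    readme=None,
)
print(text)
```

With this layout, a repository without a README still produces a non-empty input, so it does not have to be skipped.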

h1alexbel commented 7 months ago

RQ2: Which approach performs better for classification: the Machine Learning RF algorithm or transformers?
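If "performs better" is read as prediction quality (rather than speed or resource usage), one way to compare the two approaches is to score both on the same held-out labels. The sketch below assumes SAMPLE is the positive class; the label lists are toy values invented for illustration, not results from either model.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "SAMPLE"
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels and predictions, invented for this example only.
y_true = ["SAMPLE", "REAL", "SAMPLE", "REAL", "SAMPLE"]
rf_pred = ["SAMPLE", "REAL", "REAL", "REAL", "SAMPLE"]           # hypothetical RF output
transformer_pred = ["SAMPLE", "SAMPLE", "SAMPLE", "REAL", "SAMPLE"]  # hypothetical transformer output

rf_scores = binary_metrics(y_true, rf_pred)
transformer_scores = binary_metrics(y_true, transformer_pred)
```

Reporting precision and recall alongside accuracy matters here because two models can reach the same accuracy while making different kinds of mistakes.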

0pdd commented 7 months ago

@h1alexbel 5 puzzles #81, #82, #88, #89, #90 are still not solved.

0pdd commented 7 months ago

@h1alexbel 4 puzzles #81, #82, #89, #90 are still not solved; solved: #88.

0pdd commented 7 months ago

@h1alexbel 2 puzzles #81, #82 are still not solved; solved: #88, #89, #90.

0pdd commented 7 months ago

@h1alexbel 3 puzzles #104, #81, #82 are still not solved; solved: #88, #89, #90.

0pdd commented 6 months ago

@h1alexbel 3 puzzles #119, #81, #82 are still not solved; solved: #104, #88, #89, #90.

0pdd commented 6 months ago

@h1alexbel all 7 puzzles are solved here: #104, #119, #81, #82, #88, #89, #90.