Investigate learning algorithm for real/sample repo classification

h1alexbel commented 7 months ago

Let's investigate the learning algorithm and process for our classification task. This step is crucial for gaining a fundamental understanding of how learning works and how classification will work inside the algorithm.

h1alexbel commented 7 months ago

Some of the algorithms I have found so far:

Logistic Regression: Simple and effective for binary classification tasks.
Random Forest: An ensemble method that combines multiple decision trees.
K-Nearest Neighbours (KNN): Classifies a sample by majority vote among its k closest training samples.

h1alexbel commented 7 months ago

Let's investigate the fundamental differences between the LR, RF, and KNN algorithms, and which is the best option for our use case.

0pdd commented 7 months ago

@h1alexbel 2 puzzles #81, #82 are still not solved.

h1alexbel commented 7 months ago

I ran a small comparison of traditional ML classification algorithms vs. HuggingFace Transformers used for text classification. The key difference, to my knowledge, is the data format that we have to provide to the model during the training stage. For instance, for RF (Random Forest) we can provide the following fields:

full_name
description
created_at
last_commit
readme
label

The model, with the help of the RF learning algorithm, will build its trees from this data and the label assigned to each row. To me, this is the more flexible approach for our case, since we can use not only text data, but also numeric and other kinds of data.
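To make the RF data format more concrete, here is a minimal sketch of how one row could be turned into a numeric feature vector plus a label. The repository names, field values, the choice of features, and the `to_features` helper are all hypothetical and only illustrate the idea. The resulting `X` and `y` could then be passed to an RF implementation such as scikit-learn's `RandomForestClassifier`.

```python
from datetime import datetime, timezone


def to_features(repo, now):
    """Turn one repository record into a numeric feature vector.

    The selection of features here is an assumption, not a tested design.
    """
    created = datetime.fromisoformat(repo["created_at"])
    last_commit = datetime.fromisoformat(repo["last_commit"])
    description = repo.get("description") or ""
    readme = repo.get("readme") or ""
    return [
        (now - created).days,      # repository age in days
        (now - last_commit).days,  # days since the last commit
        len(description),          # description length in characters
        len(readme),               # README length in characters
        1 if readme else 0,        # whether a README exists at all
    ]


# Made-up rows in the format listed above (not real data).
rows = [
    {
        "full_name": "example/real-project",
        "description": "HTTP framework",
        "created_at": "2015-01-01T00:00:00+00:00",
        "last_commit": "2024-01-01T00:00:00+00:00",
        "readme": "# Project\nInstall and usage docs.",
        "label": "REAL",
    },
    {
        "full_name": "example/demo-app",
        "description": "",
        "created_at": "2023-06-01T00:00:00+00:00",
        "last_commit": "2023-06-02T00:00:00+00:00",
        "readme": None,  # a repository without a README still yields a row
        "label": "SAMPLE",
    },
]

# A fixed "now" keeps the example reproducible.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
X = [to_features(row, now) for row in rows]
y = [row["label"] for row in rows]
```

Note that the repository without a README is not dropped: it simply gets a README length of 0 and a "has README" flag of 0.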

Text transformers, like those from HuggingFace, accept only text and can reason only about text data. In our case that is the README. So the dataset we would provide to such a model would look like this:

readme
label

The model, using deep learning and self-attention mechanisms, would learn from the README text together with the assigned labels. This approach is quite powerful, but it needs more resources and, to me, looks less flexible. Why? With text transformers we can classify GitHub repositories as SAMPLE or REAL using only their README. However, some repositories don't have a README, so they will be skipped or analyzed suboptimally. The solution I see is to put more generic data about the GitHub repository into this text. Let's say we scrape the GitHub repository page, for instance yegor256/takes, and gather all the information the model requires (similar to the data format used for the RF algorithm). Now we can present this data to the model as a prompt (like we do with ChatGPT and similar models), and it should give us a prediction.
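The idea of presenting repository metadata as text could be sketched roughly like this. The `to_prompt` helper, the `N/A` placeholder, and the sample values are hypothetical. The actual prompt layout would need to be chosen and evaluated separately.

```python
# Same fields as in the RF data format, minus the label.
FIELDS = ["full_name", "description", "created_at", "last_commit", "readme"]


def to_prompt(repo):
    """Serialize repository metadata into one text block for a text model.

    Missing or empty fields are written as "N/A" (an assumed placeholder),
    so a repository without a README still produces usable input.
    """
    lines = []
    for name in FIELDS:
        value = repo.get(name)
        lines.append(f"{name}: {value if value else 'N/A'}")
    return "\n".join(lines)


# Made-up repository record without a README (not real data).
repo = {
    "full_name": "example/demo-app",
    "description": "Demo of a todo list",
    "created_at": "2023-06-01",
    "last_commit": "2023-06-02",
    "readme": None,
}

prompt = to_prompt(repo)
print(prompt)
```

With this format, the training set for the transformer would hold `prompt` and `label` instead of `readme` and `label`.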

h1alexbel commented 7 months ago

RQ2: Which approach performs better at classification: the Random Forest ML algorithm or transformers?

0pdd commented 7 months ago

@h1alexbel 5 puzzles #81, #82, #88, #89, #90 are still not solved.

0pdd commented 7 months ago

@h1alexbel 4 puzzles #81, #82, #89, #90 are still not solved; solved: #88.

0pdd commented 7 months ago

@h1alexbel 2 puzzles #81, #82 are still not solved; solved: #88, #89, #90.

0pdd commented 7 months ago

@h1alexbel 3 puzzles #104, #81, #82 are still not solved; solved: #88, #89, #90.

0pdd commented 6 months ago

@h1alexbel 3 puzzles #119, #81, #82 are still not solved; solved: #104, #88, #89, #90.

0pdd commented 6 months ago

@h1alexbel all 7 puzzles are solved here: #104, #119, #81, #82, #88, #89, #90.

h1alexbel / samples-filter

Investigate learning algorithm for real/sample repo classification #75