Closed by h1alexbel 7 months ago
Some of these algorithms I found so far:
Let's investigate the fundamental differences between the LR, RF and KNN algorithms, and which is the best option for our use case.
I performed a small comparison of traditional ML classification algorithms vs. HuggingFace Transformers used for text classification. The key difference, to my knowledge, is in the data format that we should provide to the model during the training stage. For instance, for RF (Random Forest) we can provide the following information:
full_name
description
created_at
last_commit
readme
label
The model, with the help of the RF learning algorithm, will build its trees based on this data and the marked label value for each row. To me, it's a more flexible approach for our case, since we can use not only text data, but numeric and other data too.
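To make the RF input more concrete, here is a minimal sketch of how one such row could be turned into a numeric feature vector before training. The feature choices (ages in days, word counts, a has-README flag), the function name `to_features`, and the sample repository are assumptions made for illustration only; the resulting vectors, together with the label column, could then be handed to any Random Forest implementation.

```python
from datetime import datetime, timezone


def to_features(repo, now):
    """Turn one repository row into a list of numeric features.

    The selected features are illustrative assumptions, not a fixed schema.
    """
    created = datetime.fromisoformat(repo["created_at"])
    last_commit = datetime.fromisoformat(repo["last_commit"])
    description = repo.get("description") or ""
    readme = repo.get("readme") or ""
    return [
        float((now - created).days),        # repository age in days
        float((now - last_commit).days),    # days since the last commit
        float(len(description.split())),    # description length in words
        float(len(readme.split())),         # README length in words
        1.0 if readme else 0.0,             # 1.0 if a README exists
    ]


# Hypothetical row; the values are made up for the example.
row = {
    "full_name": "example/demo",
    "description": "A demo project",
    "created_at": "2023-01-01T00:00:00+00:00",
    "last_commit": "2023-12-02T00:00:00+00:00",
    "readme": "",
    "label": "SAMPLE",
}
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
features = to_features(row, now)
print(features)
```

Note that a repository without a README still produces a complete feature vector here, which is the flexibility argument made above.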
When it comes to text transformers like those from HuggingFace, they accept only text and can reason only about text data. In our case, that is the readme. So the dataset we would provide to such a model will look like this:
readme
label
The model, using deep learning and self-attention mechanisms, would learn the text of the READMEs together with their marked labels. This approach is quite powerful, but more resource-intensive, and to me it looks less flexible. Why? With text transformers we can classify GitHub repositories as SAMPLE or REAL using only their README. However, some repositories don't have a README, so they will be skipped or analyzed suboptimally. The solution I see is to put more generic data about the GitHub repository into this text for analysis. Let's say we scrape the GitHub repository page, for instance yegor256/takes, and gather all the information the model requires (similarly to the data format used for the RF algorithm). Now we can present this data to the model as a prompt (like we do with ChatGPT and similar models), and it should give us the prediction.
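As a rough sketch of that idea, the same fields used for RF could be flattened into a single text input. The prompt wording, the `N/A` placeholder for missing values, the function name `repo_to_prompt`, and the sample repository are all assumptions for illustration; how well a given model responds to this layout would still need to be checked.

```python
FIELDS = ["full_name", "description", "created_at", "last_commit", "readme"]


def repo_to_prompt(repo):
    """Serialize repository metadata into one text prompt.

    Missing or empty fields are written as 'N/A' so that repositories
    without a README still yield a usable input.
    """
    lines = ["Classify this GitHub repository as SAMPLE or REAL."]
    for name in FIELDS:
        value = repo.get(name)
        lines.append(f"{name}: {value if value else 'N/A'}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical repository without a README; values are made up.
repo = {
    "full_name": "example/demo",
    "description": "A demo project",
    "created_at": "2023-01-01",
    "last_commit": "2023-12-02",
    "readme": None,
}
prompt = repo_to_prompt(repo)
print(prompt)
```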
RQ2: Which approach performs better for classification: the RF machine learning algorithm or transformers?
Let's investigate the learning algorithm and process for our classification task. This step is crucial for gaining a fundamental understanding of how learning proceeds and how classification works inside the algorithm.