tournesol-app / tournesol

Free and open source code of the https://tournesol.app platform. Meet the community on Discord https://discord.gg/WvcSG55Bf3

Topics classification #1468

Open glerzing opened 1 year ago

glerzing commented 1 year ago

To improve the diversity of our recommendations, or to let users filter on specific topics, we need to be able to automatically assign topics to YouTube videos.

Sources of information include the captions, titles, descriptions, and category id of the videos. The category id is a type of topic that may not be sufficient for our purpose.

glerzing commented 1 year ago

There are many possible techniques:

1 - Unsupervised algorithms that create topics themselves as groups of words. They usually have efficient implementations, and the number of documents per topic is well balanced. But the outcome is fairly random: you have to name the underlying topics yourself based on the output words, and the resulting topics may not correspond to what you want. I tried Latent Dirichlet Allocation; it's fast and the results are pretty interesting. There is also top2vec.

2 - Unsupervised algorithms that take a list of topics as input and output the topic(s) corresponding to a video. I tried lbl2vec, Lbl2TransformerVec and GPT-3 (curie (= level 3) and davinci (= level 4)). To compare the solutions, I asked them to predict the "category id" of English YouTube videos, which has 15 possible values, such as "Entertainment", "Science & Technology" or "Education". Many of these labels are questionable because several answers could be valid (I tried it myself and only got 5 correct responses out of 10, even though I knew which categories appear frequently). I will look for a better benchmark if I have the time, but here are the accuracy results:

I didn't include the titles and descriptions of the videos; adding them would probably improve performance.

3 - Supervised algorithms that output the topic(s) corresponding to a video. But you first need some labelled data.
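The benchmark described in item 2 (predicting YouTube's "category id") could be scored along the following lines. This is only a minimal sketch: the function names and the sample labels are hypothetical and are not the results measured in this thread. It also reports a majority-class baseline, which seems useful here because some categories appear much more frequently than others.

```python
from collections import Counter


def accuracy(predicted, expected):
    """Fraction of videos whose predicted category matches the YouTube one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def majority_baseline(expected):
    """Accuracy obtained by always answering the most frequent category."""
    if not expected:
        return 0.0
    _, count = Counter(expected).most_common(1)[0]
    return count / len(expected)


# Hypothetical labels, for illustration only (not results from this thread).
expected = ["Education", "Entertainment", "Education", "Science & Technology"]
predicted = ["Education", "Education", "Education", "Science & Technology"]

print(accuracy(predicted, expected))
print(majority_baseline(expected))
```

A model is only informative on this benchmark if its accuracy is clearly above the majority baseline.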

There is also the problem of handling multiple languages. Some pretrained language models may not have a French version.

The best results (around as good as a human annotator) were obtained by automatically prompting GPT-3 DaVinci using the API and asking which topic best corresponds to the caption. But it's expensive (3 € for 100 labels), so it can be used just to annotate part of the captions. With these labels, we might consider training a supervised algorithm.

I would like to have your opinions on this. We also need to discuss which topics we want to have, or how to generate them.

aidanjungo commented 1 year ago

Just a few comments about this issue:

glerzing commented 1 year ago

Another strategy that I hadn't thought of is to use the tags: tags that appear frequently sometimes represent topics that we want to include. Here are the lists of tags in French and English, sorted by the number of times they appear: fr_tags.csv, en_tags.csv

This could help determine which topics to use, and the tags could be used to train a supervised model.

I think that's the best solution. Now we need to determine the list of topics that we want to have. If these topics are important, there will probably be videos with this topic as a tag.
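As a rough illustration of the tag-based idea, the sketch below counts how many videos carry each tag and then derives weak topic labels from a hand-written tag-to-topic mapping. Everything here is an assumption made for the example: the video ids, the tags, the mapping, and the function names are hypothetical, and the layout of fr_tags.csv and en_tags.csv is not assumed.

```python
from collections import Counter


def count_tags(videos):
    """Count in how many videos each tag appears (case-insensitive)."""
    counts = Counter()
    for tags in videos.values():
        # A set avoids counting a tag twice for the same video.
        counts.update({t.strip().lower() for t in tags if t.strip()})
    return counts


def weak_labels(videos, tag_to_topic):
    """Assign topics to each video from a hand-written tag -> topic mapping."""
    labels = {}
    for video_id, tags in videos.items():
        normalized = {t.strip().lower() for t in tags}
        topics = sorted({tag_to_topic[t] for t in normalized if t in tag_to_topic})
        if topics:
            labels[video_id] = topics
    return labels


# Hypothetical data, for illustration only.
videos = {
    "vid1": ["Climate", "science"],
    "vid2": ["climate", "politics"],
    "vid3": ["vlog"],
}
tag_to_topic = {"climate": "Environment", "politics": "Politics"}

print(count_tags(videos).most_common(3))
print(weak_labels(videos, tag_to_topic))
```

Videos without any mapped tag (like "vid3" here) stay unlabelled; they are the ones a supervised model trained on the weak labels would have to cover.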

amatissart commented 1 year ago

Are you referring to this API? It's restricted to OAuth authentication with specific scopes. So in practice it can be used to fetch captions from your own channels, but not from arbitrary videos. Am I missing something?

glerzing commented 1 year ago

You must be right. If so, can we even use the captions ? Can we use JST's tools in production ?

aidanjungo commented 1 year ago

> You must be right. If so, can we even use the captions ? Can we use JST's tools in production ?

No, I think we would prefer to stick with methods that are not legally murky to get the information we use in production.

glerzing commented 1 year ago

I understand. So do we give up on using the transcripts? We might still be able to assign topics to videos. But much of #1475 relies on the transcripts, because it doesn't seem wise to assign scores to videos based on superficial criteria, without even analysing the content of the videos.

glerzing commented 1 year ago

I have been working on something else since. But maybe I could give it a try now.

The first step is to define a list of topics. YouTube already provides a categoryId and each video is classified by a label in youtube_topics.txt. But assuming that this is not sufficient, we can add other topics for which we need to make the classification ourselves. Here is a list of additional topics that I suggest : topics.txt

The classification method that gave the best results was to use OpenAI's API. Using gpt-3.5-turbo to classify all of Tournesol's videos would likely cost around 0.001€ per video (0.002€ / K tokens), so maybe two dozen € for the whole set of Tournesol videos (assuming we don't use the transcripts). You can propose other APIs if you want, but I don't know how effective or expensive they will be. I was about to propose a more sophisticated solution, but I think this one is simpler and gives better results than fine-tuning our own transformer encoder. There is a bit of prompt engineering (example.txt) and response processing, but it's not complicated.
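To give an idea of what the prompt engineering and response processing might involve, here is a minimal sketch that deliberately leaves out the API call itself. The prompt wording, function names, and sample topics are hypothetical and are not taken from example.txt or topics.txt. The cost helper simply reuses the price quoted above; the figures of roughly 500 tokens per video and 24,000 videos are only what 0.001€ per video and two dozen € would imply, not measured values.

```python
def build_prompt(title, description, topics, max_description_chars=500):
    """Build a classification prompt from a video's title and description."""
    topic_list = "\n".join(f"- {t}" for t in topics)
    return (
        "Which of the following topics best describe this YouTube video? "
        "Answer with topic names from the list, separated by commas.\n\n"
        f"Topics:\n{topic_list}\n\n"
        f"Title: {title}\n"
        # Truncate long descriptions to keep the token count (and cost) low.
        f"Description: {description[:max_description_chars]}\n"
    )


def parse_topics(response_text, topics):
    """Keep only the answers that match a known topic (case-insensitive)."""
    known = {t.lower(): t for t in topics}
    found = []
    for part in response_text.replace("\n", ",").split(","):
        key = part.strip().strip(".-* ").lower()
        if key in known and known[key] not in found:
            found.append(known[key])
    return found


def estimated_cost_eur(n_videos, tokens_per_video, eur_per_1k_tokens=0.002):
    """Rough cost estimate based on a flat price per 1,000 tokens."""
    return n_videos * tokens_per_video / 1000 * eur_per_1k_tokens


# Hypothetical topics and model answer, for illustration only.
topics = ["Health", "Economy", "History"]
prompt = build_prompt("How inflation works", "A short explainer.", topics)
print(prompt)
print(parse_topics("Health, cooking.\n- Economy", topics))
print(round(estimated_cost_eur(24_000, 500), 2))
```

Filtering the model's answer against the known topic list means that any topic the model invents is silently dropped instead of ending up in the database.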