Topics classification - Githubissues

glerzing commented 1 year ago

In order to improve the diversity of our recommendations, or to allow users to filter on specific topics, we need to be able to automatically assign topics to YouTube videos.

Sources of information include the captions, titles, descriptions, and category id of the videos. The category id is a type of topic that may not be sufficient for our purpose.

glerzing commented 1 year ago

There are a lot of techniques:

1 - Unsupervised algorithms that create topics themselves as groups of words. They usually have efficient implementations, and the number of documents for each topic is well balanced. But the results are quite random: you have to name the underlying topics yourself based on the output words, and the output topics may not correspond to what you want. I tried Latent Dirichlet Allocation; it's fast and the results are pretty interesting. There is also top2vec.

2 - Unsupervised algorithms that can take a list of topics as input, and output the topic(s) corresponding to a video. I tried with lbl2vec, Lbl2TransformerVec and GPT-3 (curie (= level 3) and davinci (= level 4)). To compare the solutions, I asked them to predict the "category id" of English YouTube videos, which has 15 possible values, like "Entertainment", "Science & Technology" or "Education". A lot of these labels are questionable because there could be multiple answers (I tried myself and only got 5 correct responses out of 10, even though I knew which categories appear frequently). I will search for a better benchmark if I have the time, but here are the accuracy results :

Curie, truncated at 1000 characters of caption (≈ 300 tokens, since transformer input is limited in size): 5 %, less than random chance (7 %)!
Curie, truncated at 2000 characters of caption: still 5 %!
lbl2vec, based on doc2vec: 15 %
Lbl2TransformerVec, truncated at 1500 characters of caption: 21 %
DaVinci, truncated at 5000 characters of caption: 39 %!! But it can become expensive if done on a large batch (0.03 € per 1000 input tokens). It cost me around 3 € for 100 API calls.

I didn't add the title and description of the videos, which would probably improve performance.
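
To put the figures above in perspective, here is a rough sketch of the arithmetic. It only uses numbers quoted in this comment (15 categories, about 300 tokens per 1000 characters, 0.03 € per 1000 input tokens); the function names are made up for illustration, and the character-to-token ratio is only the approximation given above.

```python
# Back-of-the-envelope figures for the benchmark above.
# All constants come from the comment; the function names are illustrative.

N_CATEGORIES = 15                 # possible "category id" values in the benchmark
TOKENS_PER_CHAR = 300 / 1000      # "1000 characters ≈ 300 tokens" (approximation)
DAVINCI_EUR_PER_1K_TOKENS = 0.03  # quoted DaVinci input price


def random_chance(n_classes: int) -> float:
    """Accuracy expected from uniform random guessing among n_classes."""
    return 1 / n_classes


def estimated_cost_eur(n_calls: int, chars_per_call: int,
                       eur_per_1k_tokens: float = DAVINCI_EUR_PER_1K_TOKENS) -> float:
    """Rough input-token cost of n_calls prompts of chars_per_call characters each."""
    tokens = n_calls * chars_per_call * TOKENS_PER_CHAR
    return tokens / 1000 * eur_per_1k_tokens


if __name__ == "__main__":
    print(f"random chance: {random_chance(N_CATEGORIES):.1%}")
    # Upper bound: every caption is assumed to reach the 5000-character cut-off.
    print(f"100 DaVinci calls: {estimated_cost_eur(100, 5000):.2f} EUR")
```

Under these assumptions, 100 calls at the full 5000-character cut-off would come to about 4.50 €, the same order of magnitude as the roughly 3 € reported; captions shorter than the cut-off would be one possible explanation for the gap.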

3 - Supervised algorithms that output the topic(s) corresponding to a video. But you first need some labelled data.

There is also the problem of handling multiple languages. Some pretrained language models may not have a French version.

The best results (about as good as a human annotator) were obtained by automatically prompting GPT-3 DaVinci through the API and asking which topic best corresponds to the caption. But it's expensive (3 € for 100 labels), so it can only be used to annotate part of the captions. With these labels, we might consider training a supervised algorithm.
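
Training a supervised model on such labels could look like the sketch below. The thread does not specify a model, so this uses a tiny multinomial naive Bayes classifier in plain Python purely as a stand-in; the class name, the whitespace tokenizer and the example texts are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayesTopicClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per topic
        self.word_counts = defaultdict(Counter)    # word counts per topic
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:             # ignore unseen words
                    score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Hypothetical captions with labels, e.g. obtained from an LLM annotator.
    texts = ["how to cook pasta at home", "easy pasta recipe",
             "black holes and galaxies explained", "physics of black holes"]
    labels = ["Cooking", "Cooking", "Science", "Science"]
    model = NaiveBayesTopicClassifier().fit(texts, labels)
    print(model.predict("pasta recipe"))
```

A real setup would need far more labelled captions and probably a stronger model, but the workflow would be the same: annotate a subset, fit, then predict the rest.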

I would like to have your opinions on this. We also need to discuss which topics we want to have, or how to generate them.

aidanjungo commented 1 year ago

Just a few comments about this issue:

If we want to use the transcripts for this in production, we need to know whether there is a proper way to get them (through the YouTube API?) and at what price.
We probably won't want to use GPT-x or other OpenAI models, as we have spent some time criticizing the way they release unsafe models and all the ethical issues that go with it -> e.g. https://twitter.com/le_science4all/status/1490014328254349323

glerzing commented 1 year ago

There is a caption YouTube API. It's free but there is a quota. And fetching captions quickly depletes the quota (it's 50 times more costly than the metadata, so you can only fetch 200 captions per day). There is a procedure for companies to get a higher quota, but it may be complicated (https://developers.google.com/youtube/v3/guides/quota_and_compliance_audits?hl=fr).
There are not that many great LLMs available out there. And I personally think OpenAI would probably make better use of that money than other tech giants. But I understand that people here don't like OpenAI. So if you don't want to use an OpenAI key, you can call it the :sunglasses: GPT-glerzing approach. You get some annotated data, and you don't need to know where it comes from.
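
The quota figures above can be turned into a quick estimate of how long fetching captions would take. This is only a sketch: the daily quota of 10,000 units and the cost of 1 unit per metadata request are assumptions chosen to match the "50 times more costly" and "200 captions per day" figures, and the function names are made up.

```python
import math

# Assumptions consistent with the figures quoted above, not official values.
DAILY_QUOTA_UNITS = 10_000                  # assumed daily quota
METADATA_COST_UNITS = 1                     # assumed cost of one metadata request
CAPTION_COST_UNITS = 50 * METADATA_COST_UNITS  # "50 times more costly"


def captions_per_day(quota: int = DAILY_QUOTA_UNITS,
                     cost: int = CAPTION_COST_UNITS) -> int:
    """Number of caption downloads that fit in one day's quota."""
    return quota // cost


def days_to_fetch(n_videos: int) -> int:
    """Days needed to fetch captions for n_videos at the daily limit."""
    return math.ceil(n_videos / captions_per_day())


if __name__ == "__main__":
    print(captions_per_day())     # captions that fit in one day
    print(days_to_fetch(1000))    # days needed for 1000 videos
```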

glerzing commented 1 year ago

Another strategy that I didn't think of is to use the tags: the tags that appear frequently sometimes represent topics that we want to include. Here are the lists of the tags in French and English, sorted by the number of times they appear: fr_tags.csv, en_tags.csv

This could help to determine which topics to use, and could be used to train a supervised model.

I think that's the best solution. Now we need to determine the list of topics that we want to have. If these topics are important, there will probably be videos with this topic as a tag.
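
As a rough illustration of the tag-based idea, the sketch below counts how many videos carry each tag and then derives topic labels from a hand-written tag-to-topic mapping. The data, the mapping and the function names are hypothetical, and the layout of fr_tags.csv and en_tags.csv is not assumed here.

```python
from collections import Counter


def normalize(tag):
    """Lowercase and trim a tag so that variants are counted together."""
    return tag.strip().lower()


def tag_counts(videos_tags):
    """Count in how many videos each (normalized) tag appears."""
    counts = Counter()
    for tags in videos_tags:
        counts.update({normalize(t) for t in tags if t.strip()})
    return counts


def weak_labels(tags, tag_to_topic):
    """Topics implied by a video's tags, according to a manual mapping."""
    return sorted({tag_to_topic[normalize(t)] for t in tags
                   if normalize(t) in tag_to_topic})


if __name__ == "__main__":
    # Hypothetical tag lists, one per video.
    videos_tags = [["Science", "physics"], ["science", "Climate"], ["cooking"]]
    print(tag_counts(videos_tags).most_common(2))

    # Hypothetical mapping from frequent tags to the topics we decide to keep.
    tag_to_topic = {"science": "Science", "physics": "Science", "climate": "Environment"}
    print(weak_labels(["Physics", "Climate", "vlog"], tag_to_topic))
```

Labels obtained this way are noisy, but they could serve as training data for a supervised model on the videos that have no useful tags.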

amatissart commented 1 year ago

> There is a caption YouTube API. It's free but there is a quota. And fetching captions quickly depletes the quota (it's 50 times more costly than the metadata, so you can only fetch 200 captions per day). There is a procedure for companies to get a higher quota, but it may be complicated (https://developers.google.com/youtube/v3/guides/quota_and_compliance_audits?hl=fr).

Do you refer to this API? It's restricted to OAuth authentication with specific scopes. So in practice it can be used to fetch captions from your own channels, but not from arbitrary videos. Am I missing something?

glerzing commented 1 year ago

You must be right. If so, can we even use the captions ? Can we use JST's tools in production ?

aidanjungo commented 1 year ago

> You must be right. If so, can we even use the captions ? Can we use JST's tools in production ?

No, I think we would prefer to stick with methods that are not too legally blurry to get the information we use in production.

glerzing commented 1 year ago

I understand. So do we give up on using the transcripts? We might still be able to assign topics to videos. But much of #1475 relies on the transcripts, because it doesn't seem wise to assign scores to videos based on superficial criteria, without even analysing the content of the videos.

glerzing commented 1 year ago

I have been working on something else since. But maybe I could give it a try now.

The first step is to define a list of topics. YouTube already provides a categoryId and each video is classified by a label in youtube_topics.txt. But assuming that this is not sufficient, we can add other topics for which we need to make the classification ourselves. Here is a list of additional topics that I suggest : topics.txt

The classification method that gave the best results was to use OpenAI's API. Using gpt-3.5-turbo to classify all of Tournesol's videos would likely cost around 0.001 € per video (0.002 € per 1000 tokens), so maybe two dozen euros for all of Tournesol's videos (assuming we don't use the transcripts). You can propose other APIs if you want, but I don't know how effective or expensive they will be. I was about to propose a more sophisticated solution, but I think this one is simpler and gives better results than fine-tuning our own transformer encoder. There is a bit of prompt engineering (example.txt) and response processing, but it's not complicated.
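
To give an idea of what the prompt engineering and response processing could involve, here is a minimal sketch. It is not the content of example.txt: the prompt wording, the function names and the topic list are made up, and the call to the API itself is deliberately left out.

```python
def build_prompt(title, description, topics, max_chars=1000):
    """Build a classification prompt from the video metadata (hypothetical wording)."""
    topic_list = "\n".join(f"- {t}" for t in topics)
    return (
        "Which of the following topics best describe this YouTube video?\n"
        f"{topic_list}\n\n"
        f"Title: {title}\n"
        f"Description: {description[:max_chars]}\n\n"
        "Answer with a comma-separated list of topics from the list above."
    )


def parse_topics(response, topics):
    """Keep only the answers that match a known topic, ignoring case and stray punctuation."""
    known = {t.lower(): t for t in topics}
    found = []
    for part in response.replace("\n", ",").split(","):
        key = part.strip().strip(".-• ").lower()
        if key in known and known[key] not in found:
            found.append(known[key])
    return found


if __name__ == "__main__":
    topics = ["Science", "Cooking", "Health"]  # hypothetical topic list
    print(build_prompt("Why is the sky blue?", "A short physics explainer.", topics))
    # Hypothetical model answer containing one topic that is not in the list.
    print(parse_topics("Science, cooking.\nAstrology", topics))
```

Filtering the answer against the known topic list is a simple way to discard labels the model invents; unmatched answers could also be logged to see whether the topic list needs extending.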

tournesol-app / tournesol

Topics classification #1468