tl-its-umich-edu / annoto-gai

This is the GitHub project for the Annoto GAI work

Handling videos with no clear topic consensus #29

Closed. takposha closed this issue 4 months ago

takposha commented 4 months ago

For some videos, topics are extracted, but they are not deemed cohesive enough to qualify as a single topic of discussion in the video.

The proper fix for this is to adjust BERTopic to attempt a topic consolidation if no clear topics are found. If there is still no consensus on the topics after that, an error can be raised.
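A minimal sketch of that control flow, to make the proposal concrete. Everything here is hypothetical: `select_topics`, `NoTopicConsensusError`, and the `extract` / `consolidate` callables are illustrative names, not functions from this repository or from BERTopic. The two callables stand in for the normal extraction pass and the consolidation pass, and each is assumed to return an empty list when it finds no clear topic.

```python
from typing import Callable, List, Sequence


class NoTopicConsensusError(RuntimeError):
    """Hypothetical error: no cohesive topic found, even after consolidation."""


def select_topics(
    docs: Sequence[str],
    extract: Callable[[Sequence[str]], List[str]],
    consolidate: Callable[[Sequence[str]], List[str]],
) -> List[str]:
    """Try normal extraction first, then consolidation, then give up.

    Both callables are assumed to return an empty list when no clear
    topic is found.
    """
    # First pass: the regular topic extraction.
    topics = extract(docs)
    if topics:
        return topics

    # Second pass: attempt to consolidate the weaker topics.
    topics = consolidate(docs)
    if topics:
        return topics

    # Still no consensus: surface the problem to the caller.
    raise NoTopicConsensusError("No clear topic consensus for this video.")
```

Raising a dedicated exception type lets the caller tell "this video has no clear topic" apart from other failures and skip or flag the video accordingly.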

A longer-term alternative is to use the sub-topics that are already being generated as the basis for question generation. This would give us a lot of options and flexibility for fine-tuning and for choosing the kinds of questions we want to generate and provide.

takposha commented 4 months ago

A working solution for this has been implemented.

Switching the clustering model from HDBSCAN to k-means ensures that a minimum number of topics is generated. Including the vectorizer model helps as well, since it significantly cuts down on noise when the k-means algorithm is used.
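For reference, a rough sketch of how such a fallback model can be configured. It is not the code from the working solution: the function name, the default of 5 topics, the seed, and the English stop-word list are all assumptions made for illustration. It relies on BERTopic accepting a scikit-learn clustering model through its `hdbscan_model` argument and a `CountVectorizer` through `vectorizer_model`.

```python
def build_kmeans_fallback_model(n_topics: int = 5, seed: int = 42):
    """Build a BERTopic model that clusters with k-means instead of HDBSCAN.

    Hypothetical helper: the name and default values are illustrative only.
    """
    if n_topics < 1:
        raise ValueError("n_topics must be at least 1")

    # Imported lazily so the heavy dependencies are only loaded when the
    # fallback is actually needed.
    from bertopic import BERTopic
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import CountVectorizer

    # k-means assigns every document to one of n_topics clusters, so no
    # document is left as an outlier.
    cluster_model = KMeans(n_clusters=n_topics, random_state=seed, n_init=10)

    # Dropping stop words keeps filler terms out of the topic
    # representations (assumes English transcripts).
    vectorizer_model = CountVectorizer(stop_words="english")

    return BERTopic(hdbscan_model=cluster_model, vectorizer_model=vectorizer_model)
```

A caller would then run something like `topics, _ = build_kmeans_fallback_model(n_topics=5).fit_transform(docs)` only after the default HDBSCAN-based model has failed to produce usable topics.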

k-means is inferior to HDBSCAN in most cases and is only used as a fallback option. It can often produce two topics with the same title. This can be fixed by appending a count to each duplicate title to indicate that the topics are different. However, a better solution might be to use LangChain to generate a longer, more descriptive topic title.
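A small sketch of the count-appending approach, assuming topic titles arrive as a plain list of strings. The function name `disambiguate_titles` and the ` (n)` suffix format are illustrative choices, not what the implementation necessarily uses.

```python
from collections import Counter
from typing import List


def disambiguate_titles(titles: List[str]) -> List[str]:
    """Append a running count to every title that occurs more than once."""
    totals = Counter(titles)    # how often each title occurs overall
    seen: Counter = Counter()   # how many times each duplicate has been emitted
    result = []
    for title in titles:
        if totals[title] > 1:
            seen[title] += 1
            result.append(f"{title} ({seen[title]})")
        else:
            # Unique titles are left untouched.
            result.append(title)
    return result


print(disambiguate_titles(["Regression", "Sampling", "Regression"]))
# ['Regression (1)', 'Sampling', 'Regression (2)']
```

This keeps the titles distinguishable but does not make them more informative, which is why a longer generated title may still be the better option.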

A PR can be opened to merge the code, depending on the priority of other issues we would like to address first.