tl-its-umich-edu / annoto-gai

This is the GitHub project for the Annoto GAI work.

Code to address Issues #9, #12, #13, and #14

takposha closed this 5 months ago

takposha commented 5 months ago

This PR primarily handles topic extraction as outlined in #9. BERTopic is used alongside KeyBERT and LangChain to extract the topics discussed in a video transcript over time, and to produce the corresponding topic-extraction model. This will be used for generating questions in a future issue. The PR also fixes the issues raised in #12 and #13.

For #12: A separate Python script, captionsProcessor.py, has been added that performs the same steps as the captionsProcessor.ipynb notebook. For now, configuration variables are modified in the script itself; a future issue could be raised to add more configurability through environment variables.

For #13: The helper.py script has been split into smaller modular files that better reflect how the functions are used in different parts of the code: transcriptLoader.py, topicExtractor.py, and utils.py.
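To make "topics over time" concrete: before a topic model like BERTopic can track how topics shift across a video, the transcript is typically grouped into time windows, each of which becomes one document. The sketch below is illustrative only; `CaptionSegment`, `window_transcript`, and the window size are assumptions, not the actual code in transcriptLoader.py or topicExtractor.py.

```python
from dataclasses import dataclass

@dataclass
class CaptionSegment:
    """One caption line from the transcript (hypothetical shape)."""
    start: float  # start time in seconds
    text: str

def window_transcript(segments, window_size=60.0):
    """Group caption segments into fixed-size time windows.

    Each window's joined text can then be fed to a topic model
    (e.g. BERTopic) to track how topics change over the video.
    """
    windows = {}
    for seg in segments:
        key = int(seg.start // window_size)
        windows.setdefault(key, []).append(seg.text)
    # One document per time window, in chronological order.
    return [" ".join(texts) for _, texts in sorted(windows.items())]

segments = [
    CaptionSegment(5.0, "Welcome to the lecture."),
    CaptionSegment(30.0, "Today we cover topic modelling."),
    CaptionSegment(70.0, "BERTopic clusters document embeddings."),
]
docs = window_transcript(segments)
# docs[0] covers 0-60s, docs[1] covers 60-120s
```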

takposha commented 5 months ago

The above commit handles the comments raised in the PR. However, there are some follow-ups from this that I will address in the next PR:

  1. Rework the config class to better handle the variables being used. I think I can improve how the variables are being passed between functions to be a bit clearer.
  2. Add more robust checks for valid and existing files and authentication. There are currently no checks that a video file exists or that credentials are valid; these need to be implemented.
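Both follow-ups could be combined in one place: a config object built from environment variables that validates itself up front. This is only a sketch of that direction; the `Config` class, the variable names (`VIDEO_PATH`, `API_KEY`), and the defaults are all hypothetical, not the repo's actual configuration.

```python
import os
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Config:
    """Hypothetical config object; field names are illustrative."""
    video_path: Path
    api_key: str

    @classmethod
    def from_env(cls):
        # Fall back to in-script defaults when env vars are unset.
        return cls(
            video_path=Path(os.environ.get("VIDEO_PATH", "captions.vtt")),
            api_key=os.environ.get("API_KEY", ""),
        )

    def validate(self):
        """Fail fast instead of erroring midway through processing."""
        errors = []
        if not self.video_path.exists():
            errors.append(f"Input file not found: {self.video_path}")
        if not self.api_key:
            errors.append("API_KEY is not set")
        return errors

cfg = Config.from_env()
problems = cfg.validate()
```

Passing one validated object between functions, rather than loose module-level variables, would also address point 1 above.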

takposha commented 5 months ago

There is a bug in the TopicModeller class around saving that I haven't been able to replicate consistently and squash, but I am looking into it. In some cases, the BERTopic model fails to save with an IOError stating that the files are already open. This appears to happen when a TopicModeller object hasn't been destroyed and a new one tries to access the same model, but I need to test this to confirm. The straightforward fix would be to simply delete the whole object on completion of the process. That isn't done here because we still need the object for the future question-generation component.

But I would like to look into this further, as model saving and token-count tracking are not as clean a solution as I'd hoped for.

takposha commented 5 months ago

> Having a smaller PR next time would make the review process easier.

That is the plan, hopefully. The next PR will be for just the question generation, which should be much smaller.

I will get this merged and get that going next.