takposha closed this 5 months ago
The above commit handles the comments raised in the PR. However, there are some things based on this that I will address in the next PR:
There is a bug in the `TopicModeller` class when it comes to saving that I haven't been able to replicate consistently and squash, but I am looking into it.
In some cases, the BERTopic model will fail to save, with an `IOError` stating that files are already open.
This appears to happen when the `TopicModeller` object hasn't been destroyed and a new one tries to access the same model again, but I need to test this to confirm. The solution here will be to simply delete the whole object on completion of the process. This isn't done here as we need it for the future question generation component.
I would like to look into this further, though, as model saving and token-count tracking are not as clean a solution as I'd hoped for.
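For reference, the "delete the whole object on completion" idea could look roughly like the sketch below. It uses a hypothetical `StubModeller` class as a stand-in for `TopicModeller` (it loads no real model), and `run_and_release` is an illustrative helper name, not something in this PR. Whether this actually clears the save error is untested, as noted above.

```python
import gc
import weakref


class StubModeller:
    """Hypothetical stand-in for TopicModeller; it holds no real model."""

    def __init__(self, model_path):
        self.model_path = model_path

    def extract(self, text):
        # Placeholder work: return the distinct lowercase words, sorted.
        return sorted(set(text.lower().split()))


def run_and_release(model_path, text):
    """Run one extraction, then drop the modeller before returning."""
    modeller = StubModeller(model_path)
    tracker = weakref.ref(modeller)  # lets the caller verify the release
    try:
        topics = modeller.extract(text)
    finally:
        # Remove the only strong reference and collect any cycles, so that
        # resources owned by the object are freed before a new one is built.
        del modeller
        gc.collect()
    return topics, tracker


topics, tracker = run_and_release("model_dir", "Topic models group topic words")
print(topics)
print(tracker() is None)
```

The weak reference is only there to check that the object is really gone. In the real pipeline the release would have to wait until the question generation step no longer needs the object.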
Having a smaller PR next time would make the review process easier.
That is the plan, hopefully. The next PR will be for just the question generation, which should be much smaller.
I will get this merged and get that going next.
This PR primarily handles topic extraction as outlined in #9. BERTopic is used alongside KeyBERT and LangChain to extract the topics discussed in a video transcript over time, as well as the corresponding model for topic extraction. This will be used for generating questions in a future issue. It also fixes the issues raised in #12 and #13.
For #12: A separate Python script, `captionsProcessor.py`, has been added that performs all the steps, similar to the `captionsProcessor.ipynb` notebook. As of now, configuration variables are modified in the script itself. A future issue could be raised for adding more configurability through environment variables.
For #13: The `helper.py` script has been split into smaller modular files to better reflect the functions' uses for different parts of the code. These include `transcriptLoader.py`, `topicExtractor.py`, and `utils.py`.
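
As a rough idea of what the environment-variable configuration mentioned for #12 might look like, here is a minimal sketch. The setting names (`CAPTIONS_TRANSCRIPT_PATH`, `CAPTIONS_MODEL_DIR`, `CAPTIONS_MIN_TOPIC_SIZE`), their defaults, and the `Config` and `load_config` names are all hypothetical and are not taken from `captionsProcessor.py`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical settings; the real script may need different ones."""
    transcript_path: str
    model_dir: str
    min_topic_size: int


def load_config(env=None):
    """Build a Config from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return Config(
        transcript_path=env.get("CAPTIONS_TRANSCRIPT_PATH", "transcript.txt"),
        model_dir=env.get("CAPTIONS_MODEL_DIR", "models"),
        # Environment values are strings, so numeric settings are converted.
        min_topic_size=int(env.get("CAPTIONS_MIN_TOPIC_SIZE", "10")),
    )


# Passing a plain dict keeps the example independent of the real environment.
config = load_config({"CAPTIONS_MIN_TOPIC_SIZE": "5"})
print(config)
```

Accepting an optional mapping instead of reading `os.environ` directly makes the loader easy to test, and the defaults keep the script runnable when no variables are set.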