Fixes #35

takposha commented 4 months ago

This PR fixes #35.

This turned out to be a longer venture than initially expected, as the QuestionData class and the functions it uses had to be reworked significantly. The original implementation was built around a fixed number of questions, with exactly one generated per topic. This was quite limiting and offered no easy way to track the information needed to handle a dynamic number of questions.

The new implementation uses DataFrames inside the classes to store all information associated with a given generated question, instead of spreading it across separate variables and objects. The new variable responseInfo contains the times, the topic and keywords, the truncation history, the relevant text, the question query, and the actual response from ChatGPT containing the generated question. This makes the data a lot easier to track and handle. A direct improvement that can now be made is to question generation itself: the keywords associated with the topic can be passed to the prompt as well, to further refine the generated question. This will be implemented in a later PR, as this one is quite large already.
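As a rough illustration of the "one row per generated question" layout described above, here is a minimal sketch. The field names are assumptions inferred from the description, not the actual responseInfo columns, and plain dicts stand in for the DataFrame so the example runs without pandas.

```python
from dataclasses import dataclass, asdict

# Hypothetical fields, inferred from the description of responseInfo;
# the real column names in the PR may differ.
@dataclass
class QuestionRecord:
    topic: int
    keywords: list
    start_time: float
    end_time: float
    truncation_history: list
    relevant_text: str
    question_query: str
    response: str

def to_rows(records):
    """Convert records to a list of dicts, one dict per generated question."""
    return [asdict(r) for r in records]

records = [
    QuestionRecord(0, ["cell", "membrane"], 12.0, 95.5, [], "sample text A", "sample query A", "sample question A"),
    QuestionRecord(0, ["cell", "membrane"], 140.0, 210.0, [1], "sample text B", "sample query B", "sample question B"),
    QuestionRecord(1, ["enzyme"], 300.0, 365.0, [], "sample text C", "sample query C", "sample question C"),
]

rows = to_rows(records)

# Everything about a question lives in one row, so filtering by topic is a
# single pass instead of a lookup across several parallel objects.
questions_for_topic_0 = [r for r in rows if r["topic"] == 0]
```

A list of dicts like `rows` is one of the inputs the `pandas.DataFrame` constructor accepts, so the same records could be turned into a tabular form with one row per question.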

As a result, the new implementation adds an option to set a minimum number of questions to attempt to generate for a given transcript. "Attempt" is important here. When BERTopic assigns a topic to a given block of text, it also calculates the frequency of the topic within the text. We want text in which the topic occurs more frequently, as that indicates the text is more relevant to the topic. There can be cases where a topic is not frequent, and it might then not be ideal to use the text to generate a question for that topic. The default threshold is set to 2, which means that if a topic occurs more than twice in a region of text, that region can be used for question generation. Overall, this gives the user some choice in the number of questions to generate, while sticking to only one per topic remains available in the configuration.
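The threshold rule above can be sketched roughly as follows. This is not the code from the PR: `select_regions`, the `one_per_topic` flag, and the dict layout are hypothetical names chosen for illustration, and the topic frequencies are assumed to have been computed already.

```python
def select_regions(regions, threshold=2, one_per_topic=False):
    """Pick the text regions that qualify for question generation.

    Each region is a dict with hypothetical keys "region", "topic",
    and "frequency" (how often the topic occurs in that region).
    """
    # A region qualifies only if its topic occurs more than `threshold` times.
    candidates = [r for r in regions if r["frequency"] > threshold]
    if not one_per_topic:
        return candidates

    # One-per-topic mode: keep only the highest-frequency region per topic.
    best = {}
    for r in candidates:
        current = best.get(r["topic"])
        if current is None or r["frequency"] > current["frequency"]:
            best[r["topic"]] = r
    return list(best.values())

regions = [
    {"region": 0, "topic": 0, "frequency": 5},
    {"region": 1, "topic": 0, "frequency": 3},
    {"region": 2, "topic": 1, "frequency": 2},  # not above the threshold
    {"region": 3, "topic": 2, "frequency": 4},
]

multiple = select_regions(regions)
single = select_regions(regions, one_per_topic=True)
```

With the sample data, topic 1 yields no question in either mode, which is why the configured number of questions is only ever an attempt.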

A bug now exists in the way data is saved and loaded. The saving and loading mechanism currently only checks whether data exists for a given transcript and attempts to load it, irrespective of the number of questions that were configured to be generated. A simple workaround is to be aware that all data needs to be regenerated from scratch when the parameters are changed, and to set the overrides as required. A way to automatically detect such configuration changes and retrieve the matching past questions might also be useful. This might be fixed in the future, depending on the complexity and priority of other issues raised during the testing phase.
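One possible way to detect configuration changes, sketched under assumptions: store a fingerprint of the generation parameters next to the saved data and compare it on load. Nothing like this exists in the PR; the function names and the config keys (`min_questions`, `threshold`) are hypothetical.

```python
import hashlib
import json

def config_fingerprint(config):
    """Return a stable hash of the generation parameters.

    Keys are sorted before serialising so that dict ordering does not
    change the result.
    """
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def should_regenerate(saved_fingerprint, config):
    """True when saved data was produced under different parameters."""
    return saved_fingerprint != config_fingerprint(config)

# Fingerprint stored alongside the saved questions at generation time.
saved = config_fingerprint({"min_questions": 3, "threshold": 2})

same_config = {"threshold": 2, "min_questions": 3}      # same values, other order
changed_config = {"min_questions": 5, "threshold": 2}   # different question count

reuse_ok = not should_regenerate(saved, same_config)
needs_rebuild = should_regenerate(saved, changed_config)
```

Saved data would then be reused only when the fingerprint matches, rather than whenever a file exists for the transcript.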

pushyamig commented 4 months ago

I will review this now.

pushyamig commented 4 months ago

The change looks good, but there is a syntax error that needs to be fixed. Other than that, this is good to go.

tl-its-umich-edu / annoto-gai

Fixes #35 #36