I have updated the _train_topicmodel function to accept any argument that mallet accepts as kwargs, the only difference being that '-' is changed to '_' in the argument names, e.g. num-iterations --> _numiterations. Numeric values can be passed as either numbers or strings.
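For illustration, the kwarg-to-flag translation could look something like this (a minimal sketch only; the helper name and the exact handling of the leading underscore are my assumptions, not necessarily the code in this PR):

```python
# Hypothetical sketch of the kwarg-to-flag conversion described above.
def kwargs_to_mallet_flags(**kwargs):
    """Turn Python-friendly kwargs into MALLET command-line flags.

    Underscores are mapped back to hyphens, so num_iterations=2000
    becomes ['--num-iterations', '2000']. Numeric values may be passed
    as numbers or strings; both are stringified for the command line.
    """
    flags = []
    for name, value in kwargs.items():
        flags.append('--' + name.lstrip('_').replace('_', '-'))
        flags.append(str(value))
    return flags

print(kwargs_to_mallet_flags(num_iterations=2000, optimize_interval=10))
# ['--num-iterations', '2000', '--optimize-interval', '10']
```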
It is backwards compatible, keeping all of the mandatory arguments. The only thing removed is the default of --optimize-interval 10 from within the function. Instead it uses MALLET's default value of 0, which can be overridden manually by passing _optimizeinterval=10 as an argument. This was done to let the user specify hyperparameter values themselves in case optimization is not wanted (for example, by setting alpha=0.05).
Functionality to return loss values gathered during training has also been added. If logperplexity=True, loss values are scraped from the output and returned in a list (they are still printed as usual). By default this option is set to False.
The subprocess module is used to capture the loss values. I saw #2 and had the same issue on macOS but managed to resolve it (output now prints without delay). I have not tested it on Windows, though I think it should work. If it does not, an option could be to fall back to os by default unless logperplexity=True.
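Roughly, the subprocess approach looks like the sketch below (the function names are illustrative, and the "LL/token" line format and the assumption that MALLET logs progress to stderr are mine; the real PR may differ in details):

```python
import re
import subprocess

# MALLET's train-topics progress lines look like "<50> LL/token: -8.63"
# (assumed format for this sketch).
LL_PATTERN = re.compile(r"LL/token:\s*(-?\d+(?:\.\d+)?)")

def parse_ll(line):
    """Return the log-likelihood-per-token value from a log line, or None."""
    match = LL_PATTERN.search(line)
    return float(match.group(1)) if match else None

def train_and_scrape(cmd):
    """Run a MALLET command, echo its output live, and collect LL values."""
    losses = []
    proc = subprocess.Popen(cmd, stderr=subprocess.PIPE,
                            text=True, bufsize=1)  # line-buffered: no delayed printing
    for line in proc.stderr:       # MALLET's logger writes progress to stderr
        print(line, end="")        # still printed as usual
        value = parse_ll(line)
        if value is not None:
            losses.append(value)
    proc.wait()
    return losses
```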
Hello, thank you so much for adding to this project! 🙏
I think some of these changes are very useful but others might take some more thought/testing.
Adding the diagnostics file - awesome!
Adding optional arguments - awesome!
Switching to subprocess - unless we can test on Windows, I'd prefer not to switch. This is meant to be an accessible package (e.g., for teaching), and it needs to run smoothly on Windows. I don't currently have time or access to test on Windows; maybe someone else could do this?
Returning log likelihood - maybe. I'd prefer to send this to an output file rather than giving it priority by returning directly from the function. It doesn't have great correlation with human judgments, and sometimes people over-rely on this metric.
Removing the optimization default - needs changes. Again, the goal is accessibility, and I want people to have the best chance of getting good topic output without having to be an expert. Turning on hyperparameter optimization by default supports those goals. I think it would make more sense to add this as an optional argument to the function with the default still set to 10.
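Concretely, I'm imagining something like this (a sketch only; the parameter names are illustrative, not the actual signature):

```python
# Keep hyperparameter optimization on by default, but let users turn it
# off or set it themselves; names here are illustrative.
def train_topic_model(path_to_mallet, output_dir, optimize_interval=10, **mallet_kwargs):
    flags = ['--optimize-interval', str(optimize_interval)]
    # ... build and run the full MALLET command here ...
    return flags  # returned just so the sketch is self-contained

print(train_topic_model('mallet', 'out'))                       # ['--optimize-interval', '10']
print(train_topic_model('mallet', 'out', optimize_interval=0))  # ['--optimize-interval', '0']
```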