piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Cap training by time elapsed, not iterations/params #451

Open piskvorky opened 9 years ago

piskvorky commented 9 years ago

Time elapsed ("wallclock time") is often the most relevant resource to compare alternative models/algorithms on. Is a model trained on param set "A" better than one trained on param set "B", given the same training time (on the same HW)?

So, add an option to allow training a model (word2vec, LDA, LSI, whatever) that continues for a specified amount of time. Don't ask users to specify the number of passes / model updates / whatever upfront -- instead, train as much as possible within the allotted time.

We don't care about millisecond precision here -- stopping at "roughly" the allotted time is fine, give or take a couple of minutes (realistic training times on realistic corpora take hours+).
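A minimal sketch of such a cap, assuming a hypothetical `train_one_pass` callable that performs one full epoch for whatever model is being trained (this is not an existing gensim API, just an illustration of the loop):

```python
import time

def train_for_time(train_one_pass, corpus, budget_seconds):
    """Run whole passes over `corpus` until the wall-clock budget runs out.

    `train_one_pass` stands in for one epoch of any model's training
    (word2vec, LDA, LSI, ...) -- a hypothetical hook, not a gensim API.
    Stopping is coarse: a pass already in flight when the budget expires
    is allowed to finish, matching the "roughly" tolerance described above.
    """
    start = time.monotonic()
    passes = 0
    while time.monotonic() - start < budget_seconds:
        train_one_pass(corpus)
        passes += 1
    return passes
```

The user then specifies only `budget_seconds`, never a number of passes.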

menshikh-iv commented 6 years ago

@piskvorky potentially this is a great feature (I liked a similar thing - time-limited "approximate" aggregation queries for DBs), but how could it be implemented? Different hardware, and many parameters have a huge influence on the process. Do you have any ideas about it?

gojomo commented 6 years ago

For many algorithms the 1st training epoch will give an accurate estimate of the required time for any future epochs... so it'd be relatively easy to, at the end of the first epoch, adjust the number of future epochs to target a user-set elapsed-time goal. (There should also be a warning if this results in a possibly-unwanted corner case, like just 0 or 1 more epochs.)
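This first-epoch estimate could be sketched as follows (a hypothetical helper, assuming later epochs cost roughly the same as the first):

```python
import warnings

def epochs_for_budget(first_epoch_seconds, budget_seconds):
    """Estimate how many *further* epochs fit in the remaining budget,
    assuming future epochs take about as long as the measured first one.
    Warns on the possibly-unwanted corner case of 0 or 1 more epochs.
    """
    remaining = budget_seconds - first_epoch_seconds
    extra_epochs = max(0, int(remaining // first_epoch_seconds))
    if extra_epochs <= 1:
        warnings.warn(
            "time budget allows only %d more epoch(s)" % extra_epochs)
    return extra_epochs
```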

If this were too coarse – e.g. each epoch takes hours, and more precision is deemed necessary – some algorithms might do fine with random skipping of examples from the corpus, at a rate that escalates until the projected finish time is in the desired window.
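The escalating-skip idea could look roughly like this (a sketch, with a hypothetical per-example `train_one` step and an arbitrary escalation step of 0.1):

```python
import random
import time

def train_epoch_to_deadline(examples, train_one, deadline, step=0.1):
    """One pass over `examples`, randomly skipping a growing fraction
    whenever the projected finish time overshoots `deadline` (an absolute
    time.monotonic() timestamp). `train_one` is a hypothetical
    per-example training step, not a gensim API.
    """
    skip_p = 0.0
    start = time.monotonic()
    for seen, example in enumerate(examples, 1):
        if random.random() < skip_p:
            continue  # skip this example to catch up
        train_one(example)
        # Project total duration from the average cost per example so far.
        elapsed = time.monotonic() - start
        projected_finish = start + elapsed / seen * len(examples)
        if projected_finish > deadline:
            skip_p = min(1.0, skip_p + step)  # escalate the skip rate
```

With a generous deadline the skip rate never rises and every example is trained; with a tight one, skipping ramps up until the pass fits.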

More generally, this feature could mesh well with an estimated-time-to-finish feature (either as logged output, or an interrogable property of a model-in-training). May be worth comparing with https://github.com/tqdm/tqdm's progress-estimation options.