inception-project / external-recommender-dkpro-tc

External recommender for DKPro TC
Apache License 2.0

Limit simultaneous training requests #4

Closed Horsmann closed 5 years ago

Horsmann commented 5 years ago

@Rentier Training should run async now. While a training job is running, additional incoming requests are rejected with a 'too many jobs' return code.
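The gate described above could look roughly like this. This is an illustrative sketch, not the project's actual code; the class and constant names are made up, and 429 is just one plausible choice for the 'too many jobs' status.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: allow one async training job at a time and
// immediately reject further requests with a 'too many jobs' code.
public class TrainingGate {
    public static final int ACCEPTED = 202;
    public static final int TOO_MANY_JOBS = 429; // HTTP "Too Many Requests"

    private final Semaphore permit = new Semaphore(1);
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Returns ACCEPTED if training was started, TOO_MANY_JOBS otherwise. */
    public int submitTraining(Runnable trainJob) {
        if (!permit.tryAcquire()) {
            return TOO_MANY_JOBS; // a job is already running
        }
        executor.submit(() -> {
            try {
                trainJob.run();
            } finally {
                permit.release(); // free the slot when training ends
            }
        });
        return ACCEPTED;
    }

    public void shutdown() {
        executor.shutdown();
    }
}
```

The permit is acquired synchronously in the request thread, so a second request arriving while training runs is rejected immediately instead of queueing.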

jcklie commented 5 years ago

Nice. Did you try it with INCEpTION?

Horsmann commented 5 years ago

No, I am actually not sure what I have to set up. Once this Tomcat stuff is running, how do I reach a setup where a recommender is actually used for something? I probably have to define a project of some sort?

I tested with curl.

jcklie commented 5 years ago

You can run INCEpTION.java in inception-app-webapp or so, no need for Tomcat. Then you create a project, import files and define a recommender in the project settings.

Horsmann commented 5 years ago

Following problem came up, which I cannot directly solve.

I use CrfSuite as backend at the moment. Apparently, crfsuite does not like being called simultaneously for training and prediction. The pipes break, and there is a good chance that prediction and training both fail if this happens simultaneously. This probably originates in the way the binary is implemented; I vaguely recall similar issues in the past. As long as you ensure a single call at a time, this seems to work. Simultaneous predictions seem to work.

@reckart I assume the RuntimeProvider is smart enough to figure out that a binary is already available on the file system and just re-uses it, right? Thus, when training is running and a prediction request comes in, the runtime provider will pick the same binary that is already used for training? Is the RuntimeProvider configurable in a way that each request can be served with its own copy of the binary? Maybe this helps; I am not sure, but this is something I could try. At the moment, the I/O streams of the binary crash when training and prediction occur together.

@Rentier As a quick fix, I could let prediction return a temporarily-unavailable code while the model is training. This would also mean that if there is actually a lot of request traffic for both training and prediction, it might take some time until a request catches a free slot to be served.

There is also no really well-suited sequence classification alternative in TC. SvmHmm and VowpalWabbit both do sequence classification, but the former scales poorly and the latter does not reach state-of-the-art results.

Horsmann commented 5 years ago

I might have found a solution but this requires an upgrade to the latest TC snapshot.

One issue seems to remain: a race condition. When training finishes and the model is being written to disk while a prediction request comes in that wants to use exactly that model, we get a problem. I still have to look into this one.
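A standard way to avoid this kind of write/read race (an assumption about a possible fix, not necessarily how TC persists models) is to write the new model to a temporary file and then swap it in with an atomic rename, so readers only ever see a complete file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: write the new model to a side file, then atomically rename it
// over the old one. A concurrent reader sees either the complete old
// model or the complete new one, never a half-written file.
public class AtomicModelWriter {
    public static void replaceModel(Path modelFile, byte[] newModel) throws IOException {
        Path tmp = modelFile.resolveSibling(modelFile.getFileName() + ".tmp");
        Files.write(tmp, newModel); // the slow write happens off to the side
        Files.move(tmp, modelFile,
                StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE); // instant swap
    }
}
```

This relies on both paths being on the same file system, which is the case when the temp file sits next to the model file.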

Horsmann commented 5 years ago

@Rentier what is the best behavior if a model is requested that is not available either because it does not exist or is currently being trained? Return with no prediction or a try-again-later return code?

I will have to add some additional logic to deal with requests for models that are being (re)trained. The question is: should I rather wait for training to finish, or bail out early and just return nothing?

reckart commented 5 years ago

@reckart I assume the RuntimeProvider is smart enough to figure out that a binary is already available on the file system and just re-uses it, right?

Once install() has been called, additional calls to install() have no effect unless uninstall() is called in between. If you want every request to use its own copy of binaries, you just have to create a new instance of the RuntimeProvider. Don't forget to call uninstall() when you don't need the runtime anymore, otherwise they will accumulate on disk.
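The behaviour reckart describes can be mimicked with a small mock (this is a hypothetical stand-in, not the real DKPro Core RuntimeProvider): install() is a no-op once the binary is on disk, uninstall() resets that, and each instance manages its own copy, so per-request instances yield per-request binaries.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical mock of the pattern, NOT the real RuntimeProvider:
// repeated install() calls reuse the same copy; separate instances
// get separate copies; uninstall() removes the copy from disk.
public class MockRuntimeProvider {
    private Path workspace; // null while not installed

    public Path install() throws IOException {
        if (workspace == null) { // further install() calls have no effect
            workspace = Files.createTempDirectory("runtime");
            Files.write(workspace.resolve("crfsuite"), new byte[] { 0x7f });
        }
        return workspace.resolve("crfsuite");
    }

    public void uninstall() throws IOException {
        if (workspace != null) {
            Files.deleteIfExists(workspace.resolve("crfsuite"));
            Files.deleteIfExists(workspace);
            workspace = null; // next install() extracts a fresh copy
        }
    }
}
```

The point of the mock is the lifecycle: one provider instance per request gives each request its own binary, and forgetting uninstall() leaves copies accumulating on disk, exactly as reckart warns.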

jcklie commented 5 years ago

@Horsmann I would return a 412 or so and bail out. For the case of retraining: is it possible to use the old model while the training is not finished? When it's finished, you could then replace the old model with the new one.
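The keep-serving-the-old-model idea can be sketched with an atomic reference (illustrative types, not the project's actual classes): predictions read the current model while training builds the new one, and the reference is swapped only after training completes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of serving the old model during retraining: readers always get
// a fully trained model (or null if none exists yet); the training job
// publishes the new model in one atomic step when it is done.
public class ModelHolder<M> {
    private final AtomicReference<M> current = new AtomicReference<>();

    /** Prediction path: may return null if no model was ever trained. */
    public M get() {
        return current.get();
    }

    /** Called by the training job only after the new model is fully built. */
    public void publish(M newModel) {
        current.set(newModel); // readers switch over atomically
    }
}
```

Because the swap is a single reference assignment, prediction never observes a partially trained model, and no locking is needed on the read path.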

reckart commented 5 years ago

For the case of retraining: is it possible to use the old model while the training is not finished? When it's finished, you could then replace the old model with the new one.

That is a good idea 👍

Horsmann commented 5 years ago

OK; so prediction might now return a PRECONDITION_FAILED if

a) a new model is being written to disk at that moment, regardless of whether the prediction tries to use this model or not. During the disk write, no predictions are served (this should not take very long, maybe a second or two, during which the old model is removed and replaced by the new one).

b) no model is available that provides predictions for the requested information/layer.
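The two cases above could be checked roughly like this (a sketch with hypothetical names; 412 is HTTP PRECONDITION_FAILED as discussed):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of the two PRECONDITION_FAILED cases: refuse
// predictions while a model file is being replaced on disk (case a),
// or when no model exists for the requested layer (case b).
public class PredictionGuard {
    public static final int OK = 200;
    public static final int PRECONDITION_FAILED = 412;

    private final AtomicBoolean modelWriteInProgress = new AtomicBoolean(false);
    private final Map<String, byte[]> modelsByLayer = new ConcurrentHashMap<>();

    public int checkPredictionRequest(String layer) {
        if (modelWriteInProgress.get()) {
            return PRECONDITION_FAILED; // case a): model swap in progress
        }
        if (!modelsByLayer.containsKey(layer)) {
            return PRECONDITION_FAILED; // case b): no model for this layer
        }
        return OK;
    }

    public void beginModelWrite()  { modelWriteInProgress.set(true); }
    public void finishModelWrite() { modelWriteInProgress.set(false); }

    public void addModel(String layer, byte[] model) {
        modelsByLayer.put(layer, model);
    }
}
```

A client receiving 412 can simply retry later; per the discussion, the write window should only last a second or two.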

@Rentier You will probably have to retrain the models. Moving up to the snapshot probably includes some changes somewhere that break old models.

Horsmann commented 5 years ago

@reckart Is there a reason why UKP's jenkins fails all the time while ours builds without problems?

reckart commented 5 years ago

@Horsmann I'm sure there is, but I don't know what the reason is (I didn't investigate). Try adding more logging to see what happens.

Horsmann commented 5 years ago

@Rentier INCEpTION seems to basically work.

What I noticed: even with zero annotations, INCEpTION sends training requests; if no data has been annotated yet, INCEpTION shouldn't ask for training. I can catch this on my side, but I would have to prematurely deserialize the CAS and check whether there is anything in there for training. Could you prevent training requests without actual content?

jcklie commented 5 years ago

I will look into that.

Horsmann commented 5 years ago

Thanks. Otherwise this looks good :).

I increased the number of log messages a bit, but this should work now. Once this is merged, we can close this issue.