inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.
https://inception-project.github.io
Apache License 2.0

Support DKPro TC External Recommender #522

Closed jcklie closed 5 years ago

jcklie commented 6 years ago

Right now, the DKPro TC external recommender has some issues which make it difficult to use:

1) When predicting, INCEpTION often seems to let the HTTP connection time out because the external recommender takes some time to predict. The task on the TC server side then does not seem to stop, so more and more tasks pile up, eventually running TC out of memory. Increasing this timeout (sketched after this list) would have the problem that the PredictionTask is synchronous, so it blocks the other recommenders from recommending. This is a problem for all recommenders and needs to be addressed, e.g. by making recommenders asynchronous and independent from each other.

2) Training is not fast. I tried to train on the large GermEval 2014 German dataset with a 4 GB heap, but after 15 minutes it had not finished. I propose that we add a flag to INCEpTION recommenders indicating whether they are trainable or not.

3) I want to change the prediction request to use only one document at a time, to better reflect the current recommender Java API in INCEpTION.
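As a rough illustration of the timeout side of item 1, raising the read timeout on the prediction call might look like the sketch below, using the JDK's `java.net.http` client; the endpoint URL, JSON body, and timeout values are placeholders, not the actual INCEpTION recommender client code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class PredictTimeoutSketch
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical endpoint; the real external recommender URL is configured per recommender.
        URI predictUri = URI.create("http://localhost:5000/predict");

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();

        HttpRequest request = HttpRequest.newBuilder(predictUri)
                .timeout(Duration.ofMinutes(2)) // generous read timeout so slow predictions are not cut off
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"document\": \"...\"}"))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
    }
}
```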

reckart commented 6 years ago

When predicting, it looks like INCEpTION often lets the HTTP connection time out, because the external recommender takes some time to predict.

Normally, prediction should be relatively fast. What timeout do we have?

The task on TC server side seems to not stop.

That sounds like a problem with the TC service, i.e. not with INCEpTION.

Training is not fast. tried to train on the large Germeval 2014 de with 4GB heap, but after 15min it did not finish. I propose that we add a flag to INCEpTION recommenders whether they are trainable or not.

Even a slowly training recommender should be trainable. Optimisations could be made by limiting the number of documents used for training, but actually I think as a user I would like my recommenders to be trained on all data, even if they are slow.

The training of the recommenders could be parallelized, though, such that fast-training recommenders (e.g. string matching) can produce their results sooner than slow-training recommenders. Fast-training recommenders could also update their recommendations more often than slow ones. This is currently not done because all predictions go into the same data object and would overwrite each other - I had a naive prototype for this but reverted it for that reason. We can do it, but it requires a bit more refactoring (see the sketch below).
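A rough sketch of that refactoring idea, assuming a simplified, hypothetical `Recommender` interface rather than the real INCEpTION API: each recommender is trained in its own task and writes into its own slot of a concurrent map, so fast recommenders can publish results without overwriting slow ones.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelTrainingSketch
{
    // Hypothetical recommender abstraction - stands in for the real INCEpTION recommender API.
    interface Recommender
    {
        String getName();
        Object trainAndPredict();
    }

    public static void main(String[] args)
    {
        List<Recommender> recommenders = List.of(/* string matcher, external TC recommender, ... */);

        // One result slot per recommender, so fast ones can publish without overwriting slow ones.
        Map<String, Object> predictionsByRecommender = new ConcurrentHashMap<>();

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (Recommender r : recommenders) {
            pool.submit(() -> predictionsByRecommender.put(r.getName(), r.trainAndPredict()));
        }
        pool.shutdown();
    }
}
```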

Horsmann commented 6 years ago

I can have a look into the training of the model to track down what takes so much time, but a rather large dataset might simply take a while to train on.

@Rentier Could you send me the json-request of the full Germeval train-request (or a dropbox link to this file if it exceeds E-mail size)?

The dataset is quite large if I recall correctly. I am not sure if it is the TC part (i.e. feature extraction, I/O) that is slow, or if training the model in CRFsuite, the backend classifier, just takes some time on this amount of data.

jcklie commented 6 years ago

The timeout comes from the HTTP socket. I use the large GermEval request from the test resources folder.

Horsmann commented 6 years ago

I commented out the character n-gram feature for the moment, since collecting all char-ngrams on large text collections takes some time. The recommender uses word information only for now. With this change, it should train in about 25 seconds for the above GermEval-NER training request.

The problem is probably that extracting char-ngrams requires iterating over all of the text, splitting it into the requested lengths, and writing the information to an index on disk. This mainly serves the frequency-based selection that is used, i.e. only the most frequent N character n-grams are kept. This avoids hand-crafting each prefix/suffix feature manually, but it also means that the feature has to run over the entire text and collect the n-grams before the actual features can be created (a sketch of this selection follows below).
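For illustration, the frequency-based selection described above boils down to counting n-grams and keeping the top N. A simplified in-memory sketch, ignoring the on-disk index that DKPro TC actually uses, might look like this:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CharNgramSelectionSketch
{
    // Count all character n-grams of length n and keep only the topN most frequent ones.
    public static List<String> topNgrams(List<String> documents, int n, int topN)
    {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (int i = 0; i + n <= doc.length(); i++) {
                counts.merge(doc.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        List<String> docs = List.of("Berlin liegt in Deutschland", "Hamburg liegt in Deutschland");
        System.out.println(topNgrams(docs, 2, 10)); // ten most frequent character bigrams
    }
}
```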

@Rentier please pull and build a new .jar, this should train a lot faster now. It is probably reasonable to pre-train models if large datasets are already available.

Edit: With character ngrams, model training takes about 5-6 minutes

Horsmann commented 6 years ago

I think I can lower the training time a bit with some tuning in the backend. I will soon make a new minor release with a few performance hacks. Training time will then be around 60 to 90 seconds for the GermEval request, which is probably still too much for a synchronous server, but a lot quicker than before.

jcklie commented 6 years ago

What we can do on the INCEpTION side is run training asynchronously, as we do not have to wait for or care about the result (see the sketch below). For prediction, this is not so easy.
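A minimal sketch of such a fire-and-forget training call, assuming a plain `java.net.http` client and a placeholder `/train` endpoint rather than the actual INCEpTION code:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AsyncTrainSketch
{
    public static void main(String[] args)
    {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical training endpoint; the real URL comes from the recommender configuration.
        HttpRequest trainRequest = HttpRequest.newBuilder(URI.create("http://localhost:5000/train"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"documents\": []}"))
                .build();

        // Fire-and-forget: submit the training data and continue without blocking on the result.
        // In a long-running server process, the callback merely logs the outcome.
        client.sendAsync(trainRequest, HttpResponse.BodyHandlers.discarding())
                .thenAccept(resp -> System.out.println("Training request returned " + resp.statusCode()));
    }
}
```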

reckart commented 6 years ago

@Rentier For the actual training, we can send all the training data and do not have to wait for the training to complete. It would be good, though, to be able to figure out whether the training was successful. Also, during the evaluation phase, we need to be able to pause the evaluation while the training is running, so we can evaluate the trained model on the test data. Any suggestions?

Horsmann commented 6 years ago

whether the training was successful

The server returns HttpStatus.OK in case nothing unusual happened, i.e. no errors; otherwise HttpStatus.INTERNAL_SERVER_ERROR is returned in case an exception is thrown. Thus, unless you get an OK, you can assume training failed.

Btw, I released DKPro TC 1.0.2 and changed the recommender code to use the new version. It's considerably faster now, even with character n-grams.

reckart commented 6 years ago

The server returns HttpStatus.OK in case nothing unusual happened, i.e. no errors; otherwise HttpStatus.INTERNAL_SERVER_ERROR is returned in case an exception is thrown. Thus, unless you get an OK, you can assume training failed.

That works for synchronous HTTP calls, but not for asynchronous processing. In the async case, the HTTP call would consume all the training data, put it into a local spooler and immediately return while training runs from the data in the spooler. In that case, the HTTP call would only be able to immediately report an error if there was a problem during spooling, but not if there is a problem during the actual training.
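A rough sketch of that spooler pattern on the service side, with a hypothetical handleTrainRequest entry point standing in for the real HTTP handler:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TrainingSpoolerSketch
{
    private final BlockingQueue<String> spool = new LinkedBlockingQueue<>();

    // Called from the HTTP handler: accept the training data and return immediately.
    // At this point only spooling errors can be reported, not training errors.
    public int handleTrainRequest(String trainingJson)
    {
        boolean accepted = spool.offer(trainingJson);
        return accepted ? 202 : 500; // 202 Accepted - the actual training happens later
    }

    // Background worker that trains from the spool; failures here are only visible
    // through logs or a separate status endpoint, not the original HTTP response.
    public void startWorker()
    {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    train(spool.take());
                }
                catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                catch (Exception e) {
                    // Training failed after the HTTP call already returned.
                    e.printStackTrace();
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void train(String trainingJson)
    {
        // Placeholder for the actual DKPro TC training run.
    }
}
```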

jcklie commented 6 years ago

The basic integration works now and the predictions look good. Training is much faster and did not throw errors, but I have to observe it a bit more.

reckart commented 5 years ago

@Rentier @Horsmann

Closing this for the time being as it seems to be generally resolved. Please reopen or open a new issue if required.