kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

header model training slow #431

Open de-code opened 5 years ago

de-code commented 5 years ago

Hi Patrice,

This is what I raised on Mattermost, but I thought it would be good to open an issue to keep the information together.

Since GROBID 0.5.4 training should use all of the available threads, and we have done some tests with VMs having multiple CPUs. The logging confirmed that it was using the expected number of threads.

With 64 CPUs it was only using between 20 and 30 of them, so we ran it again with 32 CPUs. It was then using almost all of the CPUs, but the overall run-time wasn't as fast as expected. On a single CPU it takes about 24h to train the header model; with 32 available CPUs, it was down to around 16h.

You could replicate it by building GROBID etc. or using the Docker container:

docker run --rm -it \
    elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model.sh \
        --use-default-dataset

Or if you were using Kubernetes:

kubectl run --rm --attach --restart=Never --generator=run-pod/v1 \
    --image=elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model -- \
    train-header-model.sh \
        --use-default-dataset

There is a slight overhead with --use-default-dataset because it copies the default dataset, but it doesn't take that long. (The main use case is to use a custom dataset, alone or together with the default dataset.)
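
As a side note, a possible way to pin the number of CPUs for this kind of scaling comparison is the standard docker run --cpus option (a sketch only, using the same image and script as above):

# limit the container to 32 CPUs for the scaling comparison
docker run --rm -it --cpus=32 \
    elifesciences/sciencebeam-trainer-grobid_unstable:0.5.4 \
    train-header-model.sh \
        --use-default-dataset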

kermitt2 commented 5 years ago

Hi Daniel,

Sorry for the late feedback. I did some tests this week with GROBID 0.5.5 (no change regarding training and the header model as compared to 0.5.4), and I get a similar time for training the model with 24 CPUs (15 hours 52 min, 26 s per epoch). See the numbers below; runtimes depend on the machine, but they are consistent. As you can see, I have a much longer training time with 1 CPU. I am surprised that you could get only 24 hours with 1 CPU. I tested the one-CPU case on different servers and it always took a couple of times more than 24 hours.


* server traces4, 24 CPU, 64GB

| number of CPUs | 1 | 2 | 4 | 8 | 12 | 16 | 24 |
|---|---|---|---|---|---|---|---|
| seconds/epoch | 153 | 86 | 55 | 37 | 32 | 28.5 | 26 |
| total minutes | - | - | - | 1329 | 1229 | - | 952 |

* desktop work, 8 CPU, 16GB

| number of CPUs | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| seconds/epoch | 107 | 58 | 38 | 32 |

* server traces5, 16 CPU, 16GB

| number of CPUs | 4 | 16 |
|---|---|---|
| seconds/epoch | 54 | 33 |
| total minutes | - | 1333 |
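
As a quick sanity check on those totals (assuming the fixed 2000-epoch budget discussed further below), seconds per epoch roughly predicts the measured run time, plus some overhead for TEI parsing and data loading:

# pure training time ≈ seconds/epoch × 2000 epochs (overhead not included)
echo "24 CPU: $(( 26  * 2000 / 60 )) min"   # ≈ 866 min, versus 952 min measured
echo " 1 CPU: $(( 153 * 2000 / 60 )) min"   # ≈ 5100 min ≈ 85 h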

Following my tests, training time scales down relatively well when adding more CPUs: for instance, on the same server, from about 86 hours with 1 CPU down to the 16-hour plateau.

[chart: training-header-runtime, average training time in seconds per epoch vs. number of CPUs]

So I think, overall, it behaves okay, although from what I have observed, increasing the number of CPUs has a much better runtime impact for smaller models. I guess the computing parts that are not parallelized grow with the number of features, so larger models benefit less from adding more CPUs.

I've observed that the available RAM has an impact, as well as the machine itself (my desktop has a very good CPU! but I am broke now).

So, can we improve training time for this model?

This model is very big (81M features) and there are certainly possibilities to prune the number of features significantly while keeping similar results (reduce the vocabulary, simplify the feature template a bit, remove useless annotated examples).

Another obvious approach is to work on the stopping criteria. I fixed them arbitrarily high (2000 epochs) so as not to worry about them, but we could optimize them according to the training data (simply using 1000 epochs might not impact performance anymore, but this is to be evaluated).

We could also explore some optimizations in Wapiti, e.g. for high numbers of CPUs (since going beyond 20-25 CPUs has no impact anymore).
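
For illustration, with the standalone Wapiti command line the relevant knobs look roughly like this (a sketch only: GROBID drives Wapiti through its own trainer code, so the actual settings live there, and the file names below are made up):

# --nthread: worker threads; --maxiter: hard cap on the number of epochs;
# --stopwin/--stopeps: stop earlier once the objective stops improving
wapiti train --algo l-bfgs --pattern header.template \
    --nthread 24 --maxiter 1000 --stopwin 15 --stopeps 0.00001 \
    header.train header.wapiti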

kermitt2 commented 5 years ago

As complementary information, I also experimented with a BidLSTM-CRF architecture using GloVe embeddings for the header model. One issue here is that the input sequence to be labelled is very large, because it's the whole header (more than 1000 tokens is usual, and it can go up to 3000 tokens), so the batch size for training has to be decreased a lot and we need plenty of GPU memory to avoid some labels not occurring at all in a training batch.

Training is considerably faster: less than 2 hours with an Nvidia GTX 1080 if I remember well. But accuracy is also much lower for the moment; we see a loss of more than 10 points of f-score on average over the fields as compared to CRF + layout features. BERT fine-tuning (using SciBERT in our case) would be even significantly faster than that, a few minutes (!).

NN models are also smaller: 1.6MB with DeLFT BidLSTM-CRF versus 35MB for CRF (though DeLFT is particularly good for model size); BERT fine-tuned models, however, are ridiculously big, 1.3GB.

Overall I think using DL models is the way to go in the long term to dramatically reduce training time, but there are still design and technical issues to address, in particular how to apply this kind of approach to document-level input (the header/fulltext models) and how to incorporate layout information efficiently (just concatenating those features does not appear to work from my first experiments; I've not tested multi-channel architectures yet).

de-code commented 5 years ago

Thank you for looking into it Patrice.

It is interesting that you are seeing a much higher training time with a single CPU. I could test it again. According to your chart it's 26 hours with 24 CPUs, or am I reading that wrong? Something seems to be different.

In between epochs there seems to be a short period where only one CPU is used. That probably explains why it plateaus. Maybe that could be improved, but maybe the effort is better spent on the DL model.

Maybe you could talk me through it or point me to the documentation so I can experiment with a DL model myself? (Although from our last conversation you suggested that we need the header model to use the Clusteror to have a chance of showing better performance.)

kermitt2 commented 5 years ago

In the chart, the Y-axis gives the average training time in seconds per epoch; all the trainings run for 2000 epochs. With 24 CPUs, one epoch is around 26 s, so the pure training part is 26 s × 2000 epochs = 52,000 s (about 14.5 hours), and the total takes around 16 hours (there is some extra time for parsing the TEI, loading the data, init, ...).

I agree with that explanation for the plateau; it would indeed be the place to look at in Wapiti!

About the DL stuff, the documentation is a bit short -> https://grobid.readthedocs.io/en/latest/Deep-Learning-models/

Luca started to extend the mechanism to macOS and to the case where DeLFT is installed in a virtual env (branch jep_macOs). However, while the training runtime is very good, results are currently much lower than with Wapiti:

https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/PMC_sample_1943.results.grobid-0.5.4-DeLFT-29.12.2018
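
For anyone wanting to try that route, the general shape of the setup is the usual Python virtual env around a DeLFT checkout (a rough sketch; the Deep-Learning-models page above is the authoritative reference, and the paths here are illustrative):

git clone https://github.com/kermitt2/delft
cd delft
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt   # DeLFT dependencies (TensorFlow/Keras, etc.)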

jribault commented 5 years ago

Hi,

Could you take a look at your CPU usage (using top, then pressing 1 to see each core's activity)? My model and data are not so big, as training takes around 5h to complete. On my 16 cores, only 3 are used. Do you experience the same thing? As I understand it, the training is done by Wapiti, so it's probably a Wapiti optimization issue, but I just want to make sure that I'm not the only one. Also, maybe the model has to be more complex or have more data in order to use more cores?
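
In case it helps with comparing setups, per-core usage can also be logged non-interactively while the training runs (a sketch using the sysstat tools; the process filter is just an example, since the training runs inside the GROBID JVM):

# per-core utilisation, refreshed every second (like top + "1", but scriptable)
mpstat -P ALL 1
# or log the CPU usage of the training process itself
pidstat -u -p "$(pgrep -f grobid | head -n1)" 1 > cpu-usage.log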

kermitt2 commented 5 years ago

Hi @jribault! The efficiency of core usage depends on the number of examples too. For instance, when training the header or body sections, if there are not enough examples to create minimal partitions of data to be distributed, Wapiti will use fewer cores than the number indicated.

In the above numbers, I took care to check that all the cores were actually used; the number of examples is large (more than 2000). For instance, when training the fulltext model, I am not able to use many cores because there are too few examples.

jribault commented 5 years ago

👍