Hello!
Yes this is correct, thanks for pointing out this issue!
At some early point I used this as a fixed `max_sequence_length` for every batch, but then switched to a dynamic sequence length, which is the max of the sequence lengths in the current batch (on average this dynamic length is smaller than an arbitrary max length, so we save GPU memory).
I kept it to have the possibility to force a maximum length to truncate the input sequence as a safety limit - however I forgot to add it!
There is actually a default max window in `preprocess.py` fixed at `maxlen=300` (in the `to_vector*` functions) which should be set to this `max_sequence_length` - but this default value is not used; we pass the dynamically set max sequence length for each batch.
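For illustration, a minimal sketch of this dynamic per-batch length, including the missing safety cap discussed here (the `pad_batch` helper and the `'<PAD>'` token are hypothetical, not the actual DeLFT code):

```python
def pad_batch(batch_tokens, max_sequence_length=None):
    # dynamic length: the longest sequence in the current batch
    batch_max = max(len(tokens) for tokens in batch_tokens)
    # optional safety cap (the part that was forgotten)
    if max_sequence_length is not None:
        batch_max = min(batch_max, max_sequence_length)
    padded = []
    for tokens in batch_tokens:
        tokens = tokens[:batch_max]  # truncate sequences longer than the cap
        padded.append(tokens + ['<PAD>'] * (batch_max - len(tokens)))
    return padded

# the batch is padded to length 4 (the batch maximum), not to a global maximum
print(pad_batch([['a', 'b'], ['c', 'd', 'e', 'f']]))
```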
As you can see, it's not related to a sliding window as we have in CRF. With CRF, we always have a contextual window centered on the current "target" token to be labelled, and the CRF template is used to determine the size of the window.
With the DL approach, we have a prediction on a complete sequence, without a sliding window, and the whole sequence is involved when training the weights or outputting something. For very large input sequences like in the header model, it's of course an issue (the size of the input could be more than 1000 tokens in the worst cases) - but it's potentially also where it is interesting, because the "recursive" aspect of the RNN makes the backpropagation potentially impact the complete sequence.
It would indeed be interesting to compare the "traditional" global full-sequence network approach and a local sliding-window network, though I am not sure how to do it. It would require some review to see how it was approached by other people.
> Hello!
> Yes this is correct, thanks for pointing out this issue!
> At some early point I used this as a fixed `max_sequence_length` for every batch, but then switched to a dynamic sequence length, which is the max of the sequence lengths in the current batch (on average this dynamic length is smaller than an arbitrary max length, so we save GPU memory).
> [...]
> There is actually a default max window in `preprocess.py` fixed at `maxlen=300` (in the `to_vector*` functions) which should be set to this `max_sequence_length` - but this default value is not used; we pass the dynamically set max sequence length for each batch.
I'm trying to understand the principle when the `max_length` is dynamically calculated. Is this in `data_generator.py:78`? It seems to me that it's adapted at every iteration, but only if the new sequence is bigger. Is this normal?
`max_length` is re-calculated at every batch. `max_sequence_length` should be applied to truncate the sequence when it is too large (instead of using the fixed `maxlen=300` as it is now).
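For instance, something like this (a sketch using Keras' generic `pad_sequences` helper for illustration only, not the actual DeLFT `to_vector*` code):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_sequence_length = 300  # would come from the model config rather than being hard-coded

sequences = [[1, 2, 3], list(range(500))]
# pad the short sequences and truncate anything longer than max_sequence_length
batch = pad_sequences(sequences, maxlen=max_sequence_length,
                      padding='post', truncating='post')
print(batch.shape)  # (2, 300)
```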
Another detail I note so as not to forget: the text classification part still uses the name `maxlen`, and that should be updated everywhere to `max_sequence_length`.
> `max_length` is re-calculated at every batch. `max_sequence_length` should be applied to truncate the sequence when it is too large (instead of using the fixed `maxlen=300` as it is now).

ah yes, sorry I overlooked it.

> Another detail I note so as not to forget: the text classification part still uses the name `maxlen`, and that should be updated everywhere to `max_sequence_length`.
Indeed
it will be updated in the branch `bert-sequence-labeling`, which adds the BERT architecture for sequence labelling (it has already been available for classification for a while), so you can ignore the `max_sequence_length`/`maxlen` issue for the moment.
> it will be updated in the branch `bert-sequence-labeling`, which adds the BERT architecture for sequence labelling (it has already been available for classification for a while), so you can ignore the `max_sequence_length`/`maxlen` issue for the moment.

OK. No problem. 🙂
Is there an issue / a list of things to be done in the bert branch?
> Is there an issue / a list of things to be done in the bert branch?

Nope, the branch should be complete - so it also fixes this issue (I've tested it with various `max_sequence_length` values), issue #21 and issue #61 as a bonus :)
Fixed with the merging of branch `bert-sequence-labeling`
Just looking at it. I can see there is a change in `delft.sequenceLabelling.DataGenerator` for when `tokenize` is `True`. But if I am not mistaken, `tokenize` may only be set to `True` by the Tagger. There is also something specifically for the BERT_Sequence model.
It would appear that it otherwise is not truncating already tokenized text, but it is possible that I am missing something?
In my copy of the DataGenerator I have now also applied the `max_sequence_length` to already tokenized text (see PR). Although this could probably be applied earlier on, e.g. when reading the input. But then we would need to be careful that it doesn't affect validation.
I just reopened it since we are discussing it. Otherwise it's hard to find it again.
What about rewriting it like this?
```python
# tokenize texts in self.x if not already done
max_length_x = 0
x_tokenized = []
for i in range(0, max_iter):
    if self.tokenize:
        tokens = tokenizeAndFilterSimple(sub_x[i])
    else:
        tokens = sub_x[i]
    if len(tokens) > self.max_sequence_length:
        max_length_x = self.max_sequence_length
        # truncation of sequence at max_sequence_length
        tokens = tokens[:self.max_sequence_length]
    elif len(tokens) > max_length_x:
        max_length_x = len(tokens)
    x_tokenized.append(tokens)
```
You will need to calculate `max_length_x` first. Also don't forget `batch_y` (and soon `batch_features`), as I did.
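Something along these lines, for example (a sketch only, with hypothetical names):

```python
def truncate_batch(batch_x, batch_y, batch_features, max_length):
    # keep tokens, labels and token-level features aligned after truncation
    batch_x = [tokens[:max_length] for tokens in batch_x]
    batch_y = [labels[:max_length] for labels in batch_y]
    batch_features = [features[:max_length] for features in batch_features]
    return batch_x, batch_y, batch_features

# usage
x = [['Deep', 'learning', 'for', 'sequences']]
y = [['O', 'O', 'O', 'O']]
f = [[[0], [0], [0], [0]]]
print(truncate_batch(x, y, f, 2))
```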
This is what I ended up with so far: https://github.com/elifesciences/sciencebeam-trainer-delft/blob/c31f97433243a2b0a66671c0dd3e652dcd306362/sciencebeam_trainer_delft/sequence_labelling/data_generator.py#L102-L118
Training is now significantly faster and there are no more memory issues, although I haven't evaluated it yet.
Oh... I forgot to use the brain I'm afraid... 😓
Let's discuss in the PR #89
Ah yes, thank you very much @de-code, I forgot to apply `max_sequence_length` in the training case (that looks very stupid).
No worries. Actually I had to ensure that the `max_sequence_length` from the trained model is not used by the tagger. Otherwise GROBID isn't happy.
> No worries. Actually I had to ensure that the `max_sequence_length` from the trained model is not used by the tagger. Otherwise GROBID isn't happy.

What do you mean? That you truncate only when training?
Possibly the header sequence input to be tagged is too large, and when truncated it results in some alignment problems in GROBID (due to the missing truncated end not being present in the labeled data, but expected in the tokenizations on the GROBID side)?
> What do you mean? That you truncate only when training?

Yes, for now. Due to an issue in the segmentation model, I am getting up to 10k tokens in the header training data. Although I think I also had memory issues with GROBID's default training data. I will be looking into sliding windows.
For prediction I didn't experience memory issues, which may be because it doesn't need to allocate all of the shapes. For prediction via GROBID I am using the CPU only.

> Possibly the header sequence input to be tagged is too large, and when truncated it results in some alignment problems in GROBID (due to the missing truncated end not being present in the labeled data, but expected in the tokenizations on the GROBID side)?

Yes, it was expecting "more" and is throwing an IndexOutOfBoundsException (see https://github.com/elifesciences/sciencebeam-trainer-delft/issues/154).
I will experiment with sliding windows as one of my next steps.
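As a rough illustration of the sliding-window idea (a sketch only; `tag_window` is a hypothetical stand-in for the model call, and the overlap handling here just keeps the first prediction):

```python
def tag_with_sliding_window(tokens, tag_window, window_size=300, stride=200):
    # tag_window(list_of_tokens) -> list_of_labels for up to window_size tokens
    labels = [None] * len(tokens)
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + window_size]
        window_labels = tag_window(window)
        for i, label in enumerate(window_labels):
            # keep the first prediction for positions covered by overlapping windows
            if labels[start + i] is None:
                labels[start + i] = label
        if start + window_size >= len(tokens):
            break
    return labels

# usage with a dummy tagger that labels everything 'O'
print(tag_with_sliding_window(list('abcdefghij'), lambda w: ['O'] * len(w),
                              window_size=4, stride=2))
```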
If I'm not wrong this can be closed, unless there are some more tests to do, IMHO
> If I'm not wrong this can be closed, unless there are some more tests to do, IMHO

Maybe the one issue that could be considered part of this is that the maximum length used for training should not necessarily be used for tagging (I believe that is how it is currently implemented). That would fail GROBID.
Yes, I just had this problem: if the sequences are truncated in the tagging, then GROBID will crash with IndexOutOfBounds. The fix is in #97.
We need to fix it in Grobid then, I think. It's normal to truncate when tagging, at most at the `max_sequence_length` used during the training, or to truncate less. It's mandatory with BERT.
So Grobid has to avoid looking beyond the `max_sequence_length` of the model it uses when the tagged results come back.
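One way a client could guard against this (a Python sketch for illustration only; the `'O'` default label is an assumption, not what GROBID actually does):

```python
def align_labels(tokens, predicted_labels, default_label='O'):
    # the model may return fewer labels than tokens when the input was truncated
    # at max_sequence_length; pad the tail so the client never indexes past the
    # labels that actually came back
    missing = len(tokens) - len(predicted_labels)
    return list(predicted_labels) + [default_label] * max(0, missing)

print(align_labels(['a', 'b', 'c'], ['B-<title>', 'I-<title>']))
# ['B-<title>', 'I-<title>', 'O']
```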
Personally I would think that `max_sequence_length` is more a technical aspect of the model, and it is the model's responsibility to provide labels for all of the data.
Additionally, the `max_sequence_length` used for training doesn't necessarily need to be the same as for prediction.
I would also be careful about this affecting the evaluation in a "positive" way, i.e. a model with `max_sequence_length` set to 10 shouldn't be evaluated as better than one where it is set to 1000 (just because the problem became easier).
There are a few options:

1. Not applying `max_sequence_length` when tagging: this is not ideal, but from the practical point of view the memory requirements at prediction seem to be much lower, which allows for a larger `max_sequence_length`. (Perhaps with an optional separate `max_sequence_length` configuration for tagging, but it would come with similar issues as using the one from training.)

[...]

@de-code I would say yes and no :)
For something like Grobid, where we really abstract the process of sequence labelling, I think we indeed don't want to deal with the issue of the max sequence length (which does not exist for graphical models like CRF). So we expect to have labels, and it's up to the underlying sequence labelling "engine" to ensure that.
However, as in practice with DL we cannot ensure that currently, from the client/Grobid point of view it's also normal to benchmark the algorithms as they are for the different tasks, and if a task uses long sequences and the results are crap, it's a fair evaluation as compared to other methods which have no flaw with long sequences. I think that's what you are saying: not to influence the benchmark in a positive way by only considering small-sequence tasks?
For DeLFT itself, I think the problem of max sequence length is so important and constraining with Deep Learning that it has to be made visible to the users. The tasks have to be adapted according to the limits of the method, given that we cannot easily do the contrary right now. Even with sliding windows, usage of overlapping sequences or some sort of truncated backpropagation, accuracy and performance will be impacted.
About the options, I have doubts about the first one. Except for BERT models, for prediction we can for sure leave the sequences larger than the `max_sequence_length` of the training and diminish the size of the batch (we can do this already in DeLFT; afaik the size of the batch has an impact on training, but not on prediction). But as the model has learnt nothing about sequences beyond `max_sequence_length`, we can assume that the tagging beyond it will be super crap.
The second option is what is done with Grobid now. I think it's better to make the problem visible to the user in DeLFT, as mentioned above.
The third option is going to be very interesting to explore, but I am a bit worried about the impact on accuracy and performance. Some of it might be introduced in DeLFT as distinct architectures, especially for BERT for instance.
Just some early results which are related to the discussion. This is comparing the Wapiti vs the DL segmentation model on my auto-annotated bioRxiv dataset (trained on ~1619 documents, validated on 774, end-to-end validated on 100).
Wapiti:

```
Evaluation:
    f1 (micro): 84.27
                  precision    recall  f1-score   support

          <body>     0.8657    0.8732    0.8694     79969
        <header>     0.6856    0.5888    0.6335      5973
    <references>     0.7604    0.7574    0.7589     17376
          <page>     0.8833    0.8798    0.8815     12014

all (micro avg.)     0.8437    0.8417    0.8427    115332
```
DL (my `CustomBidLSTM_CRF` with features, trained with a max sequence length of 3000):
```
Evaluation:
    f1 (micro): 88.35
                  precision    recall  f1-score   support

          <page>     0.8110    0.9292    0.8661     11824
    <references>     0.7954    0.8401    0.8171     16897
        <header>     0.8187    0.7166    0.7642      5973
          <body>     0.8961    0.9222    0.9089     79092

all (micro avg.)     0.8677    0.8999    0.8835    113786
```
Page numbers got worse, but there are better results on the others - even references, which I haven't auto-annotated yet but left as what GROBID had already generated using the default model.
end-to-end evaluation:
The author surnames got a bit worse, which seems to be due to line numbers. I actually started with 1800 training documents but seem to have lost a few during the conversion to DeLFT (maybe due to the way I had annotated the line numbers). But that's another topic; just mentioning it to explain the drop.
This seems to suggest that if enough documents contain the relevant fields within the maximum sequence length (here 3000), then we could still get better results even before having implemented sliding windows for training. I also didn't experience any memory issues evaluating or tagging on CPU. (The longest validation sequence is 169257 tokens, whereas the median is around 1000.)
EDIT: the majority of the author surname drop seems to be due to authors being extracted from the reference section of one of the documents (out of 100). The current evaluation evaluates authors individually, and the reference section provides a lot of incorrectly labelled authors.
> This seems to suggest that if enough documents contain the relevant fields within the maximum sequence length (here 3000), then we could still get better results even before having implemented sliding windows for training.

I think that the issue comes from fields that have more than the maximum sequence length.

> EDIT: the majority of the author surname drop seems to be due to authors being extracted from the reference section of one of the documents (out of 100). The current evaluation evaluates authors individually, and the reference section provides a lot of incorrectly labelled authors.

This is another model that must take as input a sequence longer than the limit, resulting in possibly bad or incomplete results.
From the DeLFT point of view, IMHO this issue can be closed. As a short-term plan we had better migrate to Wapiti all the models that exceed such a limit (we might get really bad results otherwise) (https://github.com/kermitt2/grobid/pull/559) and investigate the sliding window more in depth in #90.
> > This seems to suggest that if enough documents contain the relevant fields within the maximum sequence length (here 3000), then we could still get better results even before having implemented sliding windows for training.
>
> I think that the issue comes from fields that have more than the maximum sequence length.

Which fields are you thinking of? I have a hunch that it is coming mainly from spacing issues (https://github.com/kermitt2/grobid/issues/564) but I haven't investigated it yet.

> > EDIT: the majority of the author surname drop seems to be due to authors being extracted from the reference section of one of the documents (out of 100). The current evaluation evaluates authors individually, and the reference section provides a lot of incorrectly labelled authors.
>
> This is another model that must take as input a sequence longer than the limit, resulting in possibly bad or incomplete results.
> From the DeLFT point of view, IMHO this issue can be closed. As a short-term plan we had better migrate to Wapiti all the models that exceed such a limit (we might get really bad results otherwise) (kermitt2/grobid#559) and investigate the sliding window more in depth in #90.

The issue with the bad reference extraction seems to be mainly due to the line numbers (it went away after I removed them in GROBID), and probably not helped by the current segmentation DeLFT model only using the first token (#99). In my opinion the maximum training sequence length is not the same as what should be used for prediction, because documents are of variable lengths (and we are using an RNN). Using a maximum length that is equal to the median length will still have seen a whole document about half of the time. I chose something that was three times the median. I guess evaluation will be useful. To summarise, my preference would be to leave the option open to use a different maximum length, if any.
But I guess the issue can be closed, as the maximum sequence length is now used.
> > I think that the issue comes from fields that have more than the maximum sequence length.
>
> Which fields are you thinking of? I have a hunch that it is coming mainly from spacing issues (kermitt2/grobid#564) but I haven't investigated it yet.

Sorry, I wrote `field`, but I meant `model`...
I think this is obsolete and can be closed
yes
Hi @kermitt2
I was just considering whether we need sliding windows to avoid having to use a really large `max_sequence_length`. But then I realised that `max_sequence_length` doesn't actually seem to be used. It's passed to the `DataGenerator`, which doesn't seem to use it. Instead it simply pads the batch to whatever the maximum length is within the batch.