Hello!
Yes this is correct, thanks for pointing out this issue!
At some early point I used this as a fixed `max_sequence_length` for every batch, but then switched to a dynamic sequence length, which is the max of the sequence lengths in the current batch (on average this dynamic length is smaller than an arbitrary max length, so we save GPU memory).
I kept it to have the possibility to force a maximum length to truncate the input sequence as a safety limit - however I forgot to add it!
There is actually a default max window in `preprocess.py` fixed at `maxlen=300` (in the `to_vector*` functions) which should be set to this `max_sequence_length` - but this default value is not used; we pass the dynamically set max sequence length for each batch.
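For illustration, a minimal sketch of this dynamic per-batch length, including the missing safety cap discussed here (the `pad_batch` helper and the `'<PAD>'` token are hypothetical, not the actual DeLFT code):

```python
def pad_batch(batch_tokens, max_sequence_length=None):
    # dynamic length: the longest sequence in the current batch
    batch_max = max(len(tokens) for tokens in batch_tokens)
    # optional safety cap (the part that was forgotten)
    if max_sequence_length is not None:
        batch_max = min(batch_max, max_sequence_length)
    padded = []
    for tokens in batch_tokens:
        tokens = tokens[:batch_max]  # truncate sequences longer than the cap
        padded.append(tokens + ['<PAD>'] * (batch_max - len(tokens)))
    return padded

# the batch is padded to length 4 (the batch maximum), not to a global maximum
print(pad_batch([['a', 'b'], ['c', 'd', 'e', 'f']]))
```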
As you can see, it's not related to a sliding window as we have in CRF. With CRF, we always have a contextual window centered on the current "target" token to be labelled, and the CRF template is used to determine the size of the window.
With the DL approach, we have a prediction on a complete sequence, without a sliding window, and the whole sequence is involved when training the weights or outputting something. For very large input sequences like in the header model, it's of course an issue (the size of the input could be more than 1000 tokens in the worst cases) - but it's potentially also where it is interesting, because the "recursive" aspect of the RNN makes the backpropagation potentially impact the complete sequence.
It would indeed be interesting to compare the "traditional" global full-sequence network approach and a local sliding-window network, though I am not sure how to do it. It would require some review to see how it was approached by other people.
> Hello!
> Yes this is correct, thanks for pointing out this issue!
> At some early point I used this as a fixed `max_sequence_length` for every batch, but then switched to a dynamic sequence length, which is the max of the sequence lengths in the current batch (on average this dynamic length is smaller than an arbitrary max length, so we save GPU memory).
> [...]
> There is actually a default max window in `preprocess.py` fixed at `maxlen=300` (in the `to_vector*` functions) which should be set to this `max_sequence_length` - but this default value is not used; we pass the dynamically set max sequence length for each batch.
I'm trying to understand the principle when the `max_length` is dynamically calculated. Is this in `data_generator.py:78`? It seems to me that it's adapted at every iteration, but only if the new sequence is bigger. Is this normal?
`max_length` is re-calculated at every batch. `max_sequence_length` should be applied to truncate the sequence when it is too large (instead of using the fixed `maxlen=300` as it is now).
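For instance, something like this (a sketch using Keras' generic `pad_sequences` helper for illustration only, not the actual DeLFT `to_vector*` code):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_sequence_length = 300  # would come from the model config rather than being hard-coded

sequences = [[1, 2, 3], list(range(500))]
# pad the short sequences and truncate anything longer than max_sequence_length
batch = pad_sequences(sequences, maxlen=max_sequence_length,
                      padding='post', truncating='post')
print(batch.shape)  # (2, 300)
```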
Another detail I note so as not to forget: the text classification part still uses the name `maxlen`, and that should be updated everywhere to `max_sequence_length`.
> `max_length` is re-calculated at every batch. `max_sequence_length` should be applied to truncate the sequence when it is too large (instead of using the fixed `maxlen=300` as it is now).

ah yes, sorry I overlooked it.

> Another detail I note so as not to forget: the text classification part still uses the name `maxlen`, and that should be updated everywhere to `max_sequence_length`.
Indeed
it will be updated in the branch `bert-sequence-labeling`, which adds the BERT architecture for sequence labelling (it has already been available for classification for a while), so you can ignore the `max_sequence_length`/`maxlen` issue for the moment.
> it will be updated in the branch `bert-sequence-labeling`, which adds the BERT architecture for sequence labelling (it has already been available for classification for a while), so you can ignore the `max_sequence_length`/`maxlen` issue for the moment.

OK. No problem. 🙂
Is there an issue / a list of things to be done in the bert branch?
> Is there an issue / a list of things to be done in the bert branch?

Nope, the branch should be complete - so it also fixes this issue (I've tested it with various `max_sequence_length` values), issue #21 and issue #61 as a bonus :)
Fixed with the merging of branch `bert-sequence-labeling`
Just looking at it. I can see there is a change in `delft.sequenceLabelling.DataGenerator` for when `tokenize` is `True`. But if I am not mistaken, `tokenize` may only be set to `True` by the Tagger. There is also something specifically for the BERT_Sequence model.
It would appear that it otherwise is not truncating already tokenized text, but it is possible that I am missing something?
In my copy of the DataGenerator I have now also applied the `max_sequence_length` to already tokenized text (see PR). Although this could probably be applied earlier on, e.g. when reading the input. But then we would need to be careful that it doesn't affect validation.
I just reopened it since we are discussing it. Otherwise it's hard to find it again.
What about rewriting it like this?
```python
# tokenize texts in self.x if not already done
max_length_x = 0
x_tokenized = []
for i in range(0, max_iter):
    if self.tokenize:
        tokens = tokenizeAndFilterSimple(sub_x[i])
    else:
        tokens = sub_x[i]
    if len(tokens) > self.max_sequence_length:
        max_length_x = self.max_sequence_length
        # truncation of sequence at max_sequence_length
        tokens = tokens[:self.max_sequence_length]
    elif len(tokens) > max_length_x:
        max_length_x = len(tokens)
    x_tokenized.append(tokens)
```
You will need to calculate `max_length_x` first. Also don't forget `batch_y` (and soon `batch_features`), as I did.
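Something along these lines, for example (a sketch only, with hypothetical names):

```python
def truncate_batch(batch_x, batch_y, batch_features, max_length):
    # keep tokens, labels and token-level features aligned after truncation
    batch_x = [tokens[:max_length] for tokens in batch_x]
    batch_y = [labels[:max_length] for labels in batch_y]
    batch_features = [features[:max_length] for features in batch_features]
    return batch_x, batch_y, batch_features

# usage
x = [['Deep', 'learning', 'for', 'sequences']]
y = [['O', 'O', 'O', 'O']]
f = [[[0], [0], [0], [0]]]
print(truncate_batch(x, y, f, 2))
```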
This is what I ended up with so far: https://github.com/elifesciences/sciencebeam-trainer-delft/blob/c31f97433243a2b0a66671c0dd3e652dcd306362/sciencebeam_trainer_delft/sequence_labelling/data_generator.py#L102-L118
Training is now significantly faster and there are no more memory issues, although I haven't evaluated it yet.
Oh... I forgot to use the brain I'm afraid... 😓
Let's discuss in the PR #89
Ah yes, thank you very much @de-code, I forgot to apply `max_sequence_length` in the training case (that looks very stupid).
No worries. Actually I had to ensure that the `max_sequence_length` from the trained model is not used by the tagger. Otherwise GROBID isn't happy.
> No worries. Actually I had to ensure that the `max_sequence_length` from the trained model is not used by the tagger. Otherwise GROBID isn't happy.

What do you mean? That you truncate only when training?
Possibly the header sequence input to be tagged is too large, and when truncated it results in some alignment problems in GROBID (due to the missing truncated end not being present in the labeled data, but expected in the tokenizations on the GROBID side)?
> What do you mean? That you truncate only when training?

Yes, for now. Due to an issue in the segmentation model, I am getting up to 10k tokens in the header training data. Although I think I also had memory issues with GROBID's default training data. I will be looking into sliding windows.
For prediction I didn't experience memory issues, which may be because it doesn't need to allocate all of the shapes. For prediction via GROBID I am using the CPU only.

> Possibly the header sequence input to be tagged is too large, and when truncated it results in some alignment problems in GROBID (due to the missing truncated end not being present in the labeled data, but expected in the tokenizations on the GROBID side)?

Yes, it was expecting "more" and is throwing an IndexOutOfBoundsException (see https://github.com/elifesciences/sciencebeam-trainer-delft/issues/154).
I will experiment with sliding windows as one of my next steps.
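As a rough illustration of the sliding-window idea (a sketch only; `tag_window` is a hypothetical stand-in for the model call, and the overlap handling here just keeps the first prediction):

```python
def tag_with_sliding_window(tokens, tag_window, window_size=300, stride=200):
    # tag_window(list_of_tokens) -> list_of_labels for up to window_size tokens
    labels = [None] * len(tokens)
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + window_size]
        window_labels = tag_window(window)
        for i, label in enumerate(window_labels):
            # keep the first prediction for positions covered by overlapping windows
            if labels[start + i] is None:
                labels[start + i] = label
        if start + window_size >= len(tokens):
            break
    return labels

# usage with a dummy tagger that labels everything 'O'
print(tag_with_sliding_window(list('abcdefghij'), lambda w: ['O'] * len(w),
                              window_size=4, stride=2))
```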
If I'm not wrong this can be closed, unless there are some more tests to do, IMHO
> If I'm not wrong this can be closed, unless there are some more tests to do, IMHO

Maybe the one issue that could be considered part of this is that the maximum length used for training should not necessarily be used for tagging (I believe that is how it is currently implemented). That would fail GROBID.
Yes, I just had this problem: if the sequences are truncated in the tagging, then GROBID will crash with IndexOutOfBounds. The fix is in #97.
We need to fix it in Grobid then, I think. It's normal to truncate when tagging, at most at the `max_sequence_length` used during the training, or to truncate less. It's mandatory with BERT.
So Grobid has to avoid looking beyond the `max_sequence_length` of the model it uses when the tagged results come back.
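One way a client could guard against this (a Python sketch for illustration only; the `'O'` default label is an assumption, not what GROBID actually does):

```python
def align_labels(tokens, predicted_labels, default_label='O'):
    # the model may return fewer labels than tokens when the input was truncated
    # at max_sequence_length; pad the tail so the client never indexes past the
    # labels that actually came back
    missing = len(tokens) - len(predicted_labels)
    return list(predicted_labels) + [default_label] * max(0, missing)

print(align_labels(['a', 'b', 'c'], ['B-<title>', 'I-<title>']))
# ['B-<title>', 'I-<title>', 'O']
```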
Personally I would think that `max_sequence_length` is more a technical aspect of the model, and it is the model's responsibility to provide labels for all of the data.
Additionally, the `max_sequence_length` used for training doesn't necessarily need to be the same as for prediction.
I would also be careful about this affecting the evaluation in a "positive" way, i.e. a model with `max_sequence_length` set to 10 shouldn't be evaluated as better than one where it is set to 1000 (just because the problem became easier).
There are a few options:

1. Not applying `max_sequence_length` when tagging: this is not ideal, but from the practical point of view the memory requirements at prediction seem to be much lower, which allows for a larger `max_sequence_length`. (Perhaps with an optional separate `max_sequence_length` configuration for tagging, but it would come with similar issues as using the one from training.)

[...]

@de-code I would say yes and no :)
For something like Grobid, where we really abstract the process of sequence labelling, I think we indeed don't want to deal with the issue of the max sequence length (which does not exist for graphical models like CRF). So we expect to have labels, and it's up to the underlying sequence labelling "engine" to ensure that.
However, as in practice with DL we cannot ensure that currently, from the client/Grobid point of view it's also normal to benchmark the algorithms as they are for the different tasks, and if a task uses long sequences and the results are crap, it's a fair evaluation as compared to other methods which have no flaw with long sequences. I think that's what you are saying: not to influence the benchmark in a positive way by only considering small-sequence tasks?
For DeLFT itself, I think the problem of max sequence length is so important and constraining with Deep Learning that it has to be made visible to the users. The tasks have to be adapted according to the limits of the method, given that we cannot easily do the contrary right now. Even with sliding windows, usage of overlapping sequences or some sort of truncated backpropagation, accuracy and performance will be impacted.
About the options, I have doubts about the first one. Except for BERT models, for prediction we can for sure leave the sequences larger than the `max_sequence_length` of the training and diminish the size of the batch (we can do this already in DeLFT; afaik the size of the batch has an impact on training, but not on prediction). But as the model has learnt nothing about sequences beyond `max_sequence_length`, we can assume that the tagging beyond it will be super crap.
The second option is what is done with Grobid now. I think it's better to make the problem visible to the user in DeLFT, as mentioned above.
The third option is going to be very interesting to explore, but I am a bit worried about the impact on accuracy and performance. Some of it might be introduced in DeLFT as distinct architectures, especially for BERT for instance.
Just some early results which are related to the discussion. This is comparing the Wapiti vs the DL segmentation model on my auto-annotated bioRxiv dataset (trained on ~1619 documents, validated on 774, end-to-end validated on 100).
Wapiti:

```
Evaluation:
    f1 (micro): 84.27
                  precision    recall  f1-score   support

          <body>     0.8657    0.8732    0.8694     79969
        <header>     0.6856    0.5888    0.6335      5973
    <references>     0.7604    0.7574    0.7589     17376
          <page>     0.8833    0.8798    0.8815     12014

all (micro avg.)     0.8437    0.8417    0.8427    115332
```
DL (my `CustomBidLSTM_CRF` with features, trained with a max sequence length of 3000):
```
Evaluation:
    f1 (micro): 88.35
                  precision    recall  f1-score   support

          <page>     0.8110    0.9292    0.8661     11824
    <references>     0.7954    0.8401    0.8171     16897
        <header>     0.8187    0.7166    0.7642      5973
          <body>     0.8961    0.9222    0.9089     79092

all (micro avg.)     0.8677    0.8999    0.8835    113786
```
Page numbers got worse, but there are better results on the others - even references, which I haven't auto-annotated yet but left as what GROBID had already generated using the default model.
end-to-end evaluation:
The author surnames got a bit worse, which seems to be due to line numbers. I actually started with 1800 training documents but seem to have lost a few during the conversion to DeLFT (maybe due to the way I had annotated the line numbers). But that's another topic; just mentioning it to explain the drop.
This seems to suggest that if enough documents contain the relevant fields within the maximum sequence length (here 3000), then we could still get better results even before having implemented sliding windows for training. I also didn't experience any memory issues evaluating or tagging on CPU. (The longest validation sequence is 169257 tokens, whereas the median is around 1000.)
EDIT: the majority of the author surname drop seems to be due to authors being extracted from the reference section of one of the documents (out of 100). The current evaluation evaluates authors individually, and the reference section provides a lot of incorrectly labelled authors.
> This seems to suggest that if enough documents contain the relevant fields within the maximum sequence length (here 3000), then we could still get better results even before having implemented sliding windows for training.

I think that the issue comes from fields that have more than the maximum sequence length.

> EDIT: the majority of the author surname drop seems to be due to authors being extracted from the reference section of one of the documents (out of 100). The current evaluation evaluates authors individually, and the reference section provides a lot of incorrectly labelled authors.

This is another model that must take as input a sequence longer than the limit, resulting in possibly bad or incomplete results.
From the DeLFT point of view, IMHO this issue can be closed. As a short-term plan we had better migrate to Wapiti all the models that exceed such a limit (we might get really bad results otherwise) (https://github.com/kermitt2/grobid/pull/559) and investigate the sliding window more in depth in #90.
> > This seems to suggest that if enough documents contain the relevant fields within the maximum sequence length (here 3000), then we could still get better results even before having implemented sliding windows for training.
>
> I think that the issue comes from fields that have more than the maximum sequence length.

Which fields are you thinking of? I have a hunch that it is coming mainly from spacing issues (https://github.com/kermitt2/grobid/issues/564) but I haven't investigated it yet.

> > EDIT: the majority of the author surname drop seems to be due to authors being extracted from the reference section of one of the documents (out of 100). The current evaluation evaluates authors individually, and the reference section provides a lot of incorrectly labelled authors.
>
> This is another model that must take as input a sequence longer than the limit, resulting in possibly bad or incomplete results.
> From the DeLFT point of view, IMHO this issue can be closed. As a short-term plan we had better migrate to Wapiti all the models that exceed such a limit (we might get really bad results otherwise) (kermitt2/grobid#559) and investigate the sliding window more in depth in #90.

The issue with the bad reference extraction seems to be mainly due to the line numbers (it went away after I removed them in GROBID), and probably not helped by the current segmentation DeLFT model only using the first token (#99). In my opinion the maximum training sequence length is not the same as what should be used for prediction, because documents are of variable lengths (and we are using an RNN). Using a maximum length that is equal to the median length will still have seen a whole document about half of the time. I chose something that was three times the median. I guess evaluation will be useful. To summarise, my preference would be to leave the option open to use a different maximum length, if any.
But I guess the issue can be closed, as the maximum sequence length is now used.
> > I think that the issue comes from fields that have more than the maximum sequence length.
>
> Which fields are you thinking of? I have a hunch that it is coming mainly from spacing issues (kermitt2/grobid#564) but I haven't investigated it yet.

Sorry, I wrote `field`, but I meant `model`...
I think this is obsolete and can be closed
yes
Hi @kermitt2
I was just considering whether we need sliding windows to avoid having to use a really large `max_sequence_length`. But then I realised that `max_sequence_length` doesn't actually seem to be used. It's passed to the `DataGenerator`, which doesn't seem to use it. Instead it simply pads the batch to whatever the maximum length is within the batch.