mim-solutions / bert_for_longer_texts

BERT classification model for processing texts longer than 512 tokens. Text is first divided into smaller chunks and after feeding them to BERT, intermediate results are pooled. The implementation allows fine-tuning.
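For context, a minimal sketch of the chunk-and-pool idea described above (not the library's actual implementation; the model name, window sizes, and mean pooling are illustrative choices):

```python
# Sketch: split a long token sequence into overlapping <=512-token windows,
# classify each window with BERT, then mean-pool the per-window probabilities.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

def classify_long_text(text: str, chunk_size: int = 510, stride: int = 255) -> torch.Tensor:
    # Tokenize the whole text without truncation; this step is what triggers
    # the "longer than the specified maximum sequence length" warning.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    chunk_probs = []
    for start in range(0, max(len(ids) - stride, 1), stride):
        chunk = [cls_id] + ids[start:start + chunk_size] + [sep_id]
        with torch.no_grad():
            logits = model(torch.tensor([chunk])).logits
        chunk_probs.append(torch.softmax(logits, dim=-1))
    # Pool the intermediate per-chunk results into one prediction for the text.
    return torch.cat(chunk_probs).mean(dim=0)
```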

text length warning #25

Closed cwoonb closed 10 months ago

cwoonb commented 10 months ago

I get this warning:

> Token indices sequence length is longer than the specified maximum sequence length for this model (2268 > 512). Running this sequence through the model will result in indexing errors

Can I ignore it? Why does it appear?

cwoonb commented 10 months ago

Ah, one more thing about `torch.save(model.state_dict(), "outputs/")`: I don't have to use that exact code, but can you tell me how to save and load the model? I searched on Google and it said I had to inherit from `torch.nn.Module`, but that didn't work.

mwachnicki commented 10 months ago

Just as in our example, the warning is expected and can safely be ignored: https://github.com/mim-solutions/bert_for_longer_texts/blob/main/notebooks/example_model_with_pooling_fit_predict.ipynb
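The warning is emitted by the Hugging Face tokenizer whenever it encodes a sequence longer than the model's 512-token limit without truncation. Since the library splits the tokenized text into chunks of at most 512 tokens before feeding it to BERT, the over-long sequence never reaches the model, which is why the warning is harmless here. A minimal way to reproduce the warning itself (the model name is illustrative):

```python
from transformers import AutoTokenizer

# Encoding a text longer than model_max_length (512 for BERT) without truncation
# logs: "Token indices sequence length is longer than the specified maximum ..."
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer("word " * 3000, truncation=False)["input_ids"]
print(len(ids))  # well over 512; the warning was logged, but nothing has failed
```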

The model class has its own save and load methods implemented. See the example here: https://github.com/mim-solutions/bert_for_longer_texts/blob/main/tests/test_bert_with_pooling.py#L59
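For completeness, if you do want to save the underlying weights manually, the standard PyTorch pattern is sketched below; note that `torch.save` expects a file path, not a bare directory like `"outputs/"`. The class's own save/load methods linked above remain the recommended route, and their exact signatures are in that test file (the model used below is illustrative, not this library's wrapper class):

```python
import torch
from pathlib import Path
from transformers import AutoModelForSequenceClassification

# Generic PyTorch save/load pattern for a plain Hugging Face model.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

Path("outputs").mkdir(exist_ok=True)
torch.save(model.state_dict(), "outputs/model_state.pt")  # save weights to a file, not a directory

# Later: rebuild the same architecture and load the saved weights into it.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.load_state_dict(torch.load("outputs/model_state.pt"))
model.eval()
```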