syuoni / eznlp

Easy Natural Language Processing
Apache License 2.0

DocRED Joint extraction (sequence length problem, subtokens) #37

Closed SylvainVerdy closed 1 year ago

SylvainVerdy commented 1 year ago

Hi,

Thanks a lot for this amazing framework.

I'm working with the deep span representations model from ACL 2023. I have already adapted it to CoNLL 2004, and I'm now trying to adapt the model to the DocRED dataset, but I'm running into a sequence length problem.

The traceback is:

```
/miniconda3/envs/eznlp/lib/python3.8/site-packages/eznlp/model/bert_like.py", line 121, in _token_ids_from_tokenized
    assert len(sub_tokens) <= self.tokenizer.model_max_length - 2
AssertionError
```

Model used: distilroberta-base. Do you have an idea how to solve this problem?

Best regards,

Sylvain

syuoni commented 1 year ago

Hi Sylvain,

It seems that your input sequence is so long that its subword tokens exceed the model's maximum input length. As you may know, BERT/RoBERTa only accept subword sequences no longer than 512 tokens, two of which are reserved for the special tokens, hence the assertion against `model_max_length - 2`.
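
For instance, you can check which documents exceed the limit with a quick script like the one below. This is only a rough sketch that uses the Hugging Face tokenizer directly (the `words` list is a placeholder for one pre-tokenized document; it is not eznlp code):

```python
# Rough check (not part of eznlp): count the subword tokens of one
# pre-tokenized document and compare against the model's limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

# Placeholder: one DocRED-style document as a list of words.
words = ["An", "example", "document", "with", "many", "words"]

# Tokenize word by word, since the words are already split.
sub_tokens = [st for w in words for st in tokenizer.tokenize(w)]

# Two positions are reserved for the special tokens <s> and </s>.
limit = tokenizer.model_max_length - 2
print(f"{len(sub_tokens)} sub-tokens (limit: {limit})")
```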

You may truncate the over-long input sequence, or segment it into shorter ones, so that the model can accept the input. A sketch of the second option follows.
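
A greedy segmentation along word boundaries could look like the sketch below. Note that this is only an illustration (the helper `segment_words` is an assumption, not part of eznlp), and for DocRED it is usually preferable to split at sentence boundaries:

```python
# Sketch: greedily split an over-long pre-tokenized document into chunks
# whose subword length stays within the encoder's limit. Not eznlp code.
from typing import List
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
MAX_SUBTOKENS = tokenizer.model_max_length - 2  # reserve <s> and </s>

def segment_words(words: List[str], max_subtokens: int = MAX_SUBTOKENS) -> List[List[str]]:
    """Split a word list into chunks that tokenize to at most max_subtokens subwords."""
    chunks, current, current_len = [], [], 0
    for word in words:
        n = len(tokenizer.tokenize(word))
        if current and current_len + n > max_subtokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append(word)
        current_len += n
    if current:
        chunks.append(current)
    return chunks
```

Entity mentions and relations whose spans end up in different chunks would need extra handling (e.g. dropping or re-annotating them), which is another reason why sentence-level segmentation is often the safer choice.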

Best, Enwei