kamalkraj / BERT-NER

Pytorch-Named-Entity-Recognition-with-BERT
GNU Affero General Public License v3.0

How can we use the BertForTokenClassification model for lengthy sentences? #77

Open Swty13 opened 4 years ago

Swty13 commented 4 years ago

Hi, since BERT tokenization only supports sequences of up to 512 tokens, how can I proceed if my text is longer than that? I used BertForTokenClassification for an entity recognition task, but because my text is long, a warning appears -- "Token indices sequence length is longer than the specified maximum sequence length for this BERT model (527 > 512). Running this sequence through BERT will result in indexing errors". I don't want to trim or truncate my text, as that would lose important information; I have to pass my whole text. Could you please suggest what I should do, or do you have any other idea for implementing named entity recognition?

Thanks in advance.

kowshik226 commented 4 years ago

Hi @Swty13 ,

The pretrained model's positional embedding size is set to 512; you can see this in pytorch_transformers/tokenization_bert.py, line 50: PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {'bert-base-uncased': 512}
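For reference, a quick way to read that cap at runtime (a sketch, assuming the pytorch_transformers package this repo uses; in the newer transformers library the attribute is model_max_length):

    from pytorch_transformers import BertTokenizer

    # The tokenizer carries the 512 cap from PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES.
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    print(tokenizer.max_len)  # 512

    text = "some very long document ..."
    if len(tokenizer.tokenize(text)) > tokenizer.max_len:
        print("too long for a single pass; the text needs chunking")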

Try changing the following line in run_ner.py (line 216) and bert.py (line 79): input_ids = tokenizer.convert_tokens_to_ids(ntokens) to input_ids = tokenizer.tokenize(ntokens)

I am having the same issue; I will try to find a solution.

If you find a solution, please let me know.

Best, kowshik

Swty13 commented 4 years ago

@kowshik226

Hi, thanks for replying; I will definitely try this. Could you please help me with what kind of hardware configuration is required to train a custom NER model on a dataset of 1 lakh (100,000) examples?

(Currently I have VM servers with 32 GB and 64 GB of RAM. What configuration should I choose, and is a GPU a must for training a BERT model? I am new to BERT, so I have no idea about it.)

Thanks

ranjeetds commented 4 years ago

@Swty13 For the first question:

  1. Cut your input into sections of 512 tokens and pass them iteratively for inference. (I have implemented this.)

Cons - You might lose some context due to cutting the sequence at an arbitrary position; a sliding-window variant that mitigates this is sketched below.
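One way to soften that con is to chunk with overlapping windows, so boundary tokens still appear in a window where they have context on both sides. A minimal sketch of the chunking step, with an illustrative 500-token window and 250-token stride (neither value is from this repo):

    def chunk_with_overlap(tokens, window=500, stride=250):
        """Yield overlapping slices of at most `window` tokens."""
        if len(tokens) <= window:
            yield tokens
            return
        start = 0
        while start < len(tokens):
            yield tokens[start:start + window]
            if start + window >= len(tokens):
                break
            start += stride

When merging per-chunk predictions, a token covered by two windows can take its label from the window where it sits farther from the edge.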

For your second question -

You can always train your model on a CPU with a 1 lakh (100,000) example dataset (I am assuming sentences) with the RAM you mentioned.

raff7 commented 4 years ago

There is no way to do it without splitting the input. That is because Google released the pretrained version of BERT with the 512 limitation, and to remove that limitation you would basically have to pretrain BERT from scratch, which is unfeasible and would cost a lot of money.

I solved it by splitting the input into chunks of fewer than 500 tokens each, always splitting at the closest period.

# Assumed imports: Ner is the inference wrapper from this repo's bert.py,
# and BertTokenizer comes from pytorch_transformers.
from pytorch_transformers import BertTokenizer
from bert import Ner

def predict(ex):
    modelDir = "./NamedEntityExtraction/bert/out_base/"
    model = Ner(modelDir)
    tokenizer = BertTokenizer.from_pretrained(modelDir, do_lower_case=False)
    tokens = tokenizer.tokenize(ex)
    splitEx = []
    while len(tokens) > 500:
        # Find the last period within the first 500 tokens and cut just
        # after it. (This assumes every 500-token window contains a
        # period; .index('.') raises ValueError otherwise.)
        idx = tokens[:500][::-1].index('.')
        idx = 500 - idx
        splitEx.append(tokenizer.convert_tokens_to_string(tokens[:idx]))
        tokens = tokens[idx:]

    splitEx.append(tokenizer.convert_tokens_to_string(tokens))
    # Run NER on each chunk and concatenate the predictions.
    output = []
    for e in splitEx:
        output.extend(model.predict(e))
    return output

Edit: I use 500 instead of 512 just because when I use 512 I sometimes still get an error for some reason, possibly because of additional tokens added by the model itself.
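For what it's worth, BERT does add the special tokens [CLS] and [SEP] around a single-sentence input, which consume 2 of the 512 positions, leaving 510 for the actual text. A quick illustration (a sketch, assuming pytorch_transformers):

    from pytorch_transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # The 512-position limit applies to the full sequence including the
    # special tokens, so only 510 positions remain for the text itself.
    tokens = ['[CLS]'] + tokenizer.tokenize('hello world') + ['[SEP]']
    print(tokens)       # ['[CLS]', 'hello', 'world', '[SEP]']
    print(len(tokens))  # 4, of which 2 are special tokens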