Swty13 opened this issue 4 years ago
Hi @Swty13 ,
The pretrained model's positional embedding size has been set to 512; you can see this in pytorch_transformers/tokenization_bert.py, line 50: `PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {'bert-base-uncased': 512}`.
Try changing the line below in run_ner.py (line 216) and bert.py (line 79) from `input_ids = tokenizer.convert_tokens_to_ids(ntokens)` to `input_ids = tokenizer.tokenize(ntokens)`.
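For reference, a quick way to check whether a given text exceeds the limit before running the model (a minimal sketch, assuming the pytorch_transformers package referenced above):

```python
# Minimal sketch: count BERT word-piece tokens for a text.
# Counts above 512 are what trigger the indexing warning quoted in this thread.
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Some long document ..."  # your input text here
print(len(tokenizer.tokenize(text)))
```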
I am having the same issue; I will try to find a solution.
If you find one, please let me know.
Best,
kowshik
@kowshik226
Hi, thanks for replying, I will definitely try this. Could you please help me understand what kind of hardware configuration is required to train a custom NER model on a 1 lakh (100,000) dataset?
(Currently I have VM servers with 32 GB and 64 GB of RAM. Which configuration should I choose, and is a GPU a must for training a BERT model? I am new to BERT, so I have no idea about it.)
Thanks
@Swty13 For the first question: split the input into chunks that fit within the 512-token limit.
Cons - You might lose some context due to cutting the sequence at an arbitrary position.
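One way to soften that con (not from this thread, just a hedged sketch) is to overlap consecutive chunks, so every cut point also appears with surrounding context in a neighboring window; you then have to reconcile the duplicated predictions yourself.

```python
# Hedged sketch: overlapping (sliding-window) chunking over a token list.
# `size` and `stride` are illustrative values, not from this thread.
def sliding_windows(tokens, size=500, stride=250):
    start = 0
    while True:
        yield tokens[start:start + size]
        if start + size >= len(tokens):
            break
        start += stride
```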
For your second question -
You can always train your model on a CPU with a 1 lakh (100,000) dataset (I am assuming sentences) with the RAM you mentioned.
There is no way to do it without splitting the input. That is because Google released the pretrained version of BERT with the 512-token limitation, and to remove that limitation you would basically have to pretrain BERT from scratch, which is unfeasible and would cost a lot of money.
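To see why the limit is baked into the pretrained weights (a minimal sketch, assuming pytorch_transformers): the positional-embedding matrix simply has 512 rows, one per position the model was pretrained on.

```python
# The 512 limit comes from the pretrained positional-embedding matrix itself.
from pytorch_transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
# Expected shape: torch.Size([512, 768]) for bert-base-uncased
print(model.embeddings.position_embeddings.weight.shape)
```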
I solved it by splitting the input into chunks of fewer than 500 tokens each, always splitting at the closest period:
```python
# Assumed imports: BertTokenizer from pytorch_transformers and the Ner
# wrapper from this repo's bert.py (adjust to your project layout).
from pytorch_transformers import BertTokenizer
from bert import Ner

def predict(ex):
    modelDir = "./NamedEntityExtraction/bert/out_base/"
    model = Ner(modelDir)
    tokenizer = BertTokenizer.from_pretrained(modelDir, do_lower_case=False)
    tokens = tokenizer.tokenize(ex)
    splitEx = []
    while len(tokens) > 500:
        # Find the last period within the first 500 tokens and cut just
        # after it. (Raises ValueError if there is no '.' in that window.)
        idx = tokens[:500][::-1].index('.')
        idx = 500 - idx
        splitEx.append(tokenizer.convert_tokens_to_string(tokens[:idx]))
        tokens = tokens[idx:]
    splitEx.append(tokenizer.convert_tokens_to_string(tokens))
    # Run NER on each chunk and concatenate the predictions.
    output = []
    for e in splitEx:
        output.extend(model.predict(e))
    return output
```
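For instance (a hypothetical call; `article.txt` stands in for any long document):

```python
# Hypothetical usage: run the chunked predictor over a document longer
# than 512 word-piece tokens.
with open("article.txt") as f:
    long_text = f.read()

entities = predict(long_text)
print(entities[:5])  # first few predicted entities
```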
Edit: I use 500 instead of 512 just because when I use 512 I sometimes still get an error for some reason, possibly because of additional tokens added by the model itself (such as the [CLS] and [SEP] special tokens).
Hi, since BERT tokenization only supports sequences of up to 512 tokens, how can I proceed if my text is longer than 512? I used BertForTokenClassification for an entity recognition task, but because my text is long, a warning comes up: "Token indices sequence length is longer than the specified maximum sequence length for this BERT model (527 > 512). Running this sequence through BERT will result in indexing errors". I don't want to trim or truncate my text, as that loses important information; I have to pass my whole text. Could you please suggest what I should do, or do you have any other idea for implementing named entity recognition?
Thanks in advance.