wangxinyu0922 opened 4 years ago
By the way, is the batch split according to the jsonlines file? So what is the batch size?
So I use one document as a batch whenever possible; for Spanish, where there is no document splitting, I use 25 sentences per batch. I think the batch size does not matter that much: you can feed whatever size you want as long as the GPU memory is enough. For the CoNLL-2012 data, I used the converter from https://github.com/kentonl/e2e-coref/blob/master/minimize.py and converted the NER annotations to sentence level myself. For CoNLL 2002 and 2003, you can use the code below to create the jsonlines files:
```python
import json, os

conll_path = '/Users/juntao/Work/corpus/Shared_Tasks/CoNLL_2003_NER/data/'
lang = 'eng'
conll = 'conll03'
#conll = 'conll02'
ner_types = set()
ner_lens = [0, 0, 0, 0, 0]
word_lens = [0, 0, 0, 0]
max_word_len = 0
for dset in ['train', 'dev', 'test']:
    num_doc, num_sent, num_ner, num_token = 0, 0, 0, 0
    file = 'train' if dset == 'train' else ('testa' if dset == 'dev' else 'testb')
    writer = open('doc_level_json/%s.%s.%s.jsonlines' % (dset, lang, conll), 'w')
    part_id = 0
    sentences = []
    lemmas = []
    lem = []
    sent = []
    ner = []
    pre_len = 0
    pre = ''
    start = -1
    reader = open(os.path.join(conll_path, '%s.%s' % (lang, file)), 'r').readlines()
    global_length = len(reader)
    for lid, line in enumerate(reader):
        line = line.strip()
        if len(line) == 0:
            # blank line: close the current sentence (and any open entity span)
            if len(sent) > 0:
                if start >= 0:
                    ner.append([[start, len(sent) + pre_len - 1, pre[2:]]])
                    ner_types.add(pre[2:])
                assert len(lem) == len(sent)
                sentences.append(sent)
                lemmas.append(lem)
                pre_len += len(sent)
                sent = []
                lem = []
                pre = ''
                start = -1
        else:
            line = line.split()
            if not line[0] == '-DOCSTART-':
                # a 'B-' tag, or any tag change that is not a valid B->I
                # continuation, closes the open span and may start a new one
                if line[-1][:1] == 'B' or (line[-1] != pre and not (pre[:1] == 'B' and line[-1][:1] == 'I' and pre[1:] == line[-1][1:])):
                    if start >= 0:
                        ner.append([[start, len(sent) + pre_len - 1, pre[2:]]])
                        ner_types.add(pre[2:])
                    start = len(sent) + pre_len if line[-1] != 'O' else -1
                    pre = line[-1]
                sent.append(line[0])
                max_word_len = max(len(line[0]), max_word_len)
                wind = min(len(line[0]) // 10, len(word_lens) - 1)
                word_lens[wind] += 1
                lem.append(line[1].split('|')[0].lower())
        # for Spanish (no document boundaries): flush every 25 sentences
        # if lid == global_length - 1 or (len(line) == 0 and len(sentences) == 25):
        # for all other languages: flush at each -DOCSTART- line
        if lid == global_length - 1 or (len(line) > 0 and line[0] == '-DOCSTART-'):
            if len(sentences) == 0:
                continue
            writer.write(json.dumps({
                'doc_key': '%s_%d' % (dset, part_id),
                'sentences': sentences,
                # 'lemmas': lemmas,
                'clusters': ner
            }))
            writer.write('\n')
            num_doc += 1
            num_sent += len(sentences)
            num_ner += len(ner)
            num_token += sum(len(s) for s in sentences)
            sentences = []
            lemmas = []
            ner = []
            assert sent == [] and lem == []
            pre_len = 0
            pre = ''
            part_id += 1
    writer.close()
    print(dset, num_doc, num_sent, num_ner, num_token, '%.2f' % (num_ner * 100.0 / num_token))
print('["%s"]' % '","'.join(list(ner_types)))
print(word_lens, [w * 100.0 / sum(word_lens) for w in word_lens])
print(max_word_len)
```
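For reference, each line of the resulting jsonlines file is one JSON document whose `clusters` spans use document-level token indices (counted across all sentences in the document). A minimal sketch of reading one record back, using made-up tokens and spans rather than real CoNLL data:

```python
import json

# Hypothetical record in the format produced by the script above;
# each span is [start, end, type] with inclusive, document-level indices.
record = json.dumps({
    'doc_key': 'train_0',
    'sentences': [['EU', 'rejects', 'German', 'call'],
                  ['Peter', 'Blackburn']],
    'clusters': [[[0, 0, 'ORG']], [[2, 2, 'MISC']], [[4, 5, 'PER']]]
})

doc = json.loads(record)
# flatten the sentences so document-level indices can be resolved
tokens = [t for sent in doc['sentences'] for t in sent]
for cluster in doc['clusters']:
    start, end, label = cluster[0]
    print(label, tokens[start:end + 1])
# prints:
# ORG ['EU']
# MISC ['German']
# PER ['Peter', 'Blackburn']
```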
It seems that the conversion script does not determine the spans of named entities correctly. The original CoNLL file uses BIO tagging for the named entities, but the conversion script does not recognize it, which results in a wrong dataset in jsonlines format. I think this is the reason why I got inferior accuracy in #8 and #16.
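For context, BIO tagging marks each token as Beginning, Inside, or Outside an entity, so a `B-` tag (or a tag that is not a valid continuation of the previous one) must close the current span and start a new one. A minimal standalone sketch of that decoding, not tied to this repo's script (it leniently treats a stray `I-` after `O` as a new span):

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence to (start, end, type) spans, inclusive indices."""
    spans = []
    start, cur = -1, None
    for i, tag in enumerate(tags):
        # 'B-', 'O', or an 'I-' of a different type all end the open span
        ends_open = tag.startswith('B-') or tag == 'O' or (tag.startswith('I-') and tag[2:] != cur)
        if ends_open and start >= 0:
            spans.append((start, i - 1, cur))
            start, cur = -1, None
        # 'B-' always opens a span; a bare 'I-' with nothing open does too
        if tag.startswith('B-') or (tag.startswith('I-') and start < 0):
            start, cur = i, tag[2:]
    if start >= 0:  # close a span that runs to the end of the sequence
        spans.append((start, len(tags) - 1, cur))
    return spans

print(bio_to_spans(['B-PER', 'I-PER', 'O', 'B-ORG', 'B-ORG', 'I-MISC']))
# prints: [(0, 1, 'PER'), (3, 3, 'ORG'), (4, 4, 'ORG'), (5, 5, 'MISC')]
```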
@wangxinyu0922 Yes, you are right, the code is out of date; I forgot that I shared that code with you :) The code was corrected after that. But for CoNLL-03 English, even with the wrong code the impact is minimal: I remember this script generated only 5 wrong named mentions on the test set. Anyway, I updated the code above; its output is identical to #16.
I compared the output jsonlines file with the newest version and found 5846 sentences with different named entities in the training set. Maybe the script with only 5 wrong named mentions is another version. Anyway, I can train the biaffine NER model correctly now.
@wangxinyu0922 That's odd, I must have copied some very old code there; sorry about the wrong code. But glad you can now train the model correctly :)
@wangxinyu0922 If you just need eng_conll2003, you can use my script mentioned in this issue, though it's not elegant.
@zhaoxf4 Sure, I tried your dataset and successfully reproduced the accuracy of the best model.
Hi, I'm very interested in your great work. To make running the code simpler, is there any conversion script that converts the .conll files to .jsonlines files? Thank you!