juntaoy / biaffine-ner

Named Entity Recognition as Dependency Parsing
Apache License 2.0

Conversion script #7

Open wangxinyu0922 opened 4 years ago

wangxinyu0922 commented 4 years ago

Hi, I'm very interested in your great work. To make it easier to run the code, is there a conversion script that converts the .conll files to .jsonlines files?

Thank you!

wangxinyu0922 commented 4 years ago

By the way, are the batches split according to the jsonlines file? If so, what is the batch size?

juntaoy commented 4 years ago

I use one document as a batch whenever possible; for Spanish, where there is no document splitting, I use 25 sentences per batch. I don't think the batch size matters that much: you can feed whatever size you want as long as it fits in GPU memory. For the CoNLL-12 data, I used the converter from https://github.com/kentonl/e2e-coref/blob/master/minimize.py and converted the named entities to sentence level myself. For CoNLL 2002 and 2003 you can use the code below to create the jsonlines files:

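# Convert CoNLL 2002/2003 column-format NER files into the jsonlines format used by biaffine-ner (Python 3).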
import json,os

conll_path = '/Users/juntao/Work/corpus/Shared_Tasks/CoNLL_2003_NER/data/'
lang = 'eng'
conll = 'conll03'
#conll = 'conll02'
ner_types = set()
ner_lens = [0,0,0,0,0]
word_lens = [0,0,0,0]
max_word_len = 0
for dset in ['train','dev','test']:
  num_doc,num_sent,num_ner,num_token = 0,0,0,0
  file = 'train' if dset == 'train' else ('testa' if dset=='dev' else 'testb')
  writer = open('doc_level_json/%s.%s.%s.jsonlines' % (dset, lang, conll), 'w')  # text mode, since json.dumps returns str
  part_id = 0
  sentences = []
  lemmas = []
  lem = []
  sent = []
  ner = []
  pre_len = 0
  pre = ''
  start = -1

  reader = open(os.path.join(conll_path, '%s.%s' % (lang, file)), 'r').readlines()
  global_length = len(reader)

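  # Walk the file line by line: blank lines end sentences, '-DOCSTART-' lines end documents.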
  for lid, line in enumerate(reader):
    line = line.strip()
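    # A blank line marks a sentence boundary: close any open entity and flush the sentence.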
    if len(line)==0:
      if len(sent) > 0:
        if start >= 0:
          ner.append([[start, len(sent) + pre_len - 1, pre[2:]]])
          ner_types.add(pre[2:])
        #print sent
        assert len(lem) == len(sent)
        sentences.append(sent)
        lemmas.append(lem)
        pre_len+=len(sent)
        sent = []
        lem = []
        pre = ''
        start=-1
    else:
      line = line.split()
      if not line[0] == '-DOCSTART-':
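        # A new entity starts on an explicit B- tag, or on any tag change that is not a B-X -> I-X continuation.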
        if line[-1][:1]=='B' or (line[-1] != pre and not (pre[:1] == 'B' and line[-1][:1] == 'I' and pre[1:] == line[-1][1:])):
          if start >=0:
            ner.append([[start, len(sent)+pre_len-1,pre[2:]]])
            ner_types.add(pre[2:])
          start = len(sent)+pre_len if line[-1] != 'O' else -1
          pre = line[-1]
        sent.append(line[0])
        max_word_len = max(len(line[0]),max_word_len)
        wind = min(len(line[0]) // 10, len(word_lens) - 1)  # integer division so this is a valid list index
        word_lens[wind]+=1
        lem.append(line[1].split('|')[0].lower())

    # For Spanish (no '-DOCSTART-' markers), cut a document every 25 sentences instead:
    # if lid == global_length - 1 or (len(line) == 0 and len(sentences) == 25):
    # For all other languages, cut a document at '-DOCSTART-' markers and at the end of the file:
    if lid == global_length - 1 or (len(line) > 0 and line[0] == '-DOCSTART-'):
      if len(sentences) == 0:
        continue
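      # Emit the finished document as one JSON line; each entity is a singleton "cluster"
      # [start, end, type] with inclusive, document-level token offsets.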
      writer.write(json.dumps({
        'doc_key':'%s_%d' % (dset,part_id),
        'sentences': sentences,
        # 'lemmas':lemmas,
        'clusters':ner
      }))
      writer.write('\n')
      #print sentences
      num_doc += 1
      num_sent += len(sentences)
      num_ner += len(ner)
      num_token += sum(len(s) for s in sentences)

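      # Reset the per-document state for the next document.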
      sentences = []
      lemmas = []
      ner = []
      assert sent == [] and lem == []  # both buffers must already be flushed at a document boundary
      pre_len = 0
      pre = ''
      part_id += 1

  writer.close()
  print(dset, num_doc, num_sent, num_ner, num_token, '%.2f' % (num_ner * 100.0 / num_token))
  print('["%s"]' % '","'.join(ner_types))
  print(word_lens, [w * 100.0 / sum(word_lens) for w in word_lens])
  print(max_word_len)
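
For reference, each line of the output file is one JSON document with the doc_key, sentences and clusters keys written above. A minimal sketch for sanity-checking the output (the file path below is just an example, adjust it to your own):

import json

path = 'doc_level_json/train.eng.conll03.jsonlines'  # example path
with open(path) as f:
    for line in f:
        doc = json.loads(line)
        # Flatten sentences into one document-level token list to match the span offsets.
        tokens = [tok for sent in doc['sentences'] for tok in sent]
        for cluster in doc['clusters']:
            for start, end, label in cluster:  # inclusive, document-level offsets
                print(doc['doc_key'], label, ' '.join(tokens[start:end + 1]))
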
wangxinyu0922 commented 4 years ago

It seems that the conversion script does not determine the spans of named entities correctly. The original conll file encodes the named entities with BIO tags, but the conversion script does not handle them, which results in a wrong dataset in jsonlines format. I think this is the reason why I got inferior accuracy in #8 and #16.
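
For concreteness, a minimal standalone sketch (not the repo's code) of how BIO tags map to inclusive (start, end, type) spans, which is what a correct conversion has to produce:

def bio_to_spans(tags):
    """Convert a BIO tag sequence into inclusive (start, end, type) spans."""
    spans, start, cur = [], -1, 'O'
    for i, tag in enumerate(tags):
        # Close the open span on 'O', on a new 'B-' tag, or when the entity type changes.
        if cur != 'O' and (tag == 'O' or tag[0] == 'B' or tag[2:] != cur[2:]):
            spans.append((start, i - 1, cur[2:]))
            cur = 'O'
        if tag[0] == 'B' or (tag[0] == 'I' and cur == 'O'):
            start, cur = i, tag
    if cur != 'O':
        spans.append((start, len(tags) - 1, cur[2:]))
    return spans

# ['B-PER', 'I-PER', 'O', 'B-ORG'] -> [(0, 1, 'PER'), (3, 3, 'ORG')]
print(bio_to_spans(['B-PER', 'I-PER', 'O', 'B-ORG']))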

juntaoy commented 4 years ago

@wangxinyu0922 yes, you are right, the code above was out of date; I forgot that I had shared it with you :) The code was corrected after that. For CoNLL-03 English, though, even the wrong code has minimal impact: I remember this script generated only 5 wrong named mentions on the test set. Anyway, I have updated the code above, and its output is identical to #16.

wangxinyu0922 commented 4 years ago

I compared the output jsonlines file with the newest version and found 5846 sentences with different named entities in the training set. Maybe the script with only 5 wrong named mentions was another version. Anyway, I can train the biaffine NER model correctly now.

juntaoy commented 4 years ago

@wangxinyu0922 that's odd, I must have copied some very old code there; sorry about the wrong code. But I'm glad you can now train the model correctly :)

zhaoxf4 commented 4 years ago

@wangxinyu0922 if you just need eng_conll2003, you can use my script mentioned in this issue, though it's not elegant.

wangxinyu0922 commented 4 years ago

@zhaoxf4 Sure, I tried your dataset and successfully reproduced the accuracy of the best model.