what are your train.json formats of genia in datasets?

fantao766 commented 1 year ago

Hello, I'm very impressed by your insight and very interested in the model and the innovations in this paper. I ran into some errors when running the code from GitHub, and I have no idea how to deal with them. So I would like to ask one question: what is the format of your train.json for GENIA in datasets? I couldn't run data_preprocess in the first place, so I tried to use my own data_preprocess code to generate train.json ... , but unfortunately it doesn't work. Hope to get an answer. Thanks~!

kristinlindquist commented 1 year ago

Hi! If I understand your question, you ran into the same problem that I did (all the links to the mysteriously preprocessed genia dataset are 404s). I found some files here - https://github.com/yhcc/CNN_Nested_NER/tree/master/preprocess/outputs/genia - which I was able to convert with this adjusted method:

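import json
import os

def format_data(input_path, output_path, convert):
    """Convert a JSON-lines file with `convert` (the task-specific function), skipping rows that fail."""
    # Only create the parent directory if the output path actually has one.
    out_dir = os.path.dirname(output_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    entities, docs = 0, 0
    with open(output_path, 'w', encoding='utf-8') as fw, open(input_path, encoding='utf-8') as fr:
        ids = set()
        for idx, ln in enumerate(fr):
            if ln == '\n':
                continue
            example = json.loads(ln)
            try:
                example = convert(example)
            except Exception as e:
                # Some rows in that dataset are flawed; report and skip them.
                print(f"Could not convert line {idx}; skipping ({e!r}).")
                continue
            entities += len(example["entity_types"])
            docs += 1
            # Every example id must be unique within the file.
            assert example['id'] not in ids
            ids.add(example['id'])
            fw.write(json.dumps(example) + '\n')
    print(f"Entities: {entities}")
    print(f"Docs: {docs}")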

(Edit: I was off by one in the previous code. There are some flawed rows in that dataset, but now I'm just skipping them.)

Veranchos commented 1 year ago

Hi @kristinlindquist! Thank you for your help! I have a question: what was your convert function for the GENIA .jsonlines files in that case?

helleuch commented 1 year ago

Hello, thank you very much for your help. I would also like to ask about the function used to convert the jsonlines files, please. Thank you again!

Veranchos commented 1 year ago

@helleuch The following function worked for me:

from typing import Dict

def convert_genia(example: Dict) -> Dict:
    # Join the tokens with single spaces and record each token's
    # (start, end) character offsets in the resulting text.
    offset_mapping = []
    text = ''
    for token in example['tokens']:
        if text == '':
            offset_mapping.append((0, len(token)))
            text += token
        else:
            text += ' ' + token
            offset_mapping.append((len(text) - len(token), len(text)))
    entity_types, entity_start_chars, entity_end_chars = [], [], []
    for ann in example['entity_mentions']:
        start = ann["start"]
        end = ann["end"]
        entity_type = ann["entity_type"]
        # Map token indices to character offsets. The "- 1" on both indices
        # treats start/end as 1-based and inclusive; a start of 0 would wrap
        # around to the last token, so check this against your data.
        start, end = offset_mapping[start - 1][0], offset_mapping[end - 1][1]
        entity_types.append(entity_type)
        entity_start_chars.append(start)
        entity_end_chars.append(end)
    # Split the (start, end) pairs into two parallel sequences.
    start_words, end_words = zip(*offset_mapping)
    return {
        'text': text,
        'entity_types': entity_types,
        'entity_start_chars': entity_start_chars,
        'entity_end_chars': entity_end_chars,
        'id': example['sent_id'],
        'word_start_chars': start_words,
        'word_end_chars': end_words
    }
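To make the resulting format concrete, here is a small self-contained sketch that builds one record by hand. The sample sentence, its id, and the helper name token_char_offsets are made up for illustration, and the entity indices assume the same 1-based, inclusive convention that the "- 1" in convert_genia implies. I have not checked this against the repository's own preprocessing.

```python
import json
from typing import Dict, List, Tuple

def token_char_offsets(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of each token in ' '.join(tokens)."""
    offsets, pos = [], 0
    for token in tokens:
        offsets.append((pos, pos + len(token)))
        pos += len(token) + 1  # +1 for the joining space
    return offsets

# Hypothetical input line with the fields that convert_genia reads.
sample: Dict = {
    "sent_id": "demo-0",
    "tokens": ["IL-2", "gene", "expression", "requires", "NF-kappa", "B", "."],
    "entity_mentions": [
        {"start": 1, "end": 2, "entity_type": "DNA"},
        {"start": 5, "end": 6, "entity_type": "protein"},
    ],
}

text = " ".join(sample["tokens"])
offsets = token_char_offsets(sample["tokens"])
mentions = sample["entity_mentions"]

# Assumption: start/end are 1-based, inclusive token indices.
record = {
    "text": text,
    "entity_types": [m["entity_type"] for m in mentions],
    "entity_start_chars": [offsets[m["start"] - 1][0] for m in mentions],
    "entity_end_chars": [offsets[m["end"] - 1][1] for m in mentions],
    "id": sample["sent_id"],
    "word_start_chars": [s for s, _ in offsets],
    "word_end_chars": [e for _, e in offsets],
}

# One line of the output file is simply this record serialized as JSON.
print(json.dumps(record))
```

Slicing the text with the stored offsets should give back the entity strings, e.g. text[0:9] is "IL-2 gene" here, which is a quick way to spot an indexing mismatch.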

... and the corresponding changes in the main function:

import json
import os

def main(args):
    # Pick the task-specific converter; convert_conll2003 and convert_default
    # are defined elsewhere in the same script.
    if args.task == "conll2003":
        convert = convert_conll2003
    elif args.task == "genia":
        convert = convert_genia
    else:
        convert = convert_default
    # Only create the parent directory if the output path actually has one.
    out_dir = os.path.dirname(args.output)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    entities, docs = 0, 0
    with open(args.output, 'w', encoding='utf-8') as fw, open(args.input, encoding='utf-8') as fr:
        ids = set()
        for idx, ln in enumerate(fr):
            if ln == '\n':
                continue
            example = json.loads(ln)
            try:
                example = convert(example)
            except Exception as e:
                # Flawed rows are reported and skipped instead of aborting the run.
                print(f"Could not convert line {idx}; skipping ({e!r}).")
                continue
            entities += len(example["entity_types"])
            docs += 1
            # Every example id must be unique within the file.
            assert example['id'] not in ids
            ids.add(example['id'])
            fw.write(json.dumps(example) + '\n')
    print(f"Entities: {entities}")
    print(f"Docs: {docs}")
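Since some rows only fail quietly (for example, a wrong index convention can yield a start offset after the end offset), it may help to sanity-check the file that gets written. The following is a minimal sketch, not part of the repository. The names check_record and check_file are hypothetical, and the checks themselves (spans inside the text, spans on word boundaries) are my assumptions about what a well-formed record looks like.

```python
import json
from typing import Dict, List

def check_record(rec: Dict) -> List[str]:
    """Return a list of human-readable problems found in one converted record."""
    problems = []
    n = len(rec["entity_types"])
    if not (len(rec["entity_start_chars"]) == len(rec["entity_end_chars"]) == n):
        problems.append("entity lists differ in length")
        return problems
    word_starts = set(rec["word_start_chars"])
    word_ends = set(rec["word_end_chars"])
    spans = zip(rec["entity_types"], rec["entity_start_chars"], rec["entity_end_chars"])
    for etype, start, end in spans:
        if not 0 <= start < end <= len(rec["text"]):
            problems.append(f"{etype}: bad span ({start}, {end})")
        elif start not in word_starts or end not in word_ends:
            problems.append(f"{etype}: span ({start}, {end}) is not on word boundaries")
    return problems

def check_file(path: str) -> int:
    """Print problems per line of a JSON-lines file; return how many records had any."""
    bad = 0
    with open(path, encoding="utf-8") as fr:
        for idx, ln in enumerate(fr):
            if not ln.strip():
                continue
            problems = check_record(json.loads(ln))
            if problems:
                bad += 1
                print(f"line {idx}: {'; '.join(problems)}")
    return bad
```

Calling check_file on the generated train.json and getting 0 back means every record passed these particular checks; it does not guarantee that the offsets match the original annotations.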

helleuch commented 1 year ago

@Veranchos Thank you very much!

microsoft / binder

what are your train.json formats of genia in datasets? #8