fantao766 opened 1 year ago
Hi! If I understand your question, you ran into the same problem that I did (all the links to the mysteriously preprocessed genia dataset are 404s). I found some files here - https://github.com/yhcc/CNN_Nested_NER/tree/master/preprocess/outputs/genia - which I was able to convert with this adjusted method:
```python
import json
import os


def format_data(input_path, output_path):
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    entities, docs = 0, 0
    with open(output_path, 'w', encoding='utf-8') as fw, open(input_path, encoding='utf-8') as fr:
        ids = set()
        for idx, ln in enumerate(fr):
            if ln == '\n':
                continue
            example = json.loads(ln)
            try:
                # task-specific converter, assumed to be defined at module level
                example = convert(example)
            except Exception:
                print(f"Could not convert line {idx}; skipping.")
                continue
            entities += len(example["entity_types"])
            docs += 1
            assert example['id'] not in ids  # ids must be unique
            ids.add(example['id'])
            fw.write(json.dumps(example) + '\n')
    print(f"Entities: {entities}")
    print(f"Docs: {docs}")
```
(Edit: I was off by one in the previous code; there are some flawed rows in that dataset, which I now just skip.)
Hi @kristinlindquist! Thank you for your help! I have a question: what was your convert function for the GENIA .jsonlines in that case?
Hello, thank you very much for your help. I would also like to ask about the function you used to convert the jsonlines. Thank you again!
@helleuch The following function worked for me:
```python
from typing import Dict


def convert_genia(example: Dict) -> Dict:
    # Rebuild the raw text from tokens, recording each token's character span.
    offset_mapping = []
    text = ''
    for token in example['tokens']:
        if text == '':
            offset_mapping.append((0, len(token)))
            text += token
        else:
            text += ' ' + token
            offset_mapping.append((len(text) - len(token), len(text)))
    # Map token-level entity spans to character offsets.
    entity_types, entity_start_chars, entity_end_chars = [], [], []
    for ann in example['entity_mentions']:
        start = ann["start"]
        end = ann["end"]
        entity_types.append(ann["entity_type"])
        entity_start_chars.append(offset_mapping[start - 1][0])
        entity_end_chars.append(offset_mapping[end - 1][1])
    start_words, end_words = zip(*offset_mapping)
    return {
        'text': text,
        'entity_types': entity_types,
        'entity_start_chars': entity_start_chars,
        'entity_end_chars': entity_end_chars,
        'id': example['sent_id'],
        'word_start_chars': start_words,
        'word_end_chars': end_words,
    }
```
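For what it's worth, here is a minimal, self-contained sanity check of that converter on a made-up sentence. The tokens, the `sent_id`, and the 1-based inclusive token indices (implied by the `- 1` offsets in the snippet) are all my assumptions, not real GENIA data:

```python
from typing import Dict


def convert_genia(example: Dict) -> Dict:
    # Same logic as the snippet above, reproduced so this runs standalone.
    offset_mapping, text = [], ''
    for token in example['tokens']:
        if text:
            text += ' ' + token
            offset_mapping.append((len(text) - len(token), len(text)))
        else:
            offset_mapping.append((0, len(token)))
            text += token
    entity_types, entity_start_chars, entity_end_chars = [], [], []
    for ann in example['entity_mentions']:
        entity_types.append(ann["entity_type"])
        entity_start_chars.append(offset_mapping[ann["start"] - 1][0])
        entity_end_chars.append(offset_mapping[ann["end"] - 1][1])
    start_words, end_words = zip(*offset_mapping)
    return {'text': text, 'entity_types': entity_types,
            'entity_start_chars': entity_start_chars,
            'entity_end_chars': entity_end_chars,
            'id': example['sent_id'],
            'word_start_chars': start_words, 'word_end_chars': end_words}


# Hypothetical example in the CNN_Nested_NER-style schema (not real GENIA data).
example = {
    'sent_id': 'toy-0',
    'tokens': ['IL-2', 'gene', 'expression', 'requires', 'NF-kappa', 'B'],
    'entity_mentions': [{'start': 1, 'end': 2, 'entity_type': 'DNA'}],
}
out = convert_genia(example)
# Slicing the text by the converted offsets should recover the mention tokens.
span = out['text'][out['entity_start_chars'][0]:out['entity_end_chars'][0]]
print(span)  # → IL-2 gene
```

If your copy of the data turns out to use 0-based indices instead, the `- 1` offsets would need to go, so it's worth running a check like this against a few real rows first.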
... and the corresponding changes in the main function:
```python
def main(args):
    # Pick the task-specific converter.
    if args.task == "conll2003":
        convert = convert_conll2003
    elif args.task == "genia":
        convert = convert_genia
    else:
        convert = convert_default

    os.makedirs(os.path.dirname(args.output), exist_ok=True)
    entities, docs = 0, 0
    with open(args.output, 'w', encoding='utf-8') as fw, open(args.input, encoding='utf-8') as fr:
        ids = set()
        for idx, ln in enumerate(fr):
            if ln == '\n':
                continue
            example = json.loads(ln)
            try:
                example = convert(example)
            except Exception:
                print(f"Could not convert line {idx}; skipping.")
                continue
            entities += len(example["entity_types"])
            docs += 1
            assert example['id'] not in ids  # ids must be unique
            ids.add(example['id'])
            fw.write(json.dumps(example) + '\n')
    print(f"Entities: {entities}")
    print(f"Docs: {docs}")
```
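In case it helps anyone wiring this up: `main` assumes an `args` object with `task`, `input`, and `output` attributes. A minimal argparse sketch that would supply them (the flag names and defaults are my guess, not necessarily what the repo's original script uses):

```python
import argparse


def parse_args(argv=None):
    # Hypothetical CLI wiring; the actual repo may differ.
    parser = argparse.ArgumentParser(
        description="Convert NER datasets to the expected JSONL format.")
    parser.add_argument("--task", default="default",
                        help="dataset name, e.g. conll2003 or genia")
    parser.add_argument("--input", required=True,
                        help="path to the source .jsonlines file")
    parser.add_argument("--output", required=True,
                        help="path to write the converted JSONL")
    return parser.parse_args(argv)


# Example invocation with explicit argv (no real files are touched here).
args = parse_args(["--task", "genia",
                   "--input", "genia/train.jsonlines",
                   "--output", "out/train.json"])
print(args.task, args.output)  # → genia out/train.json
```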
@Veranchos Thank you very much !
Hello, I'm very impressed by your insight and greatly interested in the model and the contributions of this paper. I ran into some errors when running the code from github.com and have no idea how to resolve them, so I'd like to ask one question: what is the format of your GENIA train.json in the datasets? I couldn't get data_preprocess to run at the beginning, so I tried to generate train.json etc. with my own preprocessing code, but unfortunately it doesn't work. Hoping for an answer. Thanks!