PyTorch implementation and pre-trained models for ASP - Autoregressive Structured Prediction with Language Models, EMNLP 22. https://arxiv.org/pdf/2210.14698.pdf
MIT License
Missing entities on data preparation with conll03_to_json.py #8
I really like the paper and the idea! And also thank you for releasing the code base!
I am currently working on my master's thesis and I am planning to augment this architecture with knowledge infusion.
While doing so, I encountered an issue with the code that converts the CoNLL03 dataset to the required JSON structure.
In the tables below, you can see that your code (denoted eth_asp) misses 27 entities across the train, dev and test sets.
Your code does not check for an entity that is still open when a document ends, so entities at the end of a document are not recognized.
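A count like this can be cross-checked independently of the converter. The sketch below is a minimal, hypothetical helper (`count_entities` is not part of the repository) that counts entity spans directly from the tag column. It assumes whitespace-separated CoNLL03 lines with the NER tag in the last field, and it accepts both IOB1 and IOB2 tagging.

```python
def count_entities(lines):
    """Count entity spans in CoNLL-style lines (assumed: NER tag is the last field)."""
    count = 0
    prev_tag = "O"
    for raw in lines:
        line = raw.strip()
        # Document markers and blank lines end any running span.
        if not line or line.startswith("-DOCSTART-"):
            prev_tag = "O"
            continue
        tag = line.split()[-1]
        if tag.startswith("B-"):
            # An explicit begin tag always opens a new span.
            count += 1
        elif tag.startswith("I-") and prev_tag[2:] != tag[2:]:
            # An inside tag that does not continue a span of the same type
            # opens a new span (this covers IOB1-style files).
            count += 1
        prev_tag = tag
    return count
```

Comparing this number with the total length of the `entities` lists in the converted JSON should show whether the converter drops any spans.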
I propose the following changes to your code:
if line == "-DOCSTART- -X- -X- O":  # new doc
    if doc is not None:
        # when extended is not the same as tokens
        # mark where to copy from with <extra_id_22> and <extra_id_23>
        # E.g.
        # Extract entities such as apple, orange, lemon <extra_id_22> Give me a mango . <extra_id_23>
        # See ace05_to_json.py for example of extending the input
        # FIX: missing entities <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        if start is not None:
            doc['entities'].append({
                "type": current_type,
                "start": start,
                "end": idx if idx > start else idx + 1
            })
        # FIX: missing entities >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
        doc["extended"] = doc["tokens"]
        dataset.append(doc)
    doc = {
        "tokens": [],  # list of tokens for the model to copy from
        "extended": [],  # list of input tokens. Prompts, instructions, etc. go here
        "entities": []  # list of dict:{"type": type, "start": start, "end": end}, format: [start, end)
    }
    idx, start = -1, None
    continue
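The same idea, closing any open entity whenever a document ends, can be shown in a self-contained form. The following is a minimal sketch and not the repository's conll03_to_json.py: the function name `conll_to_docs` and its helper are hypothetical. It assumes the NER tag is the last whitespace-separated field and treats sentence breaks (blank lines) as span boundaries too. Spans use the same `[start, end)` convention as above.

```python
def conll_to_docs(lines):
    """Convert CoNLL-style lines into docs, flushing open entities at every boundary."""
    dataset, doc = [], None
    start, current_type = None, None

    def new_doc():
        return {"tokens": [], "extended": [], "entities": []}

    def close_span(end):
        # Record the open entity (if any) as [start, end) and reset the state.
        nonlocal start, current_type
        if doc is not None and start is not None:
            doc["entities"].append(
                {"type": current_type, "start": start, "end": end})
        start, current_type = None, None

    for raw in lines:
        line = raw.strip()
        if line.startswith("-DOCSTART-"):
            if doc is not None:
                close_span(len(doc["tokens"]))  # flush at document end
                doc["extended"] = doc["tokens"]
                dataset.append(doc)
            doc = new_doc()
            continue
        if doc is None:
            doc = new_doc()  # tolerate input without a leading DOCSTART line
        idx = len(doc["tokens"])  # index the next token will get
        if not line:
            close_span(idx)  # assumption: entities do not cross sentences
            continue
        fields = line.split()
        token, tag = fields[0], fields[-1]
        if tag == "O":
            close_span(idx)
        elif tag.startswith("B-") or tag[2:] != current_type:
            close_span(idx)
            start, current_type = idx, tag[2:]
        doc["tokens"].append(token)

    if doc is not None:
        close_span(len(doc["tokens"]))  # flush at end of file as well
        doc["extended"] = doc["tokens"]
        dataset.append(doc)
    return dataset
```

Because the flush runs at document boundaries and again after the loop, an entity on the very last token of a document, or of the whole file, is still recorded.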
Best regards, Robin