lyutyuh / ASP

PyTorch implementation and pre-trained models for ASP - Autoregressive Structured Prediction with Language Models, EMNLP 22. https://arxiv.org/pdf/2210.14698.pdf
MIT License

Missing entities on data preparation with conll03_to_json.py #8

Open roebbert92 opened 1 year ago

Dear Tianyu Liu,

I really like the paper and the idea, and thank you for releasing the code base! I am currently working on my master's thesis and plan to augment this architecture with knowledge infusion.

While doing so, I ran into an issue with the script that converts the CoNLL03 dataset to the required JSON structure. In the tables below, you can see that your code (denoted eth_asp) misses 27 entities across the train, dev, and test sets.

[Image: conll03]

Your code does not check whether an entity is still open when a document ends, so entities at the very end of a document are never recorded.

I propose the following changes to your code:

            if line == "-DOCSTART- -X- -X- O":  # new doc
                if doc is not None:
                    # when extended is not the same as tokens
                    # mark where to copy from with <extra_id_22> and <extra_id_23>
                    # E.g.
                    # Extract entities such as apple, orange, lemon <extra_id_22> Give me a mango . <extra_id_23>
                    # See ace05_to_json.py for example of extending the input

                    # FIX: missing entities  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                    if start is not None:
                        doc['entities'].append({
                            "type":
                            current_type,
                            "start":
                            start,
                            "end":
                            idx if idx > start else idx + 1
                        })
                    # FIX: missing entities >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

                    doc["extended"] = doc["tokens"]
                    dataset.append(doc)
                doc = {
                    "tokens": [],  # list of tokens for the model to copy from
                    "extended":
                    [],  # list of input tokens. Prompts, instructions, etc. go here
                    "entities": [
                    ]  # list of dict:{"type": type, "start": start, "end": end}, format: [start, end)
                }
                idx, start = -1, None
                continue
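To illustrate the idea outside of the conversion script, here is a minimal, self-contained sketch. It is not the repository's code: the function `extract_spans` and its `flush_at_end` flag are hypothetical names made up for this example, and the input is assumed to be a plain list of BIO tags for one document. With the flag off, an entity that runs up to the last token is lost; with it on, the entity is closed at the end of the document.

```python
def extract_spans(tags, flush_at_end=True):
    """Collect entity spans from BIO tags as half-open [start, end) intervals.

    Hypothetical helper for illustration only; not part of the ASP code base.
    """
    entities = []
    start, current_type = None, None
    for idx, tag in enumerate(tags):
        if tag.startswith("I-") and start is not None:
            continue  # still inside the currently open entity
        # Any other tag ("O" or "B-...") closes the open entity at this index.
        if start is not None:
            entities.append({"type": current_type, "start": start, "end": idx})
            start = None
        if tag != "O":
            start, current_type = idx, tag[2:]
    # The end of the document must also close an entity that is still open.
    if flush_at_end and start is not None:
        entities.append({"type": current_type, "start": start, "end": len(tags)})
    return entities


tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]  # document ends inside an entity
without_flush = extract_spans(tags, flush_at_end=False)
with_flush = extract_spans(tags, flush_at_end=True)
print(len(without_flush), len(with_flush))  # prints "1 2"
```

In this sketch the final entity gets `end = len(tags)`, i.e. one past the last token, which matches the `[start, end)` convention described in the comment on the `entities` field.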

Best regards, Robin