thunlp / PL-Marker

Source code for "Packed Levitated Marker for Entity and Relation Extraction"
MIT License

Question about the Quick Start #35

Closed: Zephyr1022 closed this issue 1 year ago

Zephyr1022 commented 2 years ago

Hello, I was curious about the Quick Start section: what does "--max_mention_ori_length: 8" mean? If I run on a different dataset, should I change it based on my data? Thanks.

YeDeming commented 2 years ago

We enumerate all candidate spans whose lengths are no greater than --max_mention_ori_length. You can adjust it to fit the mention lengths of your data.
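For illustration, a minimal sketch of what this enumeration looks like (my paraphrase, not PL-Marker's exact code):

```python
# Every span of at most max_mention_ori_length tokens becomes a candidate mention.
def enumerate_spans(tokens, max_mention_ori_length=8):
    spans = []
    for start in range(len(tokens)):
        for end in range(start, min(start + max_mention_ori_length, len(tokens))):
            spans.append((start, end))  # inclusive token indices
    return spans
```

So if your dataset's longest gold mention is longer than the flag's value, it can never be recalled unless you raise the flag.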

Zephyr1022 commented 2 years ago

> We enumerate all candidate spans whose lengths are no greater than --max_mention_ori_length. You can adjust it to fit the mention lengths of your data.

Thank you so much for your reply. I have one more question: I want to extract overlapping NER. Do you have any suggestions on which model would be better to use, run_acener.py or run_ner.py?

YeDeming commented 2 years ago

run_acener.py is used for overlapping NER; run_ner.py is for non-overlapping NER.
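For context, "overlapping" (nested) NER means two gold spans can share tokens. A hypothetical example in the SciERC-style format used in this thread (my illustration, not from the docs):

```json
{"sentences": [["New", "York", "University", "students"]],
 "ner": [[[0, 2, "ORG"], [0, 1, "LOC"]]]}
```

Here "New York University" (ORG) and "New York" (LOC) share tokens 0-1; a span-classification model can label both spans independently, which is why it suits the overlapping case.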

Zephyr1022 commented 2 years ago

Thanks a lot. I used run_acener.py to train on my clinical NER data, but the result is not very good: I got high recall and very low precision. I was curious whether any of the default hyperparameters could be affecting the result?

"dev_bestf1": 0.0746615905245347, "f1": 0.07554758410783194, "f1overlap": 0.017645128284329813, "precision": 0.04264850270004909, "recall": 0.3304802662862577

YeDeming commented 2 years ago

Can you reproduce our result on the SciERC dataset on your machine?

Zephyr1022 commented 2 years ago

Yep, the SciERC dataset works well on my server. But when I apply the same code to my clinical NER data, the F1 is always below 0.3. I tried different learning rates and models, and increased the epochs to 50; the performance is very stable, around 0.3. I was curious whether I did anything wrong when preprocessing the JSON data. Here is a sample of my data:

{"doc_key": "./mimic/03.txt", "sentences": [["SOCIAL", "HISTORY", ":", "Lives", "with", "his", "caring", "and", "devoted", "parents", "at", "home", "."], ["Enjoys", "movies", "and", "computers", "."], ["No", "history", "of", "alcohol", ",", "tobacco", "or", "drug", "use", "."]], "ner": [[[3, 3, "StatusTime"], [4, 9, "TypeLiving"]], [], [[18, 19, "StatusTime"]]], "relations": [[[3, 3, 3, 3, "LivingStatus-Status"], [3, 3, 4, 9, "LivingStatus-Type"]], [], [[23, 23, 18, 19, "Tobacco-Status"], [25, 26, 18, 19, "Drug-Status"], [21, 21, 18, 19, "Alcohol-Status"]]]}

YeDeming commented 2 years ago

Did you modify the number of labels? https://github.com/thunlp/PL-Marker/blob/07fde08d868134ced1d861d17d263d6c782bb420/run_acener.py#L939-L946
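For anyone reading later: whatever is set at the linked lines has to agree with the label strings in the data's "ner" field. A hypothetical sketch of the invariant (not the repo's exact code):

```python
# num_labels must count the dataset's entity types plus one non-entity class,
# and the strings must match the labels in the JSON "ner" field exactly.
ner_labels = ['NIL', 'StatusTime', 'TypeLiving']  # types from the sample above
num_labels = len(ner_labels)                      # 3 for this schema
```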

Are the entities in this example as follows: "Lives": StatusTime?

Zephyr1022 commented 2 years ago

Yeah, I changed the labels and num_labels in the code.

The entities are as follows:

- StatusTime: "Lives"
- StatusTime: "No history"
- TypeLiving: "with his caring and devoted parents"

YeDeming commented 2 years ago

Sorry, I have no idea how to solve your problem.

Zephyr1022 commented 2 years ago

Thank you for your time and effort in helping me look at it. Something strange may be happening in the code on my server.