lbnlp / NERRE

Code and data for the publication "Structured information extraction from scientific text with large language models" by Dagdelen & Dunn et al.

Missing human annotations #2

Open mehradans92 opened 1 year ago

mehradans92 commented 1 year ago

Hi @ardunn. I am trying to use the datasets you created, and it looks like many of the human annotations are missing from data/general_results. For example, many completion keys in the jsonl files have empty list values. How did you handle these samples when evaluating model performance? Thanks!

https://github.com/lbnlp/NERRE/tree/main/general_and_mofs/data/general_results/run_0.jsonl
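For reference, here is roughly how I am reading the file (a minimal sketch; it assumes each line is a JSON record with `prompt`, `completion`, and `gpt3_completion` keys and that completions end with an `END` marker, as in the linked file):

```python
import json

# Count how many human annotations in run_0.jsonl are empty lists.
empty, total = 0, 0
with open("data/general_results/run_0.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Strip the trailing "END" marker before parsing the annotation.
        annotation = json.loads(record["completion"].replace("END", "").strip())
        total += 1
        if not annotation:  # empty list -> no entities annotated
            empty += 1

print(f"{empty} of {total} abstracts have an empty human annotation")
```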

ardunn commented 1 year ago

Hi @mehradans92

For the general materials task, we do not have extracted entries for texts that do not mention a material (the root entity around which the other entities are linked). Take, for example, this line in run_2:

{"prompt": "Electrodynamics in superconductors explained by Proca equations\nA fully consistent model to study electrodynamics for superconductors in the stationary and non-stationary regimes has been developed based on Proca equations and a massive photon. In particular, this approach has been applied to study the electric field penetration depth in superconductors. The model shows a deviation from the charge contribution to an internal electric field compared to previous approaches.\n\n###\n\n", "completion": " []\n\nEND\n\n", "gpt3_completion": " []"}

There is no clear root material entity to which other recognizable materials-science terms (e.g., "superconductors") could be linked. We do not consider "superconductors" itself a material entity because no specific material is named. This is a design choice we made quite early on in developing the General-JSON models.
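Roughly speaking (this is a sketch, not the exact evaluation code used for the paper), an empty gold completion can be treated as "nothing to extract": an empty prediction is then correct, and any predicted entry counts as a false positive.

```python
import json

def parse(raw: str):
    """Parse a completion string such as ' []\n\nEND\n\n' into a Python list."""
    return json.loads(raw.replace("END", "").strip())

def score(gold_raw: str, pred_raw: str) -> str:
    """Sketch of how an empty gold annotation can be handled when scoring."""
    gold, pred = parse(gold_raw), parse(pred_raw)
    if not gold:
        # Nothing to extract: empty prediction is correct, anything else is spurious.
        return "correct (nothing to extract)" if not pred else "false positive(s)"
    return "compare entities"  # non-empty gold: score entity by entity

# The run_2 example above: both gold and prediction are empty lists.
print(score(" []\n\nEND\n\n", " []"))  # -> correct (nothing to extract)
```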

That said, there may be errors: for example, a material clearly mentioned in the text with no corresponding root material entry in the annotation. That would be an annotation error. If you find any, please point them out and we will correct them!
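One crude way to surface candidates for review (again just a sketch, assuming the run_*.jsonl layout above) is to flag records where the model produced entities but the human completion is empty; most of these will be model false positives, but genuine annotation misses should also show up there.

```python
import json

def parse(raw: str):
    return json.loads(raw.replace("END", "").strip())

# Flag records with an empty gold annotation but a non-empty model output;
# these are worth a manual look for possibly missed annotations.
with open("data/general_results/run_0.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        if not parse(record["completion"]) and parse(record["gpt3_completion"]):
            print(f"line {i}: empty gold annotation but non-empty model output")
```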

mehradans92 commented 1 year ago

Thanks for the explanation @ardunn. I have another question, about the MOF dataset. I have noticed that MOF names and their chemical formulas are sometimes used interchangeably in the human annotations and sometimes not. Was there a reason for this?

Also, Andrew mentioned that there have been a few updates to your paper, so I was wondering if there's a way to get access to the latest version. Thanks a lot :)

ardunn commented 1 year ago

Hi @mehradans92, everything should be updated, but do you have specific examples where this is the case? I am tagging @Andrew-S-Rosen here as well.

Andrew-S-Rosen commented 1 year ago

Additional details would be helpful.

That said, it's worth emphasizing that the notion of what constitutes a "name" vs. a "formula" can be rather unclear for MOFs. For instance, take HKUST-1. That's clearly a name, and the equivalent Cu3(btc)2 is clearly a formula. But what about Cu-BTC? I would argue that's a name, but one could potentially argue it's a formula. So there is certainly room for ambiguity here, although if there are clear errors, we should address them.