Thanks for pointing this out @serenalotreck - should be a quick fix.
@caufieldjh wondering if this has been fixed? I'm running OntoGPT on a large quantity of documents and the ballooning size of the YAML is severely slowing down my ability to parse it into KGX format -- it takes several hours just to read in the YAML file.
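For context, this is roughly how I'm reading the file -- just a sketch, and the per-document conversion step is a placeholder; it assumes PyYAML with the libyaml bindings installed for the faster C loader:

```python
import yaml

# Prefer the libyaml-backed loader when available -- it is much faster than
# the pure-Python SafeLoader on very large files.
try:
    Loader = yaml.CSafeLoader
except AttributeError:
    Loader = yaml.SafeLoader

with open("output.txt", encoding="utf-8") as fh:
    # The output is a stream of appended YAML documents, so load them lazily.
    for doc in yaml.load_all(fh, Loader=Loader):
        ...  # convert each extraction to KGX nodes/edges here
```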
Not related, but the slowness of the import is compounding another problem: it seems like ChatGPT is putting non-allowed unicode characters in random places, which breaks YAML safe_load, and it takes several hours for me to locate each one by trying to read the file and waiting for it to break again. I'm currently trying to find them all preemptively and remove them before reading in the YAML file, but it seems like something that shouldn't be happening in the first place. I haven't put together a small reproducible example (I'm under a deadline), so I won't open a new issue yet, but I wondered if you'd experienced anything similar.
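In case it's useful, this is the kind of preemptive cleanup I'm attempting -- a rough sketch only, and the exact character ranges that trip safe_load are my assumption, not something I've confirmed:

```python
import re

# Characters outside the YAML printable set (assumed to be what breaks
# safe_load); tabs, newlines, and carriage returns are kept.
NON_PRINTABLE = re.compile(
    r"[^\x09\x0a\x0d\x20-\x7e\x85\xa0-\ud7ff\ue000-\ufffd]"
)

def strip_bad_chars(path_in: str, path_out: str) -> None:
    """Remove characters that break yaml.safe_load before parsing."""
    with open(path_in, encoding="utf-8", errors="replace") as fh:
        text = fh.read()
    cleaned = NON_PRINTABLE.sub("", text)
    with open(path_out, "w", encoding="utf-8") as fh:
        fh.write(cleaned)
```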
Hi @serenalotreck - going to attempt a fix for this today.
I haven't explicitly seen any issues with GPT emitting weird unicode characters, but it seems inevitable across any sufficiently large collection of extractions, and we've seen something potentially related when extracting from many PubMed entries. I'm going to consider this issue related to #323, as there should be preprocessing to handle it.
OK, please try pulling the most recent repo version and let me know if you're still seeing redundant named entities.
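If it helps, one quick way to spot-check is to count entities per document in the output stream -- a rough sketch, assuming each document carries a top-level named_entities list:

```python
import yaml

with open("output.txt", encoding="utf-8") as fh:
    for i, doc in enumerate(yaml.safe_load_all(fh)):
        entities = (doc or {}).get("named_entities", [])
        # Before the fix this count grows with every appended document;
        # after it, it should only reflect the current input doc.
        print(f"doc {i}: {len(entities)} named entities")
```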
Looks like that fixed it, thanks!
Related to the change introduced in #304.
For each new YAML output document appended to the `output.txt` file, the `extracted_object` item is correct (it only contains information from the current input doc), but the `named_entities` object is appended to from the previous document, and so accumulates entities that aren't in the input doc in question.

EDIT:

Expected behavior: the `named_entities` item should only contain entities from the current doc.

A full example: