richardpaulhudson / holmes-extractor

Information extraction from English and German texts based on predicate logic
MIT License

Issue with multiword mapping in ontology #11

Closed by jdixosnd 1 year ago

jdixosnd commented 1 year ago

I created a custom OWL file and added a mapping between 'Submit' and 'Hand over'. The mapping looks like this:

<owl:Class rdf:about="http://www.semanticweb.org/dummy/ontologies/2022/10/myontology#hand_over">
    <owl:equivalentClass rdf:resource="http://www.semanticweb.org/dummy/ontologies/2022/10/myontology#submit"/>
</owl:Class>

However, even a sentence that contains only the word 'hand' produces a match ('hand' -> 'submit', explained as "Is a synonym of SUBMIT in the ontology"), which I think is incorrect. P.S. The library is awesome!
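
To see which labels an ontology file declares as equivalent, and which of them are multiword, the file can be inspected independently of Holmes. The following is a minimal sketch using only the Python standard library. The names `equivalent_labels` and `fragment_to_label` and the inline `SAMPLE` document are illustrative assumptions, as is the idea that an underscore in the URI fragment stands for a space. It handles only the `owl:Class` / `owl:equivalentClass` pattern shown above and says nothing about how Holmes itself reads the file.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample wrapping the mapping from the comment above in an rdf:RDF root.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
    <owl:Class rdf:about="http://www.semanticweb.org/dummy/ontologies/2022/10/myontology#hand_over">
        <owl:equivalentClass rdf:resource="http://www.semanticweb.org/dummy/ontologies/2022/10/myontology#submit"/>
    </owl:Class>
</rdf:RDF>"""


def fragment_to_label(uri):
    """Turn '...#hand_over' into 'hand over' (assumption: '_' stands for a space)."""
    return uri.rsplit("#", 1)[-1].replace("_", " ")


def equivalent_labels(owl_xml):
    """Return (class label, equivalent label) pairs declared in the OWL/XML text."""
    root = ET.fromstring(owl_xml)
    pairs = []
    for cls in root.iter(f"{{{OWL}}}Class"):
        about = cls.get(f"{{{RDF}}}about")
        if about is None:
            continue
        for eq in cls.findall(f"{{{OWL}}}equivalentClass"):
            resource = eq.get(f"{{{RDF}}}resource")
            if resource is not None:
                pairs.append((fragment_to_label(about), fragment_to_label(resource)))
    return pairs


pairs = equivalent_labels(SAMPLE)
# Labels containing a space are the multiword entries worth testing separately.
multiword = [label for pair in pairs for label in pair if " " in label]
print(pairs)
print(multiword)
```

Listing the multiword labels this way makes it easier to test each one in isolation and see whether it is matched as a whole or word by word.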

richardpaulhudson commented 1 year ago

I'm glad you're finding Holmes useful. So that I can reproduce the issue, could you please post a code snippet showing what you are doing: whether you are using structural matching or topic matching, and the search phrase or search query with which this occurs?

jdixosnd commented 1 year ago

Thank you so much for your response. Here is the example I am trying to run. Ontology mappings in myontologies.owl: "periodically" is similar to "time-to-time" and "time to time".

import holmes_extractor as holmes
from holmes_extractor.lang.en.language_specific_rules import LanguageSpecificSemanticMatchingHelper
LanguageSpecificSemanticMatchingHelper.permissible_embedding_pos.append("VERB")
ontology = holmes.Ontology("myontologies.owl")
holmes_manager = holmes.Manager(model='en_core_web_trf',ontology=ontology,overall_similarity_threshold=0.4)

searchPhrases = ["ENTITYNOUN shall meet time-to-time", "ENTITYNOUN shall meet from time-to-time"]

holmes_manager.remove_all_search_phrases()
for searchPhrase in searchPhrases:
    holmes_manager.register_search_phrase(searchPhrase)

# Issues: each comment gives the actual result, then the expected result
documents = [
    # Returns []; expected "periodically" to match "time-to-time" via the ontology
    "Members shall meet periodically during production",
    # Returns a match explained as "Matches TIME directly"; expected "time to time" to match "time-to-time"
    "Members shall meet time to time during production",
    # Returns a match explained as "Matches TIME directly"; expected "time-to-time" to match as a whole
    "Members shall meet time-to-time during production",
]

for txt in documents:
    holmes_manager.remove_all_documents()
    holmes_manager.parse_and_register_document(txt)
    print(txt, holmes_manager.match())

richardpaulhudson commented 1 year ago

Unfortunately, the range of grammatical structures that can be recognized as multiwords, and so matched to multiwords in an ontology, is quite limited: essentially compound nouns like grocery store and names like Barack Obama. In these examples, time to time is not being matched to the ontology entry either in the search phrase or in the document, so the component words are matched individually. I fear there's no direct way around this, although you can always write extra search phrases to pick up such expressions without using the ontology.
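
One way to follow the suggestion of writing extra search phrases is to generate one phrase per expression from a template instead of relying on the ontology. The sketch below is plain Python and does not call Holmes. The helper `expand_search_phrases`, the `{EXPR}` placeholder and the variant list are illustrative assumptions, and whether Holmes accepts and parses each generated phrase as intended has not been verified here.

```python
def expand_search_phrases(template, variants, placeholder="{EXPR}"):
    """Return one search phrase per variant, in order, without duplicates."""
    if placeholder not in template:
        raise ValueError(f"template must contain {placeholder!r}")
    seen = set()
    phrases = []
    for variant in variants:
        # Substitute the variant and collapse any repeated whitespace.
        phrase = " ".join(template.replace(placeholder, variant).split())
        if phrase not in seen:
            seen.add(phrase)
            phrases.append(phrase)
    return phrases


# Hypothetical list of expressions that the ontology was meant to cover.
VARIANTS = ["periodically", "time to time", "time-to-time", "from time to time"]

search_phrases = expand_search_phrases("ENTITYNOUN shall meet {EXPR}", VARIANTS)
for phrase in search_phrases:
    print(phrase)

# With a configured manager, as in the snippet earlier in the thread,
# each phrase would then be registered (not run here):
#     for phrase in search_phrases:
#         holmes_manager.register_search_phrase(phrase)
```

Keeping the variants in one list means a new expression can be covered by adding a single entry rather than editing the OWL file.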