kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Bad disambiguation of term "Maroc" in French #160

Open oterrier opened 9 months ago

oterrier commented 9 months ago

I cannot find any example in which "Maroc" is disambiguated as the country (Q1028).

For example, with this query:

{
    "text": "Au Maroc, arrestation de trois membres présumés du groupe État islamique...\nTrois Marocains affiliés au groupe djihadiste État islamique ont été arrêtés hier selon la police. Ils sont soupçonnés d’avoir  assassiné un policier  dont  le corps calciné a été retrouvé début mars près de Casablanca.",
    "shortText": "",
    "termVector": [],
    "language": {
        "lang": "fr"
    },
    "entities": [],
    "mentions": [
        "wikipedia"
    ],
    "nbest": false,
    "sentence": false,
    "minSelectorScore": 0.2
}
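
For anyone who wants to reproduce this with other country names, here is a minimal sketch that rebuilds the same query programmatically. The helper name `build_query` and its parameters are hypothetical, not part of entity-fishing. The field names are copied from the query above. I am also assuming the resulting JSON string is what gets sent as the `query` form field of the disambiguation service.

```python
import json


def build_query(text, lang="fr", min_selector_score=0.2, max_term_frequency=None):
    """Build a disambiguation query dict with the same fields as the report.

    Hypothetical helper: only the field names come from the issue.
    """
    query = {
        "text": text,
        "shortText": "",
        "termVector": [],
        "language": {"lang": lang},
        "entities": [],
        "mentions": ["wikipedia"],
        "nbest": False,
        "sentence": False,
        "minSelectorScore": min_selector_score,
    }
    # maxTermFrequency is only added when explicitly requested.
    if max_term_frequency is not None:
        query["maxTermFrequency"] = max_term_frequency
    return query


# Serialize without escaping accented characters, so the French text stays readable.
payload = json.dumps(
    build_query("Au Maroc, arrestation de trois membres présumés du groupe État islamique..."),
    ensure_ascii=False,
)
print(payload)
```

Swapping the `text` argument for a sentence about another country should make it easy to check whether the problem is specific to "Maroc".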

"Maroc" is disambiguated as French protectorate in Morocco (Q907234), and sometimes as Morocco national football team (Q207337).

It is never disambiguated as Morocco (Q1028), even though that is the concept with the highest conditional probability (0.903404988057546).

I can't explain why. Any clue?

Thanks, Olivier

kairntech commented 5 months ago

Some more recent tests in French

In French, the Wikidata entities returned for country names are wrong:

Allemagne: disambiguated as Empire allemand or Équipe d'Allemagne de football
Grèce: disambiguated as Grèce antique
Roumanie: disambiguated as Royaume de Roumanie

This happens whatever value is set for maxTermFrequency.

kairntech commented 5 months ago

request

{
    "text": "Fabrication d'un violoncelle dans un atelier de lutherie à Reghin, en Roumanie, le 22 janvier 2021.",
    "shortText": "",
    "termVector": [],
    "language": {
        "lang": "fr"
    },
    "entities": [],
    "mentions": [
        "wikipedia"
    ],
    "nbest": false,
    "sentence": false,
    "minSelectorScore": 0.2,
    "maxTermFrequency": 5
}

response

{
    "software": "entity-fishing",
    "version": "0.0.6",
    "date": "2024-05-23T14:31:45.359208132Z",
    "runtime": 31,
    "nbest": false,
    "text": "Fabrication d'un violoncelle dans un atelier de lutherie à Reghin, en Roumanie, le 22 janvier 2021.",
    "language": {
        "lang": "fr",
        "conf": 1
    },
    "global_categories": [
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Instrument de musique classique",
            "page_id": 199859
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Violoncelle",
            "page_id": 986894
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Municipalité dans le județ de Mureș",
            "page_id": 11926951
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Instrument à cordes frottées",
            "page_id": 317874
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Royaume de Roumanie",
            "page_id": 8183397
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Page contenant une partition",
            "page_id": 13964105
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Lutherie",
            "page_id": 1310062
        }
    ],
    "entities": [
        {
            "rawName": "violoncelle",
            "offsetStart": 17,
            "offsetEnd": 28,
            "confidence_score": 0.551,
            "wikipediaExternalRef": 10822,
            "wikidataId": "Q8371",
            "domains": [
                "Acoustics",
                "Artisanship"
            ]
        },
        {
            "rawName": "atelier de lutherie",
            "offsetStart": 37,
            "offsetEnd": 56,
            "confidence_score": 0.4053,
            "wikipediaExternalRef": 167295,
            "wikidataId": "Q3267878"
        },
        {
            "rawName": "Reghin",
            "offsetStart": 59,
            "offsetEnd": 65,
            "confidence_score": 0.8624,
            "wikipediaExternalRef": 3813284,
            "wikidataId": "Q572478",
            "domains": [
                "Geography",
                "Architecture"
            ]
        },
        {
            "rawName": "Roumanie",
            "offsetStart": 70,
            "offsetEnd": 78,
            "confidence_score": 0.6214,
            "wikipediaExternalRef": 1387867,
            "wikidataId": "Q203493",
            "domains": [
                "Military"
            ]
        },
        {
            "rawName": "22 janvier",
            "offsetStart": 83,
            "offsetEnd": 93,
            "confidence_score": 0.8398,
            "wikipediaExternalRef": 3688,
            "wikidataId": "Q2275",
            "domains": [
                "Geology",
                "Oceanography",
                "Earth"
            ]
        }
    ]
}
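
To track this across several country names, a small check like the one below might help. It is only a sketch: `find_mismatches` and the `EXPECTED` table are hypothetical names, Q1028 comes from the original report, and the other IDs (Q183, Q41, Q218) are my own assumption of the Wikidata IDs for the modern countries.

```python
# Expected Wikidata IDs for the modern countries (assumed, except Q1028
# which is quoted in the original report).
EXPECTED = {
    "Maroc": "Q1028",
    "Allemagne": "Q183",
    "Grèce": "Q41",
    "Roumanie": "Q218",
}


def find_mismatches(response, expected):
    """Return (rawName, got, wanted) for entities whose Wikidata ID is not the expected one."""
    mismatches = []
    for entity in response.get("entities", []):
        wanted = expected.get(entity.get("rawName"))
        got = entity.get("wikidataId")
        # Entities without an expectation are ignored.
        if wanted is not None and got != wanted:
            mismatches.append((entity["rawName"], got, wanted))
    return mismatches


# Trimmed copy of the response above: only the fields the check needs.
response = {
    "entities": [
        {"rawName": "Reghin", "wikidataId": "Q572478"},
        {"rawName": "Roumanie", "wikidataId": "Q203493"},
    ]
}

print(find_mismatches(response, EXPECTED))
# → [('Roumanie', 'Q203493', 'Q218')]
```

Running the same check over the full response of each test sentence would give a quick list of the mentions that fall back to a historical or football entity.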

kermitt2 commented 5 months ago

Sorry for the late reply. This is weird indeed; I'll try to see what is happening in the disambiguation process for these countries.