semanticize / semanticizer

Entity Linking for the masses
http://semanticize.uva.nl/
GNU General Public License v3.0

linkProbability calculation is buggy #16

Closed IsaacHaze closed 10 years ago

IsaacHaze commented 10 years ago

curl http://localhost:5000/semanticize/nl -d text="Doordeweeks ochtendprogramma van publieke omroep WNL. Met aandacht voor sport, misdaad, politiek, showbussines, actuele reportages en studiogesprekken. Het programma is de opvolger van Ochtendspits van WNL." -d filter="senseProbability>1.0" > misdaad.json

gives:

{
    "links": [
        {
            "fromRedirect": false, 
            "fromTitle": false, 
            "id": 1110917, 
            "label": "misdaad-", 
            "linkProbability": 2.6153846153846154, 
            "priorProbability": 1.0, 
            "senseProbability": 2.6153846153846154, 
            "text": "misdaad", 
            "title": "Misdaadfilm", 
            "url": "http://nl.wikipedia.org/wiki/Misdaadfilm"
        }
    ], 
    "text": "Doordeweeks ochtendprogramma van publieke omroep WNL. Met aandacht voor sport, misdaad, politiek, showbussines, actuele reportages en studiogesprekken. Het programma is de opvolger van Ochtendspits van WNL."
}
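
A quick way to spot this kind of response is to check that every probability field lies in [0, 1]. The following is a minimal sketch, not part of the semanticizer itself: the field names are taken from the JSON above, while `find_invalid_probabilities` is a hypothetical helper name and the sample is trimmed to the relevant fields.

```python
import json

# Field names as they appear in the response above.
PROBABILITY_FIELDS = ("linkProbability", "priorProbability", "senseProbability")


def find_invalid_probabilities(response):
    """Return (label, field, value) for every probability outside [0, 1]."""
    problems = []
    for link in response.get("links", []):
        for field in PROBABILITY_FIELDS:
            value = link.get(field)
            if value is not None and not 0.0 <= value <= 1.0:
                problems.append((link.get("label"), field, value))
    return problems


# Trimmed copy of the response above (only the fields that matter here).
sample = json.loads("""
{"links": [{"label": "misdaad-",
            "linkProbability": 2.6153846153846154,
            "priorProbability": 1.0,
            "senseProbability": 2.6153846153846154}]}
""")

for label, field, value in find_invalid_probabilities(sample):
    print(f"{label}: {field} = {value}")
```

For the response above this flags `linkProbability` and `senseProbability` for the label `misdaad-`, while `priorProbability` (1.0) passes.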
IsaacHaze commented 10 years ago

Forgot to include the configuration file used:

# Copyright 2012-2013, University of Amsterdam. This program is free software:
# you can redistribute it and/or modify it under the terms of the GNU Lesser 
# General Public License as published by the Free Software Foundation, either 
# version 3 of the License, or (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License 
# for more details.
# 
# You should have received a copy of the GNU Lesser General Public License 
# along with this program. If not, see <http://www.gnu.org/licenses/>.

server:
  port: 5000
  host: 0.0.0.0

wpm:
  languages:
    # memory backend
    nl:
      source: wpm.wpmdata_inproc.WpmDataInProc
      initparams:
        path: /zfs/ilps-plexest/wikipediaminer/nlwiki-20130318
        language: dutch
        # translation_languages should be a list of iso 639-2 language
        # codes
        translation_languages: []
    # Redis backend
    # nl:
    #   source: wpm.wpmdata_redis.WpmDataRedis
    #   initparams:
    #     host: localhost
    #     port: 6379
  threads: 16
  bdburl: http://zookst13.science.uva.nl:8080/dutchsemcor/article

semanticize:
  max_ngram_length: 12

linkprocs:
  includefeatures: false

logging:
  verbose: true
  path: log.txt
  format: '[%(asctime)-15s][%(levelname)s][%(module)s][%(pathname)s:%(lineno)d]: %(message)s'

misc:
  tempdir: /tmp
dodijk commented 10 years ago

As we just discussed, the problem is in WPM (Wikipedia Miner). This line in the WPM data is bad:

'misdaad-,34,34,14,13,v{s{1110917,34,34,F,F}}

14 should be at least 34, and 13 should be at least 34. Note that 34/13 ≈ 2.615, which is exactly the linkProbability reported above. I looked at the code in LabelSensesStep.java and LabelOccurrencesStep.java, and there seem to be some differences in how normalization and tokenization are done.
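
To make the inconsistency concrete, here is a rough sketch that parses the line and recomputes the ratio. It rests on assumptions that are not confirmed in this thread: that the four numbers after the label are link occurrence count, link document count, text occurrence count and text document count, and that linkProbability is link document count divided by text document count. `parse_label_counts` and `check_label` are hypothetical helper names, and the naive comma split would break on labels that themselves contain commas.

```python
def parse_label_counts(line):
    """Split a label line into the label and its four leading counts.

    Assumed column order (not confirmed in this thread):
    label, link_occ, link_doc, text_occ, text_doc, senses...
    """
    head = line.strip().lstrip("'")  # drop the leading quote marker
    label, link_occ, link_doc, text_occ, text_doc = head.split(",", 5)[:5]
    return label, int(link_occ), int(link_doc), int(text_occ), int(text_doc)


def check_label(line):
    """Return (label, counts_consistent, link_probability) for one line."""
    label, link_occ, link_doc, text_occ, text_doc = parse_label_counts(line)
    # A label cannot be used as a link more often than it occurs as text.
    consistent = link_occ <= text_occ and link_doc <= text_doc
    # Assumed definition: link documents / text documents.
    link_probability = link_doc / text_doc if text_doc else 0.0
    return label, consistent, link_probability


bad_line = "'misdaad-,34,34,14,13,v{s{1110917,34,34,F,F}}"
label, consistent, probability = check_label(bad_line)
print(label, consistent, probability)
```

Under these assumptions the line is reported as inconsistent, and the recomputed ratio 34/13 agrees with the value in the semanticizer output.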

dodijk commented 10 years ago

We're not going to touch the WPM code for now. Closing it...