IsaacHaze closed this issue 10 years ago
I forgot to include the configuration file I used:
# Copyright 2012-2013, University of Amsterdam. This program is free software:
# you can redistribute it and/or modify it under the terms of the GNU Lesser
# General Public License as published by the Free Software Foundation, either
# version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
# for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
server:
  port: 5000
  host: 0.0.0.0
wpm:
  languages:
    # memory backend
    nl:
      source: wpm.wpmdata_inproc.WpmDataInProc
      initparams:
        path: /zfs/ilps-plexest/wikipediaminer/nlwiki-20130318
        language: dutch
        # translation_languages should be a list of iso 639-2 language
        # codes
        translation_languages: []
    # Redis backend
    # nl:
    #   source: wpm.wpmdata_redis.WpmDataRedis
    #   initparams:
    #     host: localhost
    #     port: 6379
  threads: 16
  bdburl: http://zookst13.science.uva.nl:8080/dutchsemcor/article
semanticize:
  max_ngram_length: 12
linkprocs:
  includefeatures: false
logging:
  verbose: true
  path: log.txt
  format: '[%(asctime)-15s][%(levelname)s][%(module)s][%(pathname)s:%(lineno)d]: %(message)s'
misc:
  tempdir: /tmp
As we just discussed, the problem is in WPM. This line is bad:
'misdaad-,34,34,14,13,v{s{1110917,34,34,F,F}}
The 14 should be more than 34, and the 13 should be more than 34. I looked at the code in LabelSensesStep.java and LabelOccurrencesStep.java, and there seem to be some differences in how normalization and tokenization are done.
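For anyone who wants to scan the label data for other lines like this, here is a minimal sketch of the check. It assumes the four numbers after the label are, in order, link occurrence count, link document count, text occurrence count and text document count; that column order is my assumption and is not confirmed in this thread. The function names are hypothetical, and the check uses "at least as large" rather than strictly "more than".

```python
def parse_label_counts(line):
    """Split a label line into its label and the four counts that follow it.

    Assumed column order (not confirmed in this thread):
    label, link_occ, link_doc, text_occ, text_doc, senses.
    """
    # Cut off the senses vector first, so its commas do not interfere.
    head, _, _senses = line.partition(",v{")
    # Split from the right, so a comma inside the label itself is preserved.
    label, *counts = head.rsplit(",", 4)
    link_occ, link_doc, text_occ, text_doc = (int(c) for c in counts)
    # Drop the leading quote character that prefixes the label.
    return label.lstrip("'"), link_occ, link_doc, text_occ, text_doc


def is_consistent(line):
    """True if the text counts are at least as large as the link counts."""
    _, link_occ, link_doc, text_occ, text_doc = parse_label_counts(line)
    return text_occ >= link_occ and text_doc >= link_doc


bad_line = "'misdaad-,34,34,14,13,v{s{1110917,34,34,F,F}}"
print(parse_label_counts(bad_line))
print(is_consistent(bad_line))
```

Running `is_consistent` over every line of the label file would show whether this label is an isolated case or part of a larger pattern.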
We're not going to touch the WPM code for now. Closing it...
curl http://localhost:5000/semanticize/nl -d text="Doordeweeks ochtendprogramma van publieke omroep WNL. Met aandacht voor sport, misdaad, politiek, showbussines, actuele reportages en studiogesprekken. Het programma is de opvolger van Ochtendspits van WNL." -d filter="senseProbability>1.0" > misdaad.json
returns: