insight-centre / naisc

Naisc - Automated Linking Tool
Apache License 2.0

OntoLex input format with configs/ontolex-default.json #5

Closed (kernc closed 3 years ago)

kernc commented 3 years ago

I am using the following OntoLex RDF/Turtle, stripped down from the example:

# /tmp/test.ttl:
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<#6023645c844272a0838f61e4> a ontolex:Word ;
    lexinfo:partOfSpeech lexinfo:commonNoun ;
    ontolex:canonicalForm [ ontolex:writtenRep "cat"@en, "mačka"@sl ] ;
    ontolex:otherForm [ ontolex:writtenRep "🐈"@en, "🐈"@sl ] ;
    ontolex:sense <#cat-n-1>, <#cat-n-2> .

<#6023645c844272a0838f61e5> a ontolex:Word ;
    lexinfo:partOfSpeech lexinfo:verb ;
    ontolex:canonicalForm [ ontolex:writtenRep "cat"@en ] ;
    ontolex:sense <#cat-v-1>, <#cat-v-2> .

<#cat-n-1>
    skos:definition "a type of animal"@en, "vrsta živali"@sl ;
    ontolex:reference <http://dbpedia.org/page/Cat> .

<#cat-n-2>
    skos:definition "an attractive woman"@en, "privlačna ženska"@sl .

<#cat-v-1>
    skos:definition "print contents of a computer file"@en .

<#cat-v-2>
    skos:definition "raise (an anchor) from the surface of the water to the cathead"@en .

Running Naisc with -c configs/auto.json, I get roughly the expected links:

% ./naisc.sh /tmp/test.ttl /tmp/test.ttl -c configs/auto.json

Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 1s
19 actionable tasks: 19 up-to-date
[INITIALIZING] Reading Configuration
[INITIALIZING] Reading left dataset
[INITIALIZING] Reading right dataset
[INITIALIZING] Loading blocking strategy
[INITIALIZING] INFO: Treating as OntoLex matching task
[INITIALIZING] Loading lenses
[INITIALIZING] INFO: Automatically configuring lenses
[INITIALIZING] INFO: Using URIs as a lens (0.6667)
[INITIALIZING] INFO: Using http://www.w3.org/2004/02/skos/core#definition as a lens (1.0000)
[INITIALIZING] Loading Feature Extractors
[INITIALIZING] Loading Scorers
[INITIALIZING] Loading Matcher
[BLOCKING] Blocking
[BLOCKING] Loading Graph Extractors
[INITIALIZING] INFO: Automatically configuring graph features
[INITIALIZING] INFO: Using the following properties as values matches: 
http://www.w3.org/ns/lemon/ontolex#sense <-> http://www.w3.org/ns/lemon/ontolex#sense
http://www.w3.org/2004/02/skos/core#definition <-> http://www.w3.org/2004/02/skos/core#definition
http://www.w3.org/ns/lemon/ontolex#reference <-> http://www.w3.org/ns/lemon/ontolex#reference
http://www.lexinfo.net/ontology/3.0/lexinfo#partOfSpeech <-> http://www.lexinfo.net/ontology/3.0/lexinfo#partOfSpeech

[SCORING] Scoring
[SCORING] Scored 20 pairs
[MATCHING] Matching
[MATCHING] Predicted 4/4 alignments (0 non-finite, probability=2.5106)
[FINALIZING] Saving
<file:///tmp/test.ttl#cat-v-2> <http://www.w3.org/2004/02/skos/core#exactMatch> <file:///tmp/test.ttl#cat-v-2> . # 0.7778
<file:///tmp/test.ttl#cat-v-1> <http://www.w3.org/2004/02/skos/core#exactMatch> <file:///tmp/test.ttl#cat-v-1> . # 0.8333
<file:///tmp/test.ttl#cat-n-2> <http://www.w3.org/2004/02/skos/core#exactMatch> <file:///tmp/test.ttl#cat-n-2> . # 1.0000
<file:///tmp/test.ttl#cat-n-1> <http://www.w3.org/2004/02/skos/core#exactMatch> <file:///tmp/test.ttl#cat-n-1> . # 0.8889
[COMPLETED] Done
./naisc.sh /tmp/test.ttl /tmp/test.ttl -c configs/auto.json  20.44s user  3.43s system  127% cpu  1683M mem  18.733s total

However, running with -c configs/ontolex-default.json, it crashes for me as below:

% ./naisc.sh /tmp/test.ttl /tmp/test.ttl -c configs/ontolex-default.json

Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 1s
19 actionable tasks: 19 up-to-date
[INITIALIZING] Reading Configuration
[INITIALIZING] Reading left dataset
[INITIALIZING] Reading right dataset
[INITIALIZING] Loading blocking strategy
[INITIALIZING] Loading lenses
java.lang.IllegalArgumentException: Cannot deserialize instance of `java.lang.String` out of START_ARRAY token
 at [Source: UNKNOWN; line: -1, column: -1] (through reference chain: org.insightcentre.uld.naisc.lens.Label$Configuration["property"])
    at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:3916)
    at com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:3847)
    at org.insightcentre.uld.naisc.lens.Label.makeLens(Label.java:35)
    at org.insightcentre.uld.naisc.main.Configuration.makeLenses(Configuration.java:193)
    at org.insightcentre.uld.naisc.main.Main.execute(Main.java:281)
    at org.insightcentre.uld.naisc.main.Main.execute2(Main.java:188)
    at org.insightcentre.uld.naisc.main.Main.execute(Main.java:150)
    at org.insightcentre.uld.naisc.main.Main.execute(Main.java:108)
    at org.insightcentre.uld.naisc.main.Main.execute(Main.java:84)
    at org.insightcentre.uld.naisc.main.Main.main(Main.java:584)
Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `java.lang.String` out of START_ARRAY token
 at [Source: UNKNOWN; line: -1, column: -1] (through reference chain: org.insightcentre.uld.naisc.lens.Label$Configuration["property"])
    at com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:63)
    at com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1364)
    at com.fasterxml.jackson.databind.DeserializationContext.handleUnexpectedToken(DeserializationContext.java:1140)
    at com.fasterxml.jackson.databind.deser.std.StdDeserializer._deserializeFromArray(StdDeserializer.java:678)
    at com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:40)
    at com.fasterxml.jackson.databind.deser.std.StringDeserializer.deserialize(StringDeserializer.java:10)
    at com.fasterxml.jackson.databind.deser.impl.FieldProperty.deserializeAndSet(FieldProperty.java:138)
    at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:288)
    at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:151)
    at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:3911)
    ... 9 more
[FAILED] java.lang.IllegalArgumentException: Cannot deserialize instance of `java.lang.String` out of START_ARRAY token
 at [Source: UNKNOWN; line: -1, column: -1] (through reference chain: org.insightcentre.uld.naisc.lens.Label$Configuration["property"])
[FAILED] CRITICAL: The process failed due to an exception: java.lang.IllegalArgumentException: Cannot deserialize instance of `java.lang.String` out of START_ARRAY token
 at [Source: UNKNOWN; line: -1, column: -1] (through reference chain: org.insightcentre.uld.naisc.lens.Label$Configuration["property"])
[COMPLETED] Done

Would you happen to know what the issue is and how I might work around it?

jmccrae commented 3 years ago

Typo in the configuration file. Fixed by https://github.com/insight-centre/naisc/commit/c3ed081a391f8e7c82eaed76c10bd95011788dea
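
For readers hitting the same trace: the exception says that the "property" field of a Label lens received a JSON array where a single string was expected. The following is a minimal sketch for spotting that shape in a config file before running Naisc. It assumes the config has a top-level "lenses" array whose entries carry "name" and "property" keys; that layout, the sample fragment, and the helper name find_array_properties are illustrative assumptions, since the fixing commit is not quoted in this thread.

```python
import json


def find_array_properties(config):
    """Return (index, lens name) for each lens whose "property" is a list.

    Assumes a top-level "lenses" array (hypothetical layout).
    """
    problems = []
    for i, lens in enumerate(config.get("lenses", [])):
        if isinstance(lens, dict) and isinstance(lens.get("property"), list):
            problems.append((i, lens.get("name")))
    return problems


# Hypothetical fragment: the first lens has an array-valued "property",
# the second has the string form that the trace says is expected.
sample = json.loads("""
{"lenses": [
  {"name": "lens.Label",
   "property": ["http://www.w3.org/2000/01/rdf-schema#label"]},
  {"name": "lens.Label",
   "property": "http://www.w3.org/2004/02/skos/core#definition"}
]}
""")

print(find_array_properties(sample))  # [(0, 'lens.Label')]
```

To check a real file, load it with json.load(open(path)) and pass the result to the same function.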

kernc commented 3 years ago

Right. But now, running:

./naisc.sh /tmp/test.ttl /tmp/test.ttl -c configs/ontolex-default.json

I get:

[INITIALIZING] Loading Scorers
org.insightcentre.uld.naisc.main.ConfigurationException: Model file does not exist. (Perhaps you need to train this model?)
    at org.insightcentre.uld.naisc.scorer.LibSVM.makeScorer(LibSVM.java:58)
    at org.insightcentre.uld.naisc.main.Configuration.makeScorer(Configuration.java:206)
    at org.insightcentre.uld.naisc.main.Main.execute(Main.java:287)
    at org.insightcentre.uld.naisc.main.Main.execute2(Main.java:188)
    at org.insightcentre.uld.naisc.main.Main.execute(Main.java:150)
    at org.insightcentre.uld.naisc.main.Main.execute(Main.java:108)
    at org.insightcentre.uld.naisc.main.Main.execute(Main.java:84)
    at org.insightcentre.uld.naisc.main.Main.main(Main.java:584)
[FAILED] org.insightcentre.uld.naisc.main.ConfigurationException: Model file does not exist. (Perhaps you need to train this model?)

I don't think models/default.libsvm, which ontolex-default.json references, is fetched by get-models.sh. I'm not supposed to train the model myself, am I? Can I swap it for a different model that is available and works?
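
One way to see in advance which model files a config points at, and whether they are present, is sketched below. It walks the parsed JSON and reports every string value ending in ".libsvm" that does not exist on disk. The suffix heuristic, the "scorers" and "modelFile" keys in the sample, and the function names are assumptions made for illustration, not the documented Naisc schema.

```python
import os


def referenced_models(node, suffix=".libsvm"):
    """Yield every string value in the parsed JSON that ends with suffix."""
    if isinstance(node, dict):
        for value in node.values():
            yield from referenced_models(value, suffix)
    elif isinstance(node, list):
        for value in node:
            yield from referenced_models(value, suffix)
    elif isinstance(node, str) and node.endswith(suffix):
        yield node


def missing_models(config, base_dir="."):
    """Return the referenced model paths that do not exist under base_dir."""
    return [
        path
        for path in referenced_models(config)
        if not os.path.exists(os.path.join(base_dir, path))
    ]


# Hypothetical config fragment; the key names are assumed.
sample = {
    "scorers": [
        {"name": "scorer.LibSVM", "modelFile": "models/default.libsvm"}
    ]
}

# base_dir would normally be the Naisc checkout directory.
print(missing_models(sample, base_dir="."))
```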

jmccrae commented 3 years ago

Yeah, that model does need to be trained. There are two solutions here:

  1. Train the model with train.sh (better performance).
  2. Change the scorer to an unsupervised model, e.g., replace "scorer.LibSVM" with "scorer.RAdLR" (worse performance).

I guess for deployment we will provide a trained model, but you can work with the unsupervised model for now.
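
The second option can be sketched as a small script that rewrites the scorer name in a copy of the config. To avoid guessing the file layout, it replaces the string "scorer.LibSVM" wherever it occurs as a JSON value. The sample fragment, its key names, and the function name swap_scorer are hypothetical; any LibSVM-specific settings left in the config (such as a model file path) are not touched and may need removing by hand.

```python
import json


def swap_scorer(node, old="scorer.LibSVM", new="scorer.RAdLR"):
    """Return a copy of the parsed JSON with every value equal to old replaced by new."""
    if isinstance(node, dict):
        return {key: swap_scorer(value, old, new) for key, value in node.items()}
    if isinstance(node, list):
        return [swap_scorer(value, old, new) for value in node]
    return new if node == old else node


# Hypothetical config fragment; the key names are assumed.
sample = {
    "scorers": [
        {"name": "scorer.LibSVM", "modelFile": "models/default.libsvm"}
    ]
}

patched = swap_scorer(sample)
print(json.dumps(patched, indent=2))
```

Writing the patched structure to a new file, rather than overwriting configs/ontolex-default.json, keeps the original available for when a trained model is provided.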