Open blcham opened 2 years ago
Algorithm for stage 2 (I would start without the custom rule first):
input: vocabulary IRI + HTML document with a table + custom extraction rules
Module extract-term-occurences-module:
Example of the module's inputs and outputs:
input RDF:
:x a csvw:Row ;
:csat-wo-tc "4339272" ;
:tc-reference "52-610-00-04" ;
:wo-text "<span about="_:a970-5" property="ddo:je-výskytem-termu" resource="http://onto.fel.cvut.cz/ontologies/slovnik/slovnik-komponent-a-zavad---novy/pojem/missing-part" typeof="ddo:výskyt-termu" score="0.5">finding</span> (nrc) taskcard 531900-03-1 (1.0) / item 1 <br> there were <span about="_:a970-6" property="ddo:je-výskytem-termu" resource="http://onto.fel.cvut.cz/ontologies/slovnik/slovnik-komponent-a-zavad---novy/pojem/missing-part" typeof="ddo:výskyt-termu" score="0.5">found</span> <span id="id33cs" about="_:33cs" property="http://onto.fel.cvut.cz/ontologies/application/termit/pojem/je-výskytem-termu" resource="http://onto.fel.cvut.cz/ontologies/slovnik/slovnik-komponent-a-zavad---novy/pojem/broken-part" typeof="http://onto.fel.cvut.cz/ontologies/application/termit/pojem/výskyt-termu" class="assigned-term-occurrence selected-occurrence">broken</span> <span id="idc342-14" about="_:c342-14" property="http://onto.fel.cvut.cz/ontologies/application/termit/pojem/je-výskytem-termu" resource="http://onto.fel.cvut.cz/ontologies/slovnik/slovnik-komponent-a-zavad---novy/pojem/drain-valve" typeof="http://onto.fel.cvut.cz/ontologies/application/termit/pojem/výskyt-termu" class="assigned-term-occurrence selected-occurrence">drain valves</span>" ;
output RDF (see the RDF4J repository of TermIt for the exact form):
_:a970-5 a ddo:výskyt-termu ;
:score "0.5"^^xsd:decimal ;
ddo:je-výskytem-termu <http://onto.fel.cvut.cz/ontologies/slovnik/slovnik-komponent-a-zavad---novy/pojem/missing-part> ;
:references-annotation "<span about="_:a970-5" property="ddo:je-výskytem-termu" resource="http://onto.fel.cvut.cz/ontologies/slovnik/slovnik-komponent-a-zavad---novy/pojem/missing-part" typeof="ddo:výskyt-termu" score="0.5">finding</span> (nrc) taskcard 531900-03-1 (1.0) / item 1 <br> there were <span about="_:a970-6" property="ddo:je-výskytem-termu" resource="http://onto.fel.cvut.cz/ontologies/slovnik/slovnik-komponent-a-zavad---novy/pojem/missing-part" typeof="ddo:výskyt-termu" score="0.5">found</span> <span id="id33cs" about="_:33cs" property="http://onto.fel.cvut.cz/ontologies/application/termit/pojem/je-výskytem-termu" resource="http://onto.fel.cvut.cz/ontologies/slovnik/slovnik-komponent-a-zavad---novy/pojem/broken-part" typeof="http://onto.fel.cvut.cz/ontologies/application/termit/pojem/výskyt-termu" class="assigned-term-occurrence selected-occurrence">broken</span> <span id="idc342-14" about="_:c342-14" property="http://onto.fel.cvut.cz/ontologies/application/termit/pojem/je-výskytem-termu" resource="http://onto.fel.cvut.cz/ontologies/slovnik/slovnik-komponent-a-zavad---novy/pojem/drain-valve" typeof="http://onto.fel.cvut.cz/ontologies/application/termit/pojem/výskyt-termu" class="assigned-term-occurrence selected-occurrence">drain valves</span>" ;
:references-text "finding (nrc) taskcard 531900-03-1 (1.0) / item 1 there were found broken drain valves" ;
:annotation-in-text-start "0"^^xsd:integer ;
:annotation-in-text-end "7"^^xsd:integer ;
.
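A minimal sketch of how the module could derive the plain text and the start/end offsets from the annotated HTML, using only the Python standard library. The names `OccurrenceParser`, `extract_occurrences`, and `SAMPLE` are hypothetical, and the whitespace handling (collapsing runs of whitespace, treating `<br>` as no text) is an assumption chosen so that the offsets match the example above; the actual TermIt behaviour may differ.

```python
import re
from html.parser import HTMLParser


class OccurrenceParser(HTMLParser):
    """Accumulates plain text and records the span of each annotated <span>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = ""         # whitespace-normalized plain text so far
        self.occurrences = []  # finished occurrences, in closing order
        self._open = []        # stack of open <span>s (None if not annotated)

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        a = dict(attrs)
        if "about" in a and "resource" in a:
            self._open.append({
                "id": a["about"],
                "term": a["resource"],
                "score": a.get("score"),
                "start": len(self.text),
            })
        else:
            self._open.append(None)

    def handle_endtag(self, tag):
        if tag == "span" and self._open:
            occ = self._open.pop()
            if occ is not None:
                occ["end"] = len(self.text)
                self.occurrences.append(occ)

    def handle_data(self, data):
        # Assumption: collapse whitespace so offsets refer to normalized text.
        chunk = re.sub(r"\s+", " ", data)
        if not self.text or self.text.endswith(" "):
            chunk = chunk.lstrip(" ")
        self.text += chunk


def extract_occurrences(html):
    """Return (plain_text, occurrences) for one annotated HTML fragment."""
    parser = OccurrenceParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text, parser.occurrences


MISSING_PART = ("http://onto.fel.cvut.cz/ontologies/slovnik/"
                "slovnik-komponent-a-zavad---novy/pojem/missing-part")
SAMPLE = (
    f'<span about="_:a970-5" property="ddo:je-výskytem-termu" '
    f'resource="{MISSING_PART}" typeof="ddo:výskyt-termu" score="0.5">'
    f'finding</span> (nrc) taskcard 531900-03-1 (1.0) / item 1 <br> there were '
    f'<span about="_:a970-6" property="ddo:je-výskytem-termu" '
    f'resource="{MISSING_PART}" typeof="ddo:výskyt-termu" score="0.5">found</span>'
)

if __name__ == "__main__":
    text, occurrences = extract_occurrences(SAMPLE)
    for occ in occurrences:
        print(occ["id"], occ["start"], occ["end"], text[occ["start"]:occ["end"]])
```

Each returned dictionary corresponds to one output resource: `id` is the blank node, `term` the value of ddo:je-výskytem-termu, and `start`/`end` the values of :annotation-in-text-start and :annotation-in-text-end.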
Custom rule: when multiple components are found and ?x part-of ?y holds, select ?x.
Instance of the rule: if "security seal" part-of "first aid kit", then pick "security seal".
You can find the relevant detail on line 8 of 2020-02-18-termit-export-full-dataset.
The rule should be applied in applyConstruct:
CONSTRUCT ...
WHERE {
?x :part-of ?y .
?x a :Component .
?y a :Component .
}
Rule pattern: ?x :part-of ?y where ?x a :Component, ?y a :Component .
Rule instance (ideally, this should be loaded from a TTL file in the Git repository): :security-seal :part-of :first-aid-kit .
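To illustrate the intended effect of the custom rule outside of SPARQL, here is a small sketch. It assumes one reading of the rule: when both a part and its whole are found in the same record, the whole is dropped and the part is kept. The function name `apply_part_of_rule` and the plain-string component identifiers are hypothetical.

```python
def apply_part_of_rule(found, part_of):
    """Select components according to the part-of rule.

    found   -- iterable of component identifiers matched in one record
    part_of -- iterable of (part, whole) pairs, i.e. the rule instances
    """
    found = set(found)
    # Assumption: the rule only applies when several components were matched.
    if len(found) < 2:
        return found
    # Drop every whole whose part was matched as well, keeping the part.
    wholes_to_drop = {
        whole for part, whole in part_of
        if part in found and whole in found
    }
    return found - wholes_to_drop


RULES = {("security-seal", "first-aid-kit")}

if __name__ == "__main__":
    print(apply_part_of_rule({"security-seal", "first-aid-kit"}, RULES))
    print(apply_part_of_rule({"first-aid-kit"}, RULES))
```

With the rule instance above, a record that mentions both "security seal" and "first aid kit" yields only "security seal", while a record that mentions only "first aid kit" is left unchanged.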
Related resources in the termit-csat repository at https://graphdb.onto.fel.cvut.cz/ :
http://onto.fel.cvut.cz/ontologies/application/termit/pojem/cíl-souborového-výskytu/instance416404681
http://onto.fel.cvut.cz/ontologies/application/termit/pojem/má-selektor
http://onto.fel.cvut.cz/ontologies/application/termit/pojem/selektor-pozici-v-textu/instance625469017
http://onto.fel.cvut.cz/ontologies/application/termit/pojem/má-startovní-pozici
http://onto.fel.cvut.cz/ontologies/application/termit/pojem/má-koncovou-pozici
Related query:
select * where {
#<http://onto.fel.cvut.cz/ontologies/application/termit/pojem/selektor-text-quote/instance-1002686696>
?s ?p ?stq .
?stq <http://onto.fel.cvut.cz/ontologies/application/termit/pojem/má-přesný-text-quote> ?tQuote .
filter(contains(str(?tQuote), "finding"))
} limit 100
The goal is to create scripts to extract HTML files from TermIt and process them to produce statistics.
We have the following artifacts:
Our goal is to create statistics such as this:
Terminology:
Additional notes:
Full task view:
A/C = implementing stage 1: