kbss-cvut / aircraft-maintenance-planning-system

D2020+ project about Aircraft Maintenance Planning System.
GNU Lesser General Public License v3.0

Automatize extraction of text analysis on specific dataset (stage 1) #103

Open blcham opened 2 years ago

blcham commented 2 years ago

The goal is to create scripts to extract HTML files from TermIt and process them to produce statistics.

We have the following artifacts:

Our goal is to create statistics such as this:

Terminology:

Additional notes:

Full task view:

A/C = implementing stage 1:

blcham commented 2 years ago

See https://github.com/kbss-cvut/aircraft-maintenance-planning-system/blob/main/aircraft-maintenance-planning-model/data/termit/termit-nlp.sms.ttl

blcham commented 1 year ago

Algorithm for stage 2 (I would start without custom rules first):

input: vocabulary IRI + html document with table + custom extraction rules

  1. convert the HTML using the tabular module (the output is an RDF representation of the table)
  2. take every cell value (literal) that has term occurrences and annotate the cell value with the occurrences (create extract-term-occurences-module)
  3. apply the algorithm to select appropriate terms (terms with the highest score + rules) (applyConstruct)
  4. create a different RDF representation of the table that is compliant with this output
  5. convert the RDF to CSV
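
To make steps 2, 3 and 5 more concrete, here is a minimal sketch in plain Python. It is not the SPipes implementation: `TermOccurrence`, `annotate_cells`, `select_term`, `to_csv` and `toy_lookup` are hypothetical names, and the substring lookup only stands in for the real term-occurrence data from TermIt. It assumes each occurrence carries a numeric score and that, without custom rules, the highest score wins.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class TermOccurrence:
    """A term found in a cell value (all field names are hypothetical)."""
    term_iri: str
    start: int    # offset of the first matched character
    end: int      # offset just past the last matched character
    score: float


def toy_lookup(text):
    """Stand-in for the real annotation data: substring match on a tiny vocabulary."""
    vocabulary = {
        "first aid kit": ("ex:first-aid-kit", 0.9),
        "seal": ("ex:seal", 0.5),
    }
    found = []
    for label, (iri, score) in vocabulary.items():
        pos = text.lower().find(label)
        if pos >= 0:
            found.append(TermOccurrence(iri, pos, pos + len(label), score))
    return found


def annotate_cells(cells, find_occurrences):
    """Step 2: pair every cell literal with the occurrences found in it."""
    return [(row, col, text, find_occurrences(text)) for row, col, text in cells]


def select_term(occurrences):
    """Step 3 (no custom rules): keep the occurrence with the highest score."""
    return max(occurrences, key=lambda o: o.score, default=None)


def to_csv(annotated):
    """Step 5: write one CSV row per cell with the selected term IRI."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["row", "column", "text", "term"])
    for row, col, text, occurrences in annotated:
        best = select_term(occurrences)
        writer.writerow([row, col, text, best.term_iri if best else ""])
    return buf.getvalue()


cells = [(1, "description", "First aid kit seal broken")]
print(to_csv(annotate_cells(cells, toy_lookup)))
```

Cells without any occurrence end up with an empty term column, which keeps the CSV rectangular.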

Module extract-term-occurences-module:

Example of the module's inputs and outputs:


Custom rule: if ?x part-of ?y then select ?x if there are multiple components

Instance of the rule: if "security seal" part-of "first aid kit" then pick "security seal"

You can find the relevant detail in the 8th line of 2020-02-18-termit-export-full-dataset

Application of the rule should be done in applyConstruct:

CONSTRUCT ...
WHERE {
    ?x :part-of ?y  .
    ?x a :Component .
    ?y a :Component .
}

Rule pattern: ?x :part-of ?y where ?x a :Component, ?y a :Component .

Rule instance (ideally this should be loaded from a TTL file within the Git repository): :security-seal :part-of :first-aid-kit .
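
The effect of the rule can be illustrated outside SPARQL with a small Python sketch. This is only my reading of the rule, not the applyConstruct query itself: `apply_part_of_rule` is a hypothetical helper, candidates are plain strings instead of IRIs, and "select ?x" is interpreted as dropping the whole (?y) whenever one of its parts (?x) is also among the candidates.

```python
def apply_part_of_rule(candidates, part_of):
    """Drop every candidate ?y for which another candidate ?x satisfies ?x part-of ?y.

    candidates: component identifiers found in one cell
    part_of:    set of (part, whole) pairs, i.e. the rule instances
    """
    present = set(candidates)
    # Wholes that have at least one of their parts among the candidates
    wholes = {
        whole
        for part, whole in part_of
        if part in present and whole in present and part != whole
    }
    # Keep the original order of the remaining candidates
    return [c for c in candidates if c not in wholes]


rules = {("security-seal", "first-aid-kit")}
print(apply_part_of_rule(["first-aid-kit", "security-seal"], rules))
print(apply_part_of_rule(["first-aid-kit"], rules))
```

With both components present only the part is kept; a single component is returned unchanged because the rule only applies when there are multiple components.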


Related resources in the termit-csat repository at https://graphdb.onto.fel.cvut.cz/ :

http://onto.fel.cvut.cz/ontologies/application/termit/pojem/cíl-souborového-výskytu/instance416404681
http://onto.fel.cvut.cz/ontologies/application/termit/pojem/má-selektor
http://onto.fel.cvut.cz/ontologies/application/termit/pojem/selektor-pozici-v-textu/instance625469017
http://onto.fel.cvut.cz/ontologies/application/termit/pojem/má-startovní-pozici
http://onto.fel.cvut.cz/ontologies/application/termit/pojem/má-koncovou-pozici
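
The Czech local names above read roughly as "file occurrence target", "has selector", "text position selector", "has start position" and "has end position". A minimal sketch of how such a selector could be resolved to the matched text follows. It assumes zero-based offsets with an exclusive end, as in the W3C Web Annotation TextPositionSelector. Whether the TermIt data follows exactly that convention is an assumption, not something verified against the repository. `TextPositionSelector` and `selected_text` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPositionSelector:
    """Hypothetical in-memory form of a text position selector."""
    start: int  # value of "has start position" (assumed zero-based)
    end: int    # value of "has end position" (assumed exclusive)


def selected_text(document_text, selector):
    """Return the part of the document text that the selector points to."""
    if not 0 <= selector.start <= selector.end <= len(document_text):
        raise ValueError("selector is outside the document text")
    return document_text[selector.start:selector.end]


print(selected_text("Missing security seal", TextPositionSelector(8, 21)))
```

If the stored offsets turn out to be inclusive or one-based, only the slice in `selected_text` needs to change.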

Related query:

SELECT * WHERE {
    # <http://onto.fel.cvut.cz/ontologies/application/termit/pojem/selektor-text-quote/instance-1002686696>
    ?s ?p ?stq .
    ?stq <http://onto.fel.cvut.cz/ontologies/application/termit/pojem/má-přesný-text-quote> ?tQuote .
    FILTER(CONTAINS(STR(?tQuote), "finding"))
} LIMIT 100