own-pt / openWordnet-PT

OpenWordnet-PT: an open access wordnet for Portuguese
http://openwordnet-pt.org

satellite adjectives with duplicate lexical ids #154

Closed odanoburu closed 3 years ago

odanoburu commented 5 years ago

Satellite adjectives all come from the same lexicographer file and are differentiated by the head words of their clusters, not by their lexical ids. Still, I don't see why they shouldn't have sequential lexical ids: at the moment they all have the same lexical id, 0 (this is inherited from PWN 3.0).
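To make the head-word point concrete, here is a minimal sketch of how a WordNet sense key is laid out (`lemma%ss_type:lex_filenum:lex_id:head_word:head_id`, where `ss_type` 5 marks an adjective satellite). The `sense_key` helper and the example lemmas and head words are made up for illustration and are not taken from PWN or from this repository.

```python
from typing import Optional

# ss_type code used in WordNet sense keys; 5 marks an adjective satellite.
ADJECTIVE_SATELLITE = 5


def sense_key(lemma: str, ss_type: int, lex_filenum: int, lex_id: int,
              head_word: str = "", head_id: Optional[int] = None) -> str:
    """Build a sense key: lemma%ss_type:lex_filenum:lex_id:head_word:head_id."""
    # head_word and head_id are only filled in for adjective satellites.
    head = f"{head_id:02d}" if head_id is not None else ""
    return f"{lemma}%{ss_type}:{lex_filenum:02d}:{lex_id:02d}:{head_word}:{head}"


# Made-up values for illustration only: one lemma in two satellite synsets
# that belong to clusters with different head words, both with lexical id 0.
key_a = sense_key("bright", ADJECTIVE_SATELLITE, 0, 0, "light", 0)
key_b = sense_key("bright", ADJECTIVE_SATELLITE, 0, 0, "intelligent", 0)

print(key_a)  # bright%5:00:00:light:00
print(key_b)  # bright%5:00:00:intelligent:00
```

Both keys carry lexical id `00`; only the head word keeps them apart, which is the situation described above.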

This query highlights the issue by listing pairs of distinct satellite word senses that share a lexical form:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wn30: <https://w3id.org/own-pt/wn30/schema/>

select ?ws1 ?ws2 ?lf
where {
  ?ss1 rdf:type wn30:AdjectiveSatelliteSynset .
  ?ss2 rdf:type wn30:AdjectiveSatelliteSynset .
  ?ss1 wn30:containsWordSense ?ws1 .
  ?ss2 wn30:containsWordSense ?ws2 .
  ?ws1 wn30:word ?w1 .
  ?ws2 wn30:word ?w2 .
  ?w1 wn30:lexicalForm ?lf .
  ?w2 wn30:lexicalForm ?lf .
  filter (?ws1 != ?ws2)
}

hmuniz commented 5 years ago

The following script moves all satellite adjectives to adjs.all and fixes the lexical ids.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS
from wn2text import sort_synsets  # wn2text comes from https://github.com/own-pt/mill
import click

@click.command()
@click.argument('rdf_file', type=click.Path(exists=True, dir_okay=False, resolve_path=True),
                required=True)
@click.argument('rdf_final', required=True)
@click.option('-f', '--rdf-file-format', 'rdf_file_format', type=click.STRING, default='nt', show_default=True,
              help="Type of RDF input file. Must be accepted by RDFlib.")
def main(rdf_file, rdf_file_format, rdf_final):
    wn30 = Namespace("https://w3id.org/own-pt/wn30/schema/")
    graph = Graph()
    graph.parse(rdf_file, format=rdf_file_format)

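    # Move every satellite adjective synset from adj.all to a new
    # lexicographer file, adjs.all.
    graph.update("""
    DELETE { ?s wn30:lexicographerFile "adj.all" }
    INSERT { ?s wn30:lexicographerFile "adjs.all" }
    WHERE {
      ?s a wn30:AdjectiveSatelliteSynset .
    }
    """, initNs={'wn30': wn30})

    # Fix the lexical ids: word senses in adjs.all that share a label get
    # sequential ids (0, 1, 2, ...) in the order produced by sort_synsets.
    synsets = graph.subjects(wn30['lexicographerFile'], Literal("adjs.all"))
    aux = {}  # label -> last lexical id assigned to that label
    for synset, sorted_word_senses in sort_synsets(graph, synsets):
        for ws in sorted_word_senses:
            label = graph.value(ws, RDFS.label)
            new_lexid = aux[label] = aux.get(label, -1) + 1
            # Replace whatever lexical id the word sense had before.
            graph.remove((ws, wn30.lexicalId, None))
            graph.add((ws, wn30.lexicalId, Literal(str(new_lexid))))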

    graph.serialize(rdf_final, format="nt")

if __name__ == '__main__':
    main()

Running:

 python3 fix_adj.py wordnet-en-update.nt wordnet-update.nt
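The renumbering step of the script can also be sketched without rdflib or wn2text, which may help when checking its logic in isolation. This is only an illustration under assumptions: `assign_sequential_lexids` is a hypothetical helper, the word-sense identifiers and labels are made up, and the input is assumed to arrive already in the order that `sort_synsets` would produce.

```python
from typing import Dict, Hashable, Iterable, Tuple


def assign_sequential_lexids(
    senses: Iterable[Tuple[Hashable, str]],
) -> Dict[Hashable, int]:
    """Map each word sense to a lexical id that is unique for its label.

    `senses` yields (sense_id, label) pairs, already in the desired order.
    """
    last_id: Dict[str, int] = {}  # label -> last lexical id handed out
    result: Dict[Hashable, int] = {}
    for sense_id, label in senses:
        # The first sense with a given label gets 0, the next one 1, and so on.
        last_id[label] = last_id.get(label, -1) + 1
        result[sense_id] = last_id[label]
    return result


# Hypothetical word-sense identifiers and labels, for illustration only.
example = [("ws-1", "bright"), ("ws-2", "sharp"), ("ws-3", "bright")]
lexids = assign_sequential_lexids(example)
print(lexids)  # {'ws-1': 0, 'ws-2': 0, 'ws-3': 1}
```

The ids depend on the input order, so the ordering step matters for getting reproducible results across runs.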

arademaker commented 3 years ago

@FredsoNerd what we need to confirm is whether anything was lost from the PWN data model. There are the sense keys, and there are the head and satellite adjectives. For a few examples, we can look at how they appear in the DBFiles and how they appear in the RDF. If we confirm that nothing was lost, it seems to me that this issue is relevant only to the MILL project, which we may eventually consider returning to in the future.
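One possible way to run such a check, sketched under assumptions: if sense keys can be listed from both the DBFiles and the RDF, the two lists can be compared as sets and the RDF side checked for duplicates. How the keys are extracted from each source is not shown here; `compare_sense_keys` is a hypothetical helper and the keys below are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def compare_sense_keys(
    dbfile_keys: Iterable[str],
    rdf_keys: Iterable[str],
) -> Tuple[Set[str], Set[str], List[str]]:
    """Return (missing_in_rdf, extra_in_rdf, duplicated_in_rdf)."""
    db_list, rdf_list = list(dbfile_keys), list(rdf_keys)
    missing = set(db_list) - set(rdf_list)   # in the DBFiles but not in the RDF
    extra = set(rdf_list) - set(db_list)     # in the RDF but not in the DBFiles
    duplicated = sorted(k for k, n in Counter(rdf_list).items() if n > 1)
    return missing, extra, duplicated


# Made-up sense keys, for illustration only.
from_dbfiles = ["bright%5:00:00:light:00", "bright%5:00:00:intelligent:00"]
from_rdf = ["bright%5:00:00:light:00", "bright%5:00:00:light:00"]

missing, extra, duplicated = compare_sense_keys(from_dbfiles, from_rdf)
print(missing)     # {'bright%5:00:00:intelligent:00'}
print(extra)       # set()
print(duplicated)  # ['bright%5:00:00:light:00']
```

An empty result on all three counts would suggest that no sense key was lost or collapsed in the conversion, at least for the sample that was compared.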

arademaker commented 3 years ago

Let's close this issue for now. After evaluating it a few times, we concluded that this would be a suggested change relative to the model adopted by PWN, though one we can re-evaluate in the future.