Closed odanoburu closed 3 years ago
The following script moves all satellite adjectives to adjs.all
and fixes the lexical ids.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS
from wn2text import sort_synsets  # wn2text is defined at https://github.com/own-pt/mill
import click


@click.command()
@click.argument('rdf_file', type=click.Path(exists=True, dir_okay=False, resolve_path=True),
                required=True)
@click.argument('rdf_final', required=True)
@click.option('-f', '--rdf-file-format', 'rdf_file_format', type=click.STRING, default='nt', show_default=True,
              help="Type of RDF input file. Must be accepted by RDFlib.")
def main(rdf_file, rdf_file_format, rdf_final):
    wn30 = Namespace("https://w3id.org/own-pt/wn30/schema/")
    graph = Graph()
    graph.parse(rdf_file, format=rdf_file_format)
    # CREATE ADJS.ALL: move every satellite synset from adj.all to adjs.all
    graph.update("""
    DELETE {?s wn30:lexicographerFile "adj.all"}
    INSERT {?s wn30:lexicographerFile "adjs.all" }
    WHERE {
      ?s a wn30:AdjectiveSatelliteSynset .
    }
    """, initNs={'wn30': wn30})
    # fix lexical ids
    # materialize the generator first, since the graph is modified in the loop below
    synsets = list(graph.subjects(wn30['lexicographerFile'], Literal("adjs.all")))
    aux = {}  # label -> last lexical id assigned to that label
    for synset, sorted_word_senses in sort_synsets(graph, synsets):
        for ws in sorted_word_senses:
            label = graph.value(ws, RDFS.label)
            # first sense with a given label gets 0, the next one 1, and so on
            new_lexid = aux[label] = aux.get(label, -1) + 1
            graph.remove((ws, wn30.lexicalId, None))
            graph.add((ws, wn30.lexicalId, Literal(str(new_lexid))))
    graph.serialize(rdf_final, format="nt")


if __name__ == '__main__':
    main()
Running:
python3 fix_adj.py wordnet-en-update.nt wordner-update.nt
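To make the renumbering step easier to follow in isolation, here is a minimal sketch of the same counter idiom on plain Python data, with no RDF involved. The `assign_lexical_ids` helper and the sample labels are hypothetical and only for illustration; the real script takes its ordering from `sort_synsets`.

```python
def assign_lexical_ids(labels):
    """Give each occurrence of a label the next unused id, starting at 0.

    Hypothetical helper: it mirrors the `aux` counter in the script above.
    """
    next_id = {}  # label -> last id handed out for that label
    result = []
    for label in labels:
        new_id = next_id[label] = next_id.get(label, -1) + 1
        result.append((label, new_id))
    return result


# Made-up labels standing in for word senses, assumed to be already sorted.
senses = ["dry", "wet", "dry", "dry", "wet"]
print(assign_lexical_ids(senses))
# prints [('dry', 0), ('wet', 0), ('dry', 1), ('dry', 2), ('wet', 1)]
```

Because ids are handed out in iteration order, the result depends entirely on the order in which the word senses are visited, which is why the script sorts them first.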
@FredsoNerd what we need to confirm is that nothing from the PWN data model has been lost. There are the sense keys, and there are the head and satellite adjectives. For a few examples, we can compare how they appear in the DBFiles with how they appear in the RDF. If we confirm that nothing was lost, it seems to me that this issue is relevant only to the MILL project, which we may consider returning to at some point in the future.
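One way such a spot check could look, as a rough sketch under assumptions: PWN sense keys follow the `lemma%ss_type:lex_filenum:lex_id:head_word:head_id` layout, where satellite adjectives have `ss_type` 5 and carry their cluster head in the last two fields. The `parse_sense_key` helper and the sample keys are illustrative only; they are not taken from the DBFiles or the RDF.

```python
from typing import NamedTuple


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int
    lex_filenum: int
    lex_id: int
    head_word: str  # empty unless the sense is a satellite adjective
    head_id: str    # kept as a string because it may be empty


def parse_sense_key(key):
    """Split a PWN-style sense key into its fields (hypothetical helper)."""
    lemma, rest = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return SenseKey(lemma, int(ss_type), int(lex_filenum), int(lex_id),
                    head_word, head_id)


# Made-up keys for illustration: two satellites and one head adjective.
keys = ["dry%5:00:00:sober:00", "dry%5:00:00:unproductive:00", "dry%3:00:01::"]
satellites = [s for s in map(parse_sense_key, keys) if s.ss_type == 5]

# Both satellites share lemma and lex_id, so those fields alone are ambiguous...
print({(s.lemma, s.lex_id) for s in satellites})
# ...but adding the head word tells them apart.
print(sorted((s.lemma, s.head_word) for s in satellites))
```

If the head word and head id of every satellite sense can be recovered from the RDF in the same way, the shared lexical id of 0 loses no information.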
Let's close this issue for now. After evaluating it a few times, we understand that this would be a suggested change relative to the model adopted by PWN, but we can reassess it in the future.
Satellite adjectives are all from the same lexicographer file, and are differentiated by the head words of their cluster, not by their lexical ids; however, I don't see why they shouldn't have sequential lexical ids -- they all currently have the same lexical id of 0 (this is inherited from PWN 3.0). This query highlights the issue: