globalwordnet / english-wordnet

The Open English WordNet
https://en-word.net/
Other
465 stars 56 forks source link

EWE and OEWN script don't write lines of the same length #1054

Closed 1313ou closed 3 weeks ago

1313ou commented 3 weeks ago

Recent commit rewrites shorter lines for definitions (and BTW examples).

The different scripts (EWE's and OEWN's) must have different YAML-output configurations. The latter uses yaml.dump(data, stream,Dumper, **kwds) from the library, with hardly any argument, so it must rely on defaults. I don't know about the former.

Could they be made to agree on line length ? If they don't we will have different formats depending on the tool used, generating spurious 'flickering' changes.

More generally, a default YAML configuration is desirable.

The 'normalized' lines I produced (in the quotation/apostrophe overhaul) used the following code, adapted from wordnet_yaml.py:

def save_synsets(wn, out_repo):
    synset_yaml = {}
    for synset in wn.synsets:
        s = {}
        if synset.ili and synset.ili != "in":
            s["ili"] = synset.ili
        s["partOfSpeech"] = synset.part_of_speech.value
        definitions = [wordnet_yaml.definition_to_yaml(wn, d) for d in synset.definitions]
        s["definition"] = definitions
        if synset.examples:
            examples = [wordnet_yaml.example_to_yaml(wn, x) for x in synset.examples]
            s["example"] = examples
        if synset.source:
            s["source"] = synset.source
        if synset.wikidata:
            s["wikidata"] = synset.wikidata
        for r in synset.synset_relations:
            if r.rel_type not in wordnet_yaml.ignored_symmetric_synset_rels:
                if r.rel_type.value not in s:
                    s[r.rel_type.value] = [r.target[wordnet_yaml.KEY_PREFIX_LEN:]]
                else:
                    s[r.rel_type.value].append(r.target[wordnet_yaml.KEY_PREFIX_LEN:])
        if synset.lex_name not in synset_yaml:
            synset_yaml[synset.lex_name] = {}
        synset_yaml[synset.lex_name][synset.id[wordnet_yaml.KEY_PREFIX_LEN:]] = s
        s["members"] = members(wn, synset)
        # BUG : this does not preserve order
        # s["members"] = entries_ordered(wn, synset.id)
    for key, synsets in synset_yaml.items():
        with wordnet_yaml.codecs.open("%s/src/yaml/%s.yaml" % (out_repo, key), "w", "utf-8") as output:
            output.write(wordnet_yaml.yaml.dump(synsets, default_flow_style=False, allow_unicode=True))

def members(wn, synset):
    """ preserve order in synset """
    return [wn.id2entry[m].lemma.written_form for m in synset.members]

ALSO: Incidentally, the code was adapted from wordnet_yaml.py, because the original (and still current) code contains to my sense a bug which makes the saving non deterministic in members' order, which it changes. The culprit being entries_ordered function. Testing involves loading the yaml data and rewriting it without any change. This should be a blank operation but is not.

jmccrae commented 3 weeks ago

Thanks for the comment. Actually as part of #1030, I plan to remove the write functionality from the scripts in OEWN entirely, so this bug is a "won't fix".

It would be good however to have a Python based method for making programmatic changes in the YAML (possibly integrated with @goodmami's wn library), that doesn't "flicker" the lines.

The Rust system for EWE uses some very custom code, to attempt to get the same result as Python, but it obviously fails from time-to-time on this. I will look into some of the "flickers" and see if I can fix this again.

jmccrae commented 3 weeks ago

The following commit https://github.com/globalwordnet/english-wordnet/commit/44d25fc94b2ddf034984d6666fb79c01a082a263 now adopts the use of Python formatting so the following should not change the source YAML

from glob import glob
import yaml

for f in glob("src/yaml/*.yaml"):
    data = yaml.load(open(f), Loader=yaml.CLoader)
    with open(f, "w") as out:
        out.write(yaml.dump(data, default_flow_style=False,
            allow_unicode=True))

I believe this should close the issue.

@1313ou does this satisfy you?

1313ou commented 3 weeks ago

It would be nice to have a Python idempotent load()-save() function pair for the wn model/object. It doesn't shock me as irrelevant to the minimal core library in /scripts. Quite the contrary : as I see it the core must consist of a load()/save() pair. The XML conversion is certainly desirable as part of the core too, though I don't use it (I'd rather use YAML directly).

One is then free to implement textual/structural modifications of the model/object as one likes and EWE is certainly a good option. Modification code doesn't have to be part of the core and if there's on thing to ditch it's this ... and offload editing to EWE ... and constrain EWE to output the same format.

The function above does the job for the synsets with a one-line modification. So if the save function is 'fixed' this way it will do the job too. And the code can be used as a library, which is what I did. So I'm confident the goal can be reached with minimal work.

Also, one can possibly tweak the configuration passed to yaml.dump() so that the line length is adjusted as desired. Still to be explored.

My understanting is that @goodmami's wn library works wih XML not YAML. Or does it ?

1313ou commented 3 weeks ago

Sorry my previous comment was started before I got yours. In reference to your last comment: If it works, then I'll be satisfied. The question is : does it produce the same format as EWE ? This is necessary if we want to avoid the "flicker" in Git with PR using different tools ?

1313ou commented 3 weeks ago

Was the commit produced by EWE ?

goodmami commented 3 weeks ago

Python based method for making programmatic changes in the YAML (possibly integrated with @goodmami's wn library)

...

My understanting is that @goodmami's wn library works wih XML not YAML. Or does it ?

Wn does not read or write the YAML format, nor does it directly support editing of wordnets (but see https://github.com/Hypercookie/wn-editor for an extension for editing). You can directly manipulate the database via SQL statements or work with in-memory representations of WN-LMF files, but neither of these are very user-friendly.

I'm not entirely opposed to adding support for the YAML format, assuming it has stabilized.

1313ou commented 3 weeks ago

After examining and testing your suggested code I can say it works fine for normalizing the YAML. It looks like long lines are wrapped at 90 or thereabouts. So I take it PRs are to be normalized this way to avoid "flicker".

But I realize now my comment was two-fold and also about the need for a minimal library, that I advocated in an earlier comment, with, at its core, a load_model()/save_model() pair that brings a live wn object into memory and persists it to files. The current definition of the model will do but can be optimized and the current solution I use, adapted from your code, is not satisfactory in that it has to be patched.

If you replace line wordnet_yaml:416

s["members"] = entries_ordered(wn, synset.id) with s["members"] = [wn.id2entry[m].lemma.written_form for m in synset.members] I think it will do the job.

jmccrae commented 3 weeks ago

Was the commit produced by EWE ?

Yes, this was produced by EWE and verified by the script above.

After examining and testing your suggested code I can say it works fine for normalizing the YAML. It looks like long lines are wrapped at 90 or thereabouts. So I take it PRs are to be normalized this way to avoid "flicker".

It is 80 characters. EWE was wrapping at 80 UTF-8 bytes rather than characters, which was causing the flicker

I'm not entirely opposed to adding support for the YAML format, assuming it has stabilized.

I meant that we should have an internal load/save script that uses the wn library as its in-memory model. Thanks for the link to https://github.com/Hypercookie/wn-editor, this seems very useful, for implementing this.

If you replace line wordnet_yaml:416

  s["members"] = entries_ordered(wn, synset.id)

with s["members"] = [wn.id2entry[m].lemma.written_form for m in synset.members]

Yes, implemented. It seems that this was some legacy hack based on sense IDs that does not apply any more.

1313ou commented 3 weeks ago

Fine now. Thanks John.

1313ou commented 3 weeks ago

Basic load/save support could be this oewnio.py file now:

#!/usr/bin/python3

import os
import pickle

import wordnet_yaml

def load(repo):
    """ Load model from repo (home, not src/yaml dir)"""
    current_dir = os.getcwd()
    os.chdir(repo)
    wn = wordnet_yaml.load()
    os.chdir(current_dir)
    return wn

def save(wn, out_repo):
    """ Save model to repo (home, not src/yaml dir) """
    current_dir = os.getcwd()
    os.chdir(out_repo)
    wordnet_yaml.save(wn)
    os.chdir(current_dir)

def normalize(repo):
    """ Normalize repo (home, not src/yaml dir)"""
    for f in glob(f"{repo}/src/yaml/*.yaml"):
        data = yaml.load(open(f), Loader=yaml.CLoader)
        with open(f, "w") as out:
            out.write(yaml.dump(data, default_flow_style=False, allow_unicode=True))

def load_pickle(path, file='wn.pickle'):
    """ Load model from pickle file in path """
    return pickle.load(open(f"{path}/{file}", "rb"))

def save_pickle(wn, path, file='wn.pickle'):
    """ Save model to pickle file in path """
    pickle.dump(wn, open(f"{path}/{file}", "wb"))