globalwordnet / english-wordnet

The Open English WordNet
https://en-word.net/
Other
476 stars 58 forks source link

An entry in the index.sense file has 6 fields instead of 5 #1123

Closed drewvid closed 1 month ago

drewvid commented 1 month ago

Release format Delete as required: WNDB

Describe the bug One of the entries in the index.sense file has six fields instead of 5: capital:_critique_of_political_economy%1:10:01:: 07423067 1 0

To Reproduce Steps to reproduce the behavior: Compare the number of fields in these two entries: line 28564 -> capital : _critique_of_political_economy%1 : 10 : 01 : : 07423067 1 0 line 28565 -> capital_account%1 : 21 : 01 : : 13565485 2 0

Expected behavior I would expect the number of fields to be 5

jmccrae commented 1 month ago

Could you provide some more information about the environment that this is occurring in?

The correct way to understand sense keys is to first split the field by % and then by : as described here.

In this case an identifier is valid if it has 4 : after the %, which these identifiers follow.

1313ou commented 1 month ago

EWE and OEWN diverge in how to generate such a sensekey since #1fbd3c1b198270f000eaaa796380e6a1bd221f91

OEWN def get_sense_key() has (line 124) this:

lemma = (e.lemma.written_form
        .replace(" ", "_")
        .replace("&apos", "'")
        .replace(":", "-cn-")
        .lower())

while EWE pub fn get_sense_key() has (line 157):

    let lemma = lemma.replace(" ", "_").replace("&apos", "'").to_lowercase();

But indeed the sensekey is the rogue one if you split with re.split('[%:]', sk).

1313ou commented 1 month ago

I think we should avoid sensekeys with more than one percent and more than 4 colons. The referenced code above does not consider the head lemma, which itself may have % and : , if we accept lemmas in sensekeys to have them unescaped. This may lead to unparsable sensekeys, which goes against the original intention.

jmccrae commented 1 month ago

This one rogue lemma with a colon in it has created so many headaches!

The Princeton WordNet command line tools seem to work with the most recent release

-> % wordnet "Capital: Critique of Political Economy" -synsn

Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun capital:_critique_of_political_economy

1 sense of capital: critique of political economy                       

Sense 1
Capital: Critique of Political Economy, Das Kapital, Capital
       INSTANCE OF=> book

They interpret lemmas by simply lowercasing everything and replacing spaces with underscore. I would call this the 'reference implementation' of the WNDB format, i.e., if it works with the command line tool, then it is a correct implementation. Thus, an unescaped : in the lemma is correct. (This of course means we can never have a lemma with a % in it).

The issue with escaping the : would be that you would not be able to use the original Princeton tools to look for that lemma as they do not implement any escaping.

jmccrae commented 1 month ago

EWE and OEWN diverge in how to generate such a sensekey since #1fbd3c1b198270f000eaaa796380e6a1bd221f91

OEWN def get_sense_key() has (line 124) this:

lemma = (e.lemma.written_form
        .replace(" ", "_")
        .replace("&apos", "'")
        .replace(":", "-cn-")
        .lower())

while EWE pub fn get_sense_key() has (line 157):

    let lemma = lemma.replace(" ", "_").replace("&apos", "'").to_lowercase();

But indeed the sensekey is the rogue one if you split with re.split('[%:]', sk).

This is in fact not a bug. EWE is reading the sense key from YAML (capital:_critique_of_political_economy%1:10:01::), where as this repo's validation is reading from XML where the : has been escaped so that it is a valid XML identifier (oewn-capital-cn-_critique_of_political_economy__1.10.01..). I will make a small documentary fix to make this clearer.

jmccrae commented 1 month ago

EWE and OEWN diverge in how to generate such a sensekey since #1fbd3c1b198270f000eaaa796380e6a1bd221f91 OEWN def get_sense_key() has (line 124) this:

lemma = (e.lemma.written_form
        .replace(" ", "_")
        .replace("&apos", "'")
        .replace(":", "-cn-")
        .lower())

while EWE pub fn get_sense_key() has (line 157):

    let lemma = lemma.replace(" ", "_").replace("&apos", "'").to_lowercase();

But indeed the sensekey is the rogue one if you split with re.split('[%:]', sk).

This is in fact not a bug. EWE is reading the sense key from YAML (capital:_critique_of_political_economy%1:10:01::), where as this repo's validation is reading from XML where the : has been escaped so that it is a valid XML identifier (oewn-capital-cn-_critique_of_political_economy__1.10.01..). I will make a small documentary fix to make this clearer.

Turns it out it was a bug... or more accurately a bug incorrectly fixed. See #1124

drewvid commented 1 month ago

I wrote my own code to parse the entries in the worknet3.1 database using the sense indexes to perform a seek. When reading in the index.sense file, I was splitting each line with the ":" character. I guessed I would have to change the way I do this, so thanks, everyone, for putting me straight. Problem solved :) Some links to my online apps where I use WordNet are: 1) https://realtimetools.co.uk/wnbrowser 2) https://realtimetools.co.uk/wordplay 3) https://realtimetools.co.uk/connect

As there was only one entry with an extra ":" character, I was simply ignoring it, but now I know I should change the way I parse each entry.

jmccrae commented 1 month ago

@1313ou are you okay to close as wontfix?

drewvid commented 1 month ago

Sure, I just needed to catch up with the times! Nothing needs fixing!

drewvid commented 1 month ago

Thanks for explaining how I should update my code.

1313ou commented 1 month ago

I agree that the Python function is called in validation, not generation.

But we have to consider that if we accept a YAML sensekey with unescaped % and : we will will run into problems sooner than later. In fact I'm 100% sure. BTW will we have a oewn-100%%4:0:0:fully:00 ?

Examples with colons would involve ratios: S:N (signal to noise) , used in signal processing and electronics w:v (weight to volume ratio) HDL:LDL Ratio of High-Density Lipoprotein to Low-Density Lipoprotein, used in medical science. C:N ratio Carbon to Nitrogen ratio, important in soil science and composting. V:H ratio Vertical to Horizontal ratio, used in engineering and construction. I:O ratio Input to Output ratio, used in various fields including economics and computer science.

Non-ratios examples: 11:11 Often considered a "magic" or "wishing" time, associated with numerology and superstition. 4:20 Associated with cannabis culture, used as a code for consumption or as a time for related events. 15:00 In military and aviation contexts, often pronounced "fifteen hundred hours."

A high %CPU (Percent of CPU usage) would be required by legacy tools to handle the derived sensekeys.

Legacy tools will have to adapt one day ! If only to handle UTF-8.

Otherwise I don' t mind closing the issue but the issue will remain.

jmccrae commented 1 month ago

I agree that the Python function is called in validation, not generation.

But we have to consider that if we accept a YAML sensekey with unescaped % and : we will will run into problems sooner than later. In fact I'm 100% sure. BTW will we have a oewn-100%%4:0:0:fully:00 ?

Examples with colons would involve ratios: S:N (signal to noise) , used in signal processing and electronics w:v (weight to volume ratio) HDL:LDL Ratio of High-Density Lipoprotein to Low-Density Lipoprotein, used in medical science. C:N ratio Carbon to Nitrogen ratio, important in soil science and composting. V:H ratio Vertical to Horizontal ratio, used in engineering and construction. I:O ratio Input to Output ratio, used in various fields including economics and computer science.

Non-ratios examples: 11:11 Often considered a "magic" or "wishing" time, associated with numerology and superstition. 4:20 Associated with cannabis culture, used as a code for consumption or as a time for related events. 15:00 In military and aviation contexts, often pronounced "fifteen hundred hours."

A high %CPU (Percent of CPU usage) would be required by legacy tools to handle the derived sensekeys.

Legacy tools will have to adapt one day ! If only to handle UTF-8.

Otherwise I don' t mind closing the issue but the issue will remain.

I tested and found a solution in #1125. The key point is that you split on the last %. This avoids any ambiguity and keeps compatibility with all formats, including with : and Unicode characters.