Closed drewvid closed 1 month ago
Could you provide some more information about the environment that this is occurring in?
The correct way to understand sense keys is to first split the field by %
and then by :
as described here.
In this case an identifier is valid if it has 4 :
after the %
, which these identifiers follow.
EWE and OEWN diverge in how to generate such a sensekey since #1fbd3c1b198270f000eaaa796380e6a1bd221f91
OEWN def get_sense_key() has (line 124) this:
lemma = (e.lemma.written_form
.replace(" ", "_")
.replace("&apos", "'")
.replace(":", "-cn-")
.lower())
while EWE pub fn get_sense_key() has (line 157):
let lemma = lemma.replace(" ", "_").replace("&apos", "'").to_lowercase();
But indeed the sensekey is the rogue one if you split with re.split('[%:]', sk).
I think we should avoid sensekeys with more than one percent and more than 4 colons. The referenced code above does not consider the head lemma, which itself may have % and : , if we accept lemmas in sensekeys to have them unescaped. This may lead to unparsable sensekeys, which goes against the original intention.
This one rogue lemma with a colon in it has created so many headaches!
The Princeton WordNet command line tools seem to work with the most recent release
-> % wordnet "Capital: Critique of Political Economy" -synsn
Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun capital:_critique_of_political_economy
1 sense of capital: critique of political economy
Sense 1
Capital: Critique of Political Economy, Das Kapital, Capital
INSTANCE OF=> book
They interpret lemmas by simply lowercasing everything and replacing spaces with underscore. I would call this the 'reference implementation' of the WNDB format, i.e., if it works with the command line tool, then it is a correct implementation. Thus, an unescaped :
in the lemma is correct. (This of course means we can never have a lemma with a %
in it).
The issue with escaping the :
would be that you would not be able to use the original Princeton tools to look for that lemma as they do not implement any escaping.
EWE and OEWN diverge in how to generate such a sensekey since #1fbd3c1b198270f000eaaa796380e6a1bd221f91
OEWN def get_sense_key() has (line 124) this:
lemma = (e.lemma.written_form .replace(" ", "_") .replace("&apos", "'") .replace(":", "-cn-") .lower())
while EWE pub fn get_sense_key() has (line 157):
let lemma = lemma.replace(" ", "_").replace("&apos", "'").to_lowercase();
But indeed the sensekey is the rogue one if you split with re.split('[%:]', sk).
This is in fact not a bug. EWE is reading the sense key from YAML (capital:_critique_of_political_economy%1:10:01::
), where as this repo's validation is reading from XML where the :
has been escaped so that it is a valid XML identifier (oewn-capital-cn-_critique_of_political_economy__1.10.01..
). I will make a small documentary fix to make this clearer.
EWE and OEWN diverge in how to generate such a sensekey since #1fbd3c1b198270f000eaaa796380e6a1bd221f91 OEWN def get_sense_key() has (line 124) this:
lemma = (e.lemma.written_form .replace(" ", "_") .replace("&apos", "'") .replace(":", "-cn-") .lower())
while EWE pub fn get_sense_key() has (line 157):
let lemma = lemma.replace(" ", "_").replace("&apos", "'").to_lowercase();
But indeed the sensekey is the rogue one if you split with re.split('[%:]', sk).
This is in fact not a bug. EWE is reading the sense key from YAML (
capital:_critique_of_political_economy%1:10:01::
), where as this repo's validation is reading from XML where the:
has been escaped so that it is a valid XML identifier (oewn-capital-cn-_critique_of_political_economy__1.10.01..
). I will make a small documentary fix to make this clearer.
Turns it out it was a bug... or more accurately a bug incorrectly fixed. See #1124
I wrote my own code to parse the entries in the worknet3.1 database using the sense indexes to perform a seek. When reading in the index.sense file, I was splitting each line with the ":" character. I guessed I would have to change the way I do this, so thanks, everyone, for putting me straight. Problem solved :) Some links to my online apps where I use WordNet are: 1) https://realtimetools.co.uk/wnbrowser 2) https://realtimetools.co.uk/wordplay 3) https://realtimetools.co.uk/connect
As there was only one entry with an extra ":" character, I was simply ignoring it, but now I know I should change the way I parse each entry.
@1313ou are you okay to close as wontfix
?
Sure, I just needed to catch up with the times! Nothing needs fixing!
Thanks for explaining how I should update my code.
I agree that the Python function is called in validation, not generation.
But we have to consider that if we accept a YAML sensekey with unescaped % and : we will will run into problems sooner than later. In fact I'm 100% sure. BTW will we have a oewn-100%%4:0:0:fully:00 ?
Examples with colons would involve ratios: S:N (signal to noise) , used in signal processing and electronics w:v (weight to volume ratio) HDL:LDL Ratio of High-Density Lipoprotein to Low-Density Lipoprotein, used in medical science. C:N ratio Carbon to Nitrogen ratio, important in soil science and composting. V:H ratio Vertical to Horizontal ratio, used in engineering and construction. I:O ratio Input to Output ratio, used in various fields including economics and computer science.
Non-ratios examples: 11:11 Often considered a "magic" or "wishing" time, associated with numerology and superstition. 4:20 Associated with cannabis culture, used as a code for consumption or as a time for related events. 15:00 In military and aviation contexts, often pronounced "fifteen hundred hours."
A high %CPU (Percent of CPU usage) would be required by legacy tools to handle the derived sensekeys.
Legacy tools will have to adapt one day ! If only to handle UTF-8.
Otherwise I don' t mind closing the issue but the issue will remain.
I agree that the Python function is called in validation, not generation.
But we have to consider that if we accept a YAML sensekey with unescaped % and : we will will run into problems sooner than later. In fact I'm 100% sure. BTW will we have a oewn-100%%4:0:0:fully:00 ?
Examples with colons would involve ratios: S:N (signal to noise) , used in signal processing and electronics w:v (weight to volume ratio) HDL:LDL Ratio of High-Density Lipoprotein to Low-Density Lipoprotein, used in medical science. C:N ratio Carbon to Nitrogen ratio, important in soil science and composting. V:H ratio Vertical to Horizontal ratio, used in engineering and construction. I:O ratio Input to Output ratio, used in various fields including economics and computer science.
Non-ratios examples: 11:11 Often considered a "magic" or "wishing" time, associated with numerology and superstition. 4:20 Associated with cannabis culture, used as a code for consumption or as a time for related events. 15:00 In military and aviation contexts, often pronounced "fifteen hundred hours."
A high %CPU (Percent of CPU usage) would be required by legacy tools to handle the derived sensekeys.
Legacy tools will have to adapt one day ! If only to handle UTF-8.
Otherwise I don' t mind closing the issue but the issue will remain.
I tested and found a solution in #1125. The key point is that you split on the last %
. This avoids any ambiguity and keeps compatibility with all formats, including with :
and Unicode characters.
Release format Delete as required: WNDB
Describe the bug One of the entries in the index.sense file has six fields instead of 5: capital:_critique_of_political_economy%1:10:01:: 07423067 1 0
To Reproduce Steps to reproduce the behavior: Compare the number of fields in these two entries: line 28564 -> capital : _critique_of_political_economy%1 : 10 : 01 : : 07423067 1 0 line 28565 -> capital_account%1 : 21 : 01 : : 13565485 2 0
Expected behavior I would expect the number of fields to be 5