vtempest opened this issue 4 months ago
We already use a YAML format internally that is easily convertible to JSON, and the data is available as JSON on the web interface:
https://en-word.net/json/lemma/autodidact
Perhaps you could be more specific about what you want that is not already delivered?
It is not that easy to convert; it needs various XML parsers and deep knowledge of the schema, such as the oewn @tags. My 300-line file is the importer needed to produce a JSON file that is usable within JS web apps. The other formats are not directly importable into JavaScript and Python, which are the dominant languages for all these NLP apps.
Example of the schema handling required:

```javascript
// Flatten each LexicalEntry: keep the written form, part of speech,
// and the numeric synset ids referenced by its senses.
const processedLexicalEntry = LexicalEntry.map((lex) => ({
  writtenForm: lex.Lemma["@_writtenForm"],
  default_pos: lex.Lemma["@_partOfSpeech"],
  senses: Array.isArray(lex.Sense)
    ? lex.Sense.map((lex_s) =>
        parseInt(lex_s["@_synset"].replace("oewn-", ""))
      )
    : [parseInt(lex.Sense["@_synset"].replace("oewn-", ""))],
}));

// Index entries by lowercase written form.
processedLexicalEntry.forEach((lex) => {
  dictionaryObj[lex.writtenForm.toLowerCase()] = {
    defs: lex.senses,
    pos: lex.default_pos,
  };
});

// Flatten each Synset: numeric id, definition, example, and a
// comma-separated synonym list derived from the member ids.
const processedSynset = Synset.map((s) => ({
  id: parseInt(s["@_id"]?.replace(/oewn-/g, "")),
  def: s.Definition?.Definition,
  example: s.Example?.Example,
  synonyms: s["@_members"]
    ?.replace(/oewn-/g, "")
    .split(" ")
    .map((syn) => syn.replace(/-.$/g, "").replace(/_/g, " "))
    .join(", "),
  pos: s["@_partOfSpeech"],
  cat: categories.indexOf(s["@_lexfile"]),
}));
```
Does that look simple and intuitive? No. It took days of work to grok the schema, map it cleanly to JSON, and make it work without errors. Everyone has to replicate these steps.
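To make the target concrete, here is a sketch of the flattened structure the conversion above aims for. The sample word, id, and definition values are made up for illustration; only the field names (`defs`, `pos`, `id`, `def`, `synonyms`, `cat`) come from the code above:

```javascript
// Hypothetical sample of the flattened dictionary: each lowercase
// writtenForm maps to its numeric sense ids and default part of speech.
const dictionaryObj = {
  autodidact: { defs: [10000365], pos: "n" },
};

// Synsets reduced to plain objects with numeric ids (sample values invented).
const synsets = [
  {
    id: 10000365,
    def: "a person who has taught themselves",
    synonyms: "autodidact, self-taught person",
    pos: "n",
    cat: 17,
  },
];

// Lookup chain: word -> sense ids -> synset records.
const entry = dictionaryObj["autodidact"];
const senses = entry.defs.map((id) => synsets.find((s) => s.id === id));
console.log(senses[0].synonyms); // "autodidact, self-taught person"
```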
The reason we provide a JavaScript interface at https://en-word.net/ is precisely to allow these kinds of use cases.
For example, this JSFiddle is a simple (if not polished) app that looks up the definition of a word using our API:
https://jsfiddle.net/vkjd0x9L/
I can assure you that the YAML version of the source is very easy to work with in Python, and there are libraries such as @goodmami's wn for working with the XML releases.
Right, but some people need a single JSON data file they can modify and reuse in an app. Calling the remote API is not efficient, and scraping everything is just a waste of bandwidth when the JSON could be provided directly.
There is no documentation of the schema of the YAML oewn files, so it is not "very easy to work with": I spent days wading through it to understand the synset @id values etc. before I could get a workable data type to use in a JS web app. We should provide this, since otherwise everyone has to repeat these conversions.
And another reason YAML is not good: it is split across many files when it should all be in one. Neither YAML nor XML is directly usable; everything has to be converted. At the least, provide data types, e.g. in TypeScript or JSDoc, for each TermEntry.
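As a starting point, a JSDoc typedef for the entry shapes could look like this. The field names follow my importer's output above; treat this as a sketch, not an official schema:

```javascript
/**
 * One dictionary entry, keyed by lowercase writtenForm.
 * @typedef {Object} TermEntry
 * @property {number[]} defs - synset ids (numeric part of the oewn- id)
 * @property {string} pos - default part of speech, e.g. "n", "v", "a", "r"
 */

/**
 * One processed synset record.
 * @typedef {Object} SynsetEntry
 * @property {number} id
 * @property {string} [def] - definition text, if present
 * @property {string} [example] - example sentence, if present
 * @property {string} synonyms - comma-separated member lemmas
 * @property {string} pos
 * @property {number} cat - index into the lexfile categories list
 */

/** @type {Object.<string, TermEntry>} */
const dictionaryObj = {};
```

JSDoc has the advantage of working in plain .js files, while still giving editors and TypeScript's checker the same autocomplete that a .d.ts file would.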
Okay, so I see two requests here:
It is done in TypeDoc, with a better typedef: https://airesearch.wiki/functions/src_dataset_import_dictionary_import.importDictionary.html
Please help if you can... my code shows 151k terms, but it is supposed to be 160k. Where is the error in my reading of the schema?
Your link does not seem to work.
I am not sure why there would be a discrepancy in the number of terms.
25MB JSON: https://raw.githubusercontent.com/vtempest/wiki-phrase-tokenizer/master/data/dictionary-152k.json
Example script to download, decompress, parse, and process into JSON (300 lines): https://github.com/vtempest/wiki-phrase-tokenizer/blob/master/src/dataset-import/dictionary-import.js
Original PR: https://github.com/globalwordnet/english-wordnet/pull/1029
The exact format is up to you. The importer is highly customizable, and we can include all the synonyms etc. that are needed. JSON is best, since it is what web apps and JavaScript need, which is the majority of use cases. In other words, we can create a lossless 120 MB JSON for one kind of use case, and also allow compression and selection of specific attributes to create a smaller JSON, which is vital for web apps. There seems to be no reason not to support JSON, given that it is exactly what is needed to make the data useful in AI search and apps.
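The attribute-selection idea can be sketched as a simple projection step over the synset records. The function name `selectAttributes` and the sample data are hypothetical, not from the importer:

```javascript
// Keep only the requested fields from each record to shrink the output JSON.
function selectAttributes(records, fields) {
  return records.map((r) =>
    Object.fromEntries(fields.filter((f) => f in r).map((f) => [f, r[f]]))
  );
}

// Invented sample record for illustration.
const full = [
  { id: 1, def: "a word", example: "sample", synonyms: "x, y", pos: "n", cat: 3 },
];

console.log(JSON.stringify(selectAttributes(full, ["id", "def", "pos"])));
// [{"id":1,"def":"a word","pos":"n"}]
```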
Another advantage of the JSON prefix trie is O(1) lookups with respect to dictionary size, instead of having to loop through the index each time. There is "unanimous consensus" that this feature alone makes it better than any other data structure for storing dictionary data. Source: https://johnresig.com/blog/javascript-trie-performance-analysis/
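For illustration, a minimal prefix trie in plain JavaScript (a sketch, not the importer's actual structure; lookup cost is proportional to key length and independent of how many entries the dictionary holds):

```javascript
// Build a trie from word -> value pairs. Each character descends one node;
// "$" marks end-of-word and stores the payload.
function buildTrie(entries) {
  const root = {};
  for (const [word, value] of Object.entries(entries)) {
    let node = root;
    for (const ch of word) {
      node = node[ch] ??= {};
    }
    node.$ = value;
  }
  return root;
}

// Lookup walks one node per character; no scan over the whole index.
function lookup(root, word) {
  let node = root;
  for (const ch of word) {
    node = node[ch];
    if (!node) return undefined;
  }
  return node.$;
}

const trie = buildTrie({ cat: 1, car: 2, card: 3 });
console.log(lookup(trie, "car")); // 2
console.log(lookup(trie, "ca")); // undefined (prefix, not a full word)
```

Because the trie is itself a plain nested object, it serializes directly with `JSON.stringify`, so it could ship as the single-file JSON format discussed above.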