globalwordnet / english-wordnet

The Open English WordNet
https://en-word.net/

Feature: Export as JSON #1031

Open vtempest opened 4 months ago

vtempest commented 4 months ago

25MB JSON: https://raw.githubusercontent.com/vtempest/wiki-phrase-tokenizer/master/data/dictionary-152k.json

Example


 "ability": {
    "cat": 7,
    "defs": [
      "the quality of being able to perform; a quality that permits or facilitates achievement or accomplishment",
      "possession of the qualities (especially mental qualities) required to do something or get something done Example: danger heightened his powers of discrimination"
    ],
    "pos": "n",
    "syns": "power"
  },
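With this flat format, a lookup is just a property access on the parsed object. A minimal sketch (entry abbreviated from the "ability" example above; the `lookup` helper is illustrative, not part of the importer):

```javascript
// Sketch: querying the flat JSON dictionary format shown above.
const dict = {
  ability: {
    cat: 7,
    defs: ["the quality of being able to perform"],
    pos: "n",
    syns: "power",
  },
};

// Keys are lower-cased by the importer, so normalize before lookup.
function lookup(term) {
  return dict[term.toLowerCase()];
}
```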

Script to download, decompress, parse and process into JSON (300 lines) https://github.com/vtempest/wiki-phrase-tokenizer/blob/master/src/dataset-import/dictionary-import.js

Original PR: https://github.com/globalwordnet/english-wordnet/pull/1029

The exact format is up to you. The importer is highly customizable, so we can include synonyms and whatever else is needed. JSON is the best fit because it is what web apps and JavaScript, which are the majority use case, consume directly. In other words, we can create a lossless ~120 MB JSON for the full dataset, and also support compression and selection of specific attributes to produce the smaller JSON that web apps need. There seems to be no reason not to support JSON, given that it is what makes the data usable in AI search and apps.

Another advantage of a JSON prefix trie is near-constant lookups, O(k) in the length of the key rather than the size of the dictionary, instead of looping through the index each time. This alone arguably makes it a better structure than the alternatives for storing dictionary data. Source: https://johnresig.com/blog/javascript-trie-performance-analysis/
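For concreteness, a minimal prefix-trie sketch (illustrative only, not the format proposed in this issue): each lookup walks one node per character, so cost depends on key length, not on how many entries are stored.

```javascript
// Insert a word into the trie, creating child nodes per character.
function trieInsert(root, word, value) {
  let node = root;
  for (const ch of word) {
    node.children ??= {};
    node.children[ch] ??= {};
    node = node.children[ch];
  }
  node.value = value;
}

// Walk the trie one character at a time; O(k) in the word's length.
function trieGet(root, word) {
  let node = root;
  for (const ch of word) {
    node = node.children?.[ch];
    if (!node) return undefined;
  }
  return node.value;
}
```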

jmccrae commented 3 months ago

We already use a YAML format internally that is easily convertible to JSON, and the data is available as JSON on the web interface:

https://en-word.net/json/lemma/autodidact

Perhaps you could be more specific about what you want that is not already delivered?
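For reference, a minimal client against that endpoint might look like this (a sketch: the path pattern is taken from the URL above; the response shape is not assumed here):

```javascript
// Sketch: building a request URL for the JSON endpoint linked above.
const lemmaUrl = (lemma) =>
  `https://en-word.net/json/lemma/${encodeURIComponent(lemma)}`;

// Usage in Node 18+ or a browser (network call not run here):
// const senses = await fetch(lemmaUrl("autodidact")).then((r) => r.json());
```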

vtempest commented 3 months ago

It is not that easy to convert: it needs XML parsers and deep knowledge of the schema, such as the `oewn-` prefixes and `@_` attribute tags. My 300-line file is the importer needed to get a JSON file that is usable within JS web apps. The other formats are not directly importable into JavaScript or Python, which are the dominant languages for these NLP apps.

Example of schema:

 // Attribute keys like "@_writtenForm" follow fast-xml-parser's default
 // attribute prefix; <Sense> may parse as a single object or an array.
 const processedLexicalEntry = LexicalEntry.map((lex) => ({
    writtenForm: lex.Lemma["@_writtenForm"],
    default_pos: lex.Lemma["@_partOfSpeech"],
    senses: Array.isArray(lex.Sense)
      ? lex.Sense.map((lex_s) =>
          parseInt(lex_s["@_synset"].replace("oewn-", ""))
        )
      : [parseInt(lex.Sense["@_synset"].replace("oewn-", ""))],
  }));

  processedLexicalEntry.forEach((lex) => {
    dictionaryObj[lex.writtenForm.toLowerCase()] = {
      defs: lex.senses,
      pos: lex.default_pos,
    };
  });

  const processedSynset = Synset.map((s) => ({
    // e.g. "oewn-00001740-n" -> 1740 (parseInt stops at the "-n" suffix)
    id: parseInt(s["@_id"]?.replace(/oewn-/g, "")),
    def: s.Definition?.Definition,
    example: s.Example?.Example,
    synonyms: s["@_members"]
      ?.replace(/oewn-/g, "")
      .split(" ")
      // strip the trailing part-of-speech marker and underscores
      .map((syn) => syn.replace(/-.$/g, "").replace(/_/g, " "))
      .join(", "),
    pos: s["@_partOfSpeech"],
    cat: categories.indexOf(s["@_lexfile"]),
  }));

Does that look simple and intuitive? No. It took days of work to grok the schema and reduce it to simple JSON that works without errors. Everyone has to replicate these steps.

jmccrae commented 3 months ago

The reason we provide a JavaScript interface at https://en-word.net/ is precisely to allow these kinds of use cases.

For example, this JSFiddle is a simple (if not great) app that looks up the definition of a word using our API:

https://jsfiddle.net/vkjd0x9L/

I can assure you that the YAML version of the source is very easy to work with in Python, and there are libraries such as @goodmami's wn for working with the XML releases.

vtempest commented 3 months ago

Right, but some people need a single JSON data file they can modify and reuse in an app. Calling the remote API is not efficient for that; the only alternative is scraping everything, which is just a waste of bandwidth when the JSON can be provided directly.

There is no documentation of the schema of the YAML OEWN files, so it is not "very easy to work with": I spent days wading through it to understand the synset `@id`s etc. before I could get a workable data structure for a JS web app. We should provide this, since otherwise everyone has to redo these conversions.

Another reason YAML is not good: it is split across many files when it should all be in one. Neither YAML nor XML is directly usable; both have to be converted. At the very least, provide data types, e.g. in TypeScript or JSDoc, for each TermEntry.
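A typedef along the requested lines might look like this (a sketch: the field names follow the "ability" example earlier in the thread, and `TermEntry` is the name suggested above, not an existing type):

```javascript
/**
 * One entry in the flat JSON dictionary (illustrative sketch).
 * @typedef {Object} TermEntry
 * @property {number} cat  - index into the lexfile category list
 * @property {string[]} defs - sense definitions
 * @property {string} pos  - part of speech ("n", "v", "a", "r")
 * @property {string} syns - comma-separated synonyms
 */

/** @type {Object<string, TermEntry>} */
const exampleDict = {
  ability: {
    cat: 7,
    defs: ["the quality of being able to perform"],
    pos: "n",
    syns: "power",
  },
};
```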

jmccrae commented 3 months ago

Okay, so I see two requests here:

vtempest commented 3 months ago

https://airesearch.wiki/functions/src_dataset_import_dictionary_import.importDictionary.html — it is done in TypeDoc. A better typedef would still help.

Please help if you can: my code finds 151k terms, but it is supposed to be 160k. Where is the error in my reading of the schema?

jmccrae commented 3 months ago

Your link does not seem to work.

I am not sure why there would be a discrepancy in the number of terms.