unicode-org / inflection

code, data and documentation related to handling inflection problems

Using Wikidata as a lexicon #9

Open nciric opened 6 months ago

nciric commented 6 months ago

In our first meeting we discussed various lexicon formats and the use cases for them. Wikidata already has a flexible format and a passionate community contributing to it. We could bootstrap our effort by contributing to it instead of starting a new lexicon under Unicode.

Before we settle on Wikidata we need to answer a few questions:

  1. Licensing? Is it compatible with our needs (slicing, using in products, converting to a more compact format, adding custom/proprietary words)?
  2. Filtering spam/abuse, and data quality in general.
  3. What are the tools to operate on the lexicon (slicing, adding custom/proprietary elements...)?
grhoten commented 6 months ago

As a part of discussing this point, I'd like to hear how the data is structured. If it's a collection of unannotated words without relationships, it's not that helpful. If it has annotations for a given word, plus the grammeme properties for all of its other surface forms, that would be helpful.

For example, take the Finnish word numeraali (https://en.wiktionary.org/wiki/numeraali). It has a nicely formatted declension table that is easy to read. The template for the declension table is simply "{{fi-decl-risti|numeraal|||a}}". That makes it easy to format a table, but it makes it hard to parse the data without infrastructure to execute the code behind the template, and hard to generate the other surface forms or deduce the grammatical properties of each form. Some of the cell entries don't even have their own page entries, so you have to go by what is in the table.

It's also worth pointing out that Wiktionary tends to put in optional stress markers in the declension tables for several languages, like Russian and Lithuanian. When you go to the actual Wiktionary page for a word, the stress markers are missing. These optional stress markers are helpful for pronunciation, but they're rarely written outside of an elementary school setting.

More clarity around the word relationships and properties in the data would be helpful.

macchiati commented 6 months ago

It would be Wikidata, not Wiktionary (which doesn't have the right license).

So check out https://www.wikidata.org/wiki/Q63116. For that particular term, they don't seem to have declensions.

grhoten commented 6 months ago

Yes, I agree that the license for Wiktionary is not ideal. It is still helpful to reference for illustrative purposes for the problems at hand, and both projects are part of Wikimedia.

Wikidata does seem helpful for finding translations and synonyms of terms. I'm less clear on whether declensions exist at all in Wikidata. If they do exist, I'd like to see an example, and hopefully they're structured in a more parseable way than on Wiktionary.

vrandezo commented 6 months ago

Besides the item for numeral (Q63116, https://www.wikidata.org/wiki/Q63116) mentioned by @macchiati, there are also 31 lexemes that have this item as a sense: query results for the Lexemes (https://w.wiki/9TUu). We don't have one in Finnish for numeraali, unfortunately, but we do have an entry for the Estonian numeraal, L375630 (https://www.wikidata.org/wiki/Lexeme:L375630). Note that Lexeme identifiers start with L, and item identifiers with Q.

Roughly, items are the ontological things, and lexemes are the words. Each lexeme is in a specific language, whereas items are supposed to be language-independent. Each lexeme can have zero or more senses, and each sense can refer to an item. This way we can have a SPARQL query that asks for all lemmas on the lexemes that have a sense pointing to a given item, such as the item for numeral.
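For concreteness, here is a minimal sketch of such a query run against the public endpoint from Python. The use of the requests library and the P5137 ("item for this sense") property are my assumptions, not something defined in this project, so treat it as an illustration rather than a recommended setup:

```python
import requests

# Sketch: find all lexemes (and their lemmas) that have a sense pointing to
# the item for "numeral" (Q63116), as described above.
# Assumptions: senses link to items via P5137 ("item for this sense"), and the
# endpoint predefines the standard Wikidata prefixes (wd:, wdt:, wikibase:, ontolex:).
QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          wikibase:lemma ?lemma ;
          ontolex:sense ?sense .
  ?sense wdt:P5137 wd:Q63116 .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "inflection-lexicon-experiment/0.1 (example)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["lexeme"]["value"], row["lemma"]["value"])
```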

As you can see on the page for numeraal (L375630), this is all structured data. All the data can also be downloaded as JSON (https://www.wikidata.org/wiki/Special:EntityData/L375630.json) or as RDF (https://www.wikidata.org/wiki/Special:EntityData/L375630.rdf), and a SPARQL endpoint (https://query.wikidata.org) can be used to query the data.
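As a small illustration of consuming the per-entity data, this sketch fetches the JSON linked above and lists each form with its grammatical features. The field names ("entities", "forms", "representations", "grammaticalFeatures") follow the Wikibase lexeme JSON format as I understand it; treat them as assumptions to double-check:

```python
import requests

# Sketch: read one lexeme's structured data from the JSON URL given above and
# print each surface form together with its grammatical feature items.
url = "https://www.wikidata.org/wiki/Special:EntityData/L375630.json"
data = requests.get(url, headers={"User-Agent": "inflection-lexicon-experiment/0.1"}).json()
lexeme = data["entities"]["L375630"]

print("lemmas:", {lang: lemma["value"] for lang, lemma in lexeme["lemmas"].items()})
for form in lexeme["forms"]:
    # Each form carries its spelling(s) plus the Q-identifiers of the
    # grammatical features (case, number, ...) that describe it.
    spellings = [rep["value"] for rep in form["representations"].values()]
    print(form["id"], spellings, form["grammaticalFeatures"])
```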

Regarding the questions in the OP:

  1. Licensing? Is it compatible with our needs (slicing, using in products, converting to a more compact format, adding custom/proprietary words)?

All data in Wikidata is available under CC-0.

  2. Filtering spam/abuse, and data quality in general.

Wikidata has a healthy community and has seen more than 500,000 contributors so far. It is the most edited wiki in the world.

  3. What are the tools to operate on the lexicon (slicing, adding custom/proprietary elements...)?

The data can be downloaded in bulk, queried in a structured way using SPARQL, or retrieved per individual lexeme and at an even finer granularity. Editing is possible on-wiki with the community, or the data can be enriched locally.

Happy to answer any more questions!

macchiati commented 6 months ago

Thanks Denny!

grhoten commented 6 months ago

  we have an entry for the Estonian numeraal, L375630 (note, Lexeme identifiers start with L, and item identifiers with Q).

Ooh! That seems interesting. We might be able to use that. It's really good to know how lexemes are referenced.

  The data can be downloaded in bulk

Is this one of those locations? https://dumps.wikimedia.org/wikidatawiki/

FYI, a recent bz2 version of the full dump is 143 GB, but I suspect that we just want to filter out the non-lexeme stuff.

vrandezo commented 6 months ago

If you go here

https://dumps.wikimedia.org/wikidatawiki/entities/

you can find the dump of only the Lexemes (the files whose names start with latest-lexemes). Depending on the format, those are between 0.3 and 1.1 GB compressed. The Wikidata items referenced from the lexemes would not be included, though.
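For anyone who wants to try slicing that dump, here is a rough sketch. It assumes the usual Wikidata JSON dump layout (a single JSON array with one entity object per line) and uses Q1412 as the item for the Finnish language; both assumptions should be verified before relying on the output:

```python
import bz2
import json

# Sketch: stream the lexeme dump without loading it all into memory.
# Assumption: the dump is a JSON array with one entity per line, so every line
# except the opening "[" and closing "]" is a lexeme object (with a trailing comma).
DUMP_PATH = "latest-lexemes.json.bz2"  # downloaded from the dumps directory above

with bz2.open(DUMP_PATH, mode="rt", encoding="utf-8") as dump:
    for line in dump:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        lexeme = json.loads(line)
        # Example filter: keep only Finnish lexemes (Q1412 is assumed to be the
        # Wikidata item for the Finnish language).
        if lexeme.get("language") == "Q1412":
            lemmas = {lang: rep["value"] for lang, rep in lexeme.get("lemmas", {}).items()}
            print(lexeme["id"], lexeme.get("lexicalCategory"), lemmas)
```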