tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

sense tags #753

Closed 43-21 closed 2 months ago

43-21 commented 2 months ago

I have a couple of questions regarding sense tags. I understand that the alt-of tag is added whenever there's an alt_of field. What about "alternative": what's the difference in meaning here? I'd also like to know if "dated" is only used in reference to spelling, or if it can refer to the use of the word itself, too. Are there other tags which don't appear in the Wiktionary entry but do appear in the extracted JSONL file? Lastly, would it be possible to add more tags to differentiate between different spelling alternatives? For example, pre-1918 and е/ё for Russian.

xxyzz commented 2 months ago

"alternative" is probably added for pages that use the "alternative spelling of" template, and "dated" for pages that use the "dated form of" template.

They could also be converted from similar text defined in the long tags.py file. That file holds most of the rules for converting tags, and reading it requires patience...
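To see how these tags actually co-occur in an extracted file, something like the following might help. It is only a sketch: it assumes each JSONL line is one entry with a "senses" list, and that each sense may carry "tags" and "alt_of" fields. The sample records are made up for illustration and are not real extraction output.

```python
import json
from collections import Counter


def tag_usage(jsonl_lines, tags_of_interest=("alt-of", "alternative", "dated")):
    """Count senses carrying each tag, split by whether the sense has alt_of."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for sense in entry.get("senses", []):
            has_alt_of = bool(sense.get("alt_of"))
            for tag in sense.get("tags", []):
                if tag in tags_of_interest:
                    counts[(tag, has_alt_of)] += 1
    return counts


# Made-up sample records (assumption: shaped like wiktextract output lines).
sample = [
    json.dumps({"word": "актер", "lang": "Russian", "senses": [
        {"glosses": ["Alternative spelling of актёр"],
         "tags": ["alt-of", "alternative"],
         "alt_of": [{"word": "актёр"}]}]}),
    json.dumps({"word": "example", "lang": "English", "senses": [
        {"glosses": ["an old-fashioned sense"], "tags": ["dated"]}]}),
]

for (tag, has_alt_of), n in sorted(tag_usage(sample).items()):
    print(tag, "with alt_of" if has_alt_of else "without alt_of", n)
```

To run it over a downloaded file, pass an open file object instead of `sample`; iterating over the file yields one line at a time.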

kristian-clausal commented 2 months ago

We can only create data from text that is already there (or in some cases create special code for constructs we already know). So long as the "е/ё" distinction is mentioned in the text in a consistent form that can be added to tags.py, for example, it should be possible.

43-21 commented 2 months ago

I've looked at tags.py and feel like there could be some improvements. I think it's important to differentiate between dated words and dated spellings. I can create a pull request for this if it isn't more effort to review than to do it yourself; maybe a dated-spelling tag would be appropriate? Is the distinction between dated spellings and obsolete spellings important? To be honest, I haven't really looked at the data you're working with when extracting. The information for the е/ё distinction should be listed as a category, which seems to be lost in the extraction. These things also seem to be tagged in the Wiktionary edit field, but I'm guessing you don't have access to that data.

I do have one last question - is splitting tags by source something you'd consider (i.e. the tags that get cut out when processing the glosses from the raw_glosses, and those that don't)? I think this might make it a little cleaner and, I won't lie, more suitable for my purposes, but I understand if you don't want to do this - after all, it's perfectly possible to differentiate these myself after the extraction & post-processing.
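Doing that split after extraction could look roughly like this. It is a heuristic sketch, not how wiktextract itself tracks tag origins: it assumes that labels cut from a gloss appear as a leading parenthesised, comma-separated prefix in "raw_glosses", and that a tag name matches its label text apart from hyphens. Tags that tags.py renames from a differently worded label would be missed. The function name and the sample sense are hypothetical.

```python
import re

# Matches a leading "(label, label)" prefix in a raw gloss.
LEADING_LABELS = re.compile(r"^\s*\(([^)]*)\)")


def split_tags_by_source(sense):
    """Heuristically split a sense's tags into (from_gloss_label, other)."""
    label_words = set()
    for raw in sense.get("raw_glosses", []):
        match = LEADING_LABELS.match(raw)
        if match:
            # Assumption: labels are comma-separated, e.g. "(dated, colloquial)".
            for part in match.group(1).split(","):
                label_words.add(part.strip().lower())
    tags = sense.get("tags", [])
    from_label = [t for t in tags if t.replace("-", " ") in label_words]
    other = [t for t in tags if t.replace("-", " ") not in label_words]
    return from_label, other


# Made-up sense record for illustration.
sense = {
    "raw_glosses": ["(dated, colloquial) a thing"],
    "glosses": ["a thing"],
    "tags": ["colloquial", "dated", "feminine"],
}
print(split_tags_by_source(sense))
```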

kristian-clausal commented 2 months ago

We can add dated-spelling as a tag, the problem is just determining what the text "dated" means in context. The tagging system doesn't have much context.

We only process the article text.

43-21 commented 2 months ago

I've thought some more about this and I see now that the data from Wiktionary is flawed, making it pointless to change the tags in this case. However, categories might help here. I'm not sure why some categories, such as "Russian terms spelled with Е instead of Ё", are not added.

xxyzz commented 2 months ago

I checked the page актер: it has the "Russian terms spelled with Е instead of Ё" category, but it's inside the "senses" list. Could you post the pages that are missing this category?
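Since the category can sit on a sense rather than at the top level of the entry, a lookup has to check both places. A minimal sketch, assuming each entry may have a "categories" list at word level and inside each item of "senses", with categories stored as plain strings (dicts with a "name" key are also handled in case a data set uses that shape). The sample lines are made up.

```python
import json


def entry_categories(entry):
    """Return the set of category names found anywhere in one entry."""
    found = set()

    def add_all(categories):
        for cat in categories or []:
            # Assumption: plain strings; fall back to a "name" key for dicts.
            if isinstance(cat, dict):
                cat = cat.get("name")
            if cat:
                found.add(cat)

    add_all(entry.get("categories"))
    for sense in entry.get("senses", []):
        add_all(sense.get("categories"))
    return found


def words_in_category(jsonl_lines, category):
    """List the words whose entry or senses carry the given category."""
    words = []
    for line in jsonl_lines:
        if line.strip():
            entry = json.loads(line)
            if category in entry_categories(entry):
                words.append(entry.get("word"))
    return words


CATEGORY = "Russian terms spelled with Е instead of Ё"

# Made-up sample records: one sense-level and one word-level category.
sample = [
    json.dumps({"word": "актер", "senses": [{"categories": [CATEGORY]}]}),
    json.dumps({"word": "трехногий", "categories": [CATEGORY], "senses": []}),
    json.dumps({"word": "стол", "senses": [{"categories": ["Russian nouns"]}]}),
]
print(words_in_category(sample, CATEGORY))
```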

43-21 commented 2 months ago

трехногий, for example. But I looked at актер, too, and it has the same problem. The categories are extracted properly, but most of them get lost during post-processing. Unless that's intended?

kristian-clausal commented 2 months ago

Don't bother using the post-processed data; the raw data is what you get directly out of wiktwords. The post-processing is done in kaikki.org's own repo for... Actually, I think it's just for display purposes. I'll have to ask Tatu about this.

43-21 commented 2 months ago

I understand. It's a shame, as the raw data only seems to be available for the entire English Wiktionary, and downloading it from kaikki.org is much more convenient than running the script myself. I'll have to see how to proceed. Thank you!

kristian-clausal commented 2 months ago

Each page has the raw data in the widget underneath the processed data... Huh, I hadn't realized there's no download link for the raw data. I'll ask Tatu about it. For individual projects, I suggest using the entire data set, of course (it's easier to download the whole thing than thousands of small files), because it already has all the data.

kristian-clausal commented 2 months ago

I finally had the chance to ask Tatu about this (that is, Tatu was available when I remembered this issue), and the post-processed stuff seems like something he tried out at the start and wanted to expand on, but which is not vital to anything. Instead of creating millions of small files of almost duplicate data that might cause problems on the kaikki.org server, he ok'd just changing the download links to point to the raw data instead of the post-processed data.

The end-user difference shouldn't be huge, so we'll not make a big show out of it.

kristian-clausal commented 2 months ago

I've changed the kaikki.org HTML generation so that on each word page, the download link goes to the raw wiktextract output JSONL.

However, for any links of collected words on other pages (which are now marked as "postprocessed" in text), this is not the case; these downloads are generated by a surprisingly complex process that I gave up on trying to replace, because it would have decoupled what was listed on the page and what was actually in the download (the raw download would have probably had more stuff in it, because the raw data is just clumpier).

Word pages are not split into separate pages by sense, so each page covers one word in one language, which we already have easy access to without needing to create new systems to make annoying indexes for sense-id-to-raw-data.

So, the raw downloads are: one for the whole dictionary, found on the Raw Downloads page, and one on each individual word page, containing all the entries for that page.

Postprocessed data can then be found for the whole dictionary (main link) and for the different lists.

No data should be missing; it's just that if you really, really want the postprocessed data for a specific word (which you probably don't really need), then you need to download the postprocessed whole-dictionary data.

I would probably have removed the bigger postprocessed downloads if I could have replaced them with the equivalent raw data, but it wasn't worth the effort to figure out the best way to couple the different kinds of data without knowing exactly what was going on.

Word-specific postprocessed downloads were removed because we already have problems with huge directories and numbers of files on the server.