stephenmk / Jitendex

A free, offline, and openly licensed Japanese-to-English dictionary. Updates weekly!
https://jitendex.org
Creative Commons Attribution Share Alike 4.0 International

Example Sentences (from Tanaka Corpus) sometimes linked to incorrect sense (ex: in 掛ける entry) #79

Closed Kimeiga closed 6 months ago

Kimeiga commented 6 months ago

Hi Stephen! Amazing work, thank you for contributing to the world's knowledge!

I've noticed some issues with the Tanaka Corpus and I'm not sure where to discuss them, but since I intend to use Yomitan as my popup dictionary of choice for some time, I figured I would mention them here. The same problem comes up in other projects that use the Tanaka Corpus, of course (e.g. Shirabe Jisho for iOS).

If you look up the dictionary entry for 掛ける in Jitendex, there are many examples of sentences from Tanaka Corpus being assigned to the wrong sense.

Sense 9 means "to multiply", but the multiplication sentence is listed under sense 5.

Sense 11 means "to take a seat" and includes the correct reference, but its example sentence is listed under sense 22, "to apply (insurance)".

Is there anything that can be done about this?

I read on the EDRDG wiki that the Tanaka Corpus is now within Tatoeba and it is its new "home". Does this mean each time we see something like this, we should correct it there?

Here's one of those sentences:

https://tatoeba.org/en/sentences/show/236991

I have an account with Tatoeba, but I'm afraid I don't know how to edit the sentences. Even if I did, would I be able to change the attribution information that links a sentence to one of the senses in JMdict?

Just bringing this to your attention in case it is not possible to change things at the source (the Tanaka Corpus itself) and we might need to make a file in Jitendex for all the manually assigned corrections or something.
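To make the "file of manual corrections" idea concrete, here is a minimal sketch of how such overrides could be applied. Everything in it is an assumption for illustration: the `SentenceLink` type, the `OVERRIDES` table, the sentence IDs, and the sense numbers are hypothetical, and this is not how Jitendex actually stores or processes its sentence links.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SentenceLink:
    """Hypothetical record linking one example sentence to one sense of a headword."""
    sentence_id: int
    headword: str
    sense: int  # 1-based sense number as given by the upstream data

# Hypothetical corrections table: (sentence ID, headword) -> corrected sense number.
# The IDs and sense numbers are placeholders, not real Tatoeba or JMdict data.
OVERRIDES = {
    (1001, "掛ける"): 9,
}

def apply_overrides(links, overrides):
    """Return new links, replacing the sense wherever a correction exists."""
    return [
        replace(link, sense=overrides.get((link.sentence_id, link.headword), link.sense))
        for link in links
    ]

upstream = [
    SentenceLink(1001, "掛ける", 5),   # wrong upstream, corrected by the table
    SentenceLink(1002, "掛ける", 11),  # no correction, passes through unchanged
]
corrected = apply_overrides(upstream, OVERRIDES)
print(corrected)
```

One drawback of this approach is that the table has to be revisited whenever the upstream link is fixed or the senses are renumbered, so fixing the data at the source is probably preferable where possible.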

Kimeiga commented 6 months ago

One interesting observation is that I believe the example sentences are misassigned in entire groups at a time. In Shirabe Jisho it is clear that all the multiplication kakerus are grouped with the spend-time kakerus, and all the sit-down kakerus are mixed in with the insurance kakerus.

Kimeiga commented 6 months ago

After some research, I'm not sure, but I suspect this happened because entries in JMdict have been removed and others added, which may have caused a string of off-by-one errors over time that shifted these groups of example sentences around.

https://www.edrdg.org/jmdict_edict_list/2021/msg00083.html

Kimeiga commented 6 months ago

Another thought: to your point on #37, jreibun might come out soon and be a better source of sentences than the Tanaka Corpus anyway, although I'm not sure when it will be released.

stephenmk commented 6 months ago

> Hi Stephen! Amazing work, thank you for contributing to the world's knowledge!

Thanks, I'm always glad to hear that people like the project.

> If you look up the dictionary entry for 掛ける in Jitendex, there are many examples of sentences from Tanaka Corpus being assigned to the wrong sense.

Yes, these errors are very common. I have probably fixed a couple hundred of them over the past year.

> I have an account with Tatoeba, but I'm afraid I don't know how to edit the sentences

Tatoeba has a very primitive GUI for editing the links to JMdict entries. It is technically open to the public to use, but it is extremely user-unfriendly and difficult to use correctly.

Feel free to let me know when you spot these errors and I'll go fix them. A couple of other users have also been reporting these errors to me in the discussion forum.

> After some research, I'm not sure, but I suspect this happened because entries in JMdict have been removed and others added, which may have caused a string of off-by-one errors over time that shifted these groups of example sentences around.

That is indeed a common reason for the errors. Whenever entries in JMdict are edited, the editors need to remember to update the sentence links as well. We try to keep this in mind, but sometimes we forget. I recently suggested that some of this sentence information should be displayed in the JMdict database editor to make it easier to remember, but this is a volunteer project and things don't always move quickly.
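As a rough illustration of that failure mode, the sketch below assumes a simplified model in which each sentence link stores only a 1-based sense number. The glosses, the link names, and the helper functions are made up for the example and do not reflect the real JMdict or Tatoeba data formats.

```python
# Illustrative model only: each sentence link stores a 1-based sense number,
# and the glosses below are simplified stand-ins, not real JMdict data.

def resolve(senses, sense_number):
    """Return the gloss a 1-based sense number currently points at."""
    return senses[sense_number - 1]

def renumber_after_insert(links, inserted_at):
    """Shift every link at or after the insertion point up by one."""
    return {
        sentence: n + 1 if n >= inserted_at else n
        for sentence, n in links.items()
    }

senses_before = ["to hang", "to sit down", "to multiply"]
links = {"multiplication sentence": 3, "sitting sentence": 2}

# An editor inserts a new sense at position 2 but leaves the links alone.
senses_after = ["to hang", "to spend (time)", "to sit down", "to multiply"]

# Stale links now point one sense too early.
stale = {s: resolve(senses_after, n) for s, n in links.items()}

# Renumbering the links alongside the edit keeps them on the intended senses.
fixed_links = renumber_after_insert(links, inserted_at=2)
fixed = {s: resolve(senses_after, n) for s, n in fixed_links.items()}

print(stale)
print(fixed)
```

Under this model, every sentence attached to a sense at or after the insertion point drifts together, which would be consistent with whole groups of sentences appearing under the wrong sense.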

> Another thought: to your point on #37, jreibun might come out soon and be a better source of sentences than the Tanaka Corpus anyway, although I'm not sure when it will be released.

It's been almost a year since the last public update from that project, so I'm not sure how soon that will be. Fingers crossed.