omwn / omw-data

This packages up data for the Open Multilingual Wordnet

PWN: Split gloss lines into definition and examples #10

Closed: goodmami closed this issue 3 years ago

goodmami commented 3 years ago

Unlike the Open English WordNet, the LMF versions of the PWN 3.0 and 3.1 packaged here do not split the examples and definition from the WNDB gloss line. Furthermore, they over-escape some things, such as quote characters in element text:

<Definition >in a sidearm manner; &quot;he prefers to throw sidearm&quot;</Definition>

Splitting the definition and examples from the WNDB gloss line is not trivial as there are only conventions about how to delimit them and there is plenty of variation, including typos and other problems. Please see this issue in the NLTK for a discussion of the problem space: https://github.com/nltk/nltk/issues/2527
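To illustrate the kind of heuristic involved, here is a minimal, hypothetical Python sketch (not what OMW or the NLTK actually does). It assumes the common convention that examples are the trailing, semicolon-separated, double-quoted segments of the gloss; the real data deviates from this often enough that the output would still need review.

def split_gloss(gloss):
    """Naively split a WNDB gloss into (definition, examples).

    Assumes examples are the trailing, semicolon-separated,
    double-quoted segments; everything before them is the definition.
    Glosses with unquoted examples, attributions after the closing
    quote, or quotes inside the definition will not split cleanly.
    """
    parts = [p.strip() for p in gloss.split(';')]
    examples = []
    # peel quoted segments off the end of the gloss
    while parts and parts[-1].startswith('"'):
        examples.insert(0, parts.pop().strip('"'))
    definition = '; '.join(parts)
    return definition, examples

# split_gloss('in a sidearm manner; "he prefers to throw sidearm"')
# -> ('in a sidearm manner', ['he prefers to throw sidearm'])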

For OMW we could do a bit better than the current situation, but we should leave actual corrections in the data to the OEWN. That is, while splitting the examples from definitions helps us model the data better in LMF, we should not change the data.
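For comparison, once the examples are split out, the gloss above would be modelled in WN-LMF with separate Definition and Example elements, roughly (synset id and other attributes omitted):

<Definition>in a sidearm manner</Definition>
<Example>he prefers to throw sidearm</Example>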

fcbond commented 3 years ago

Hi,

I have a collection of definitions and examples for PWN 3.0 where the splitting has been fixed by hand (back at NTT!). However, we have also made other (minor) corrections of typos, such as 'wound' for 'mound' and so forth. I don't see much point in spending time fixing one and not the other.

They are here: http://compling.hss.ntu.edu.sg/wnja/data/1.1/wnjpn-def.tab.gz http://compling.hss.ntu.edu.sg/wnja/data/1.1/wnjpn-exe.tab.gz

The first release should have fewer fixes: http://compling.hss.ntu.edu.sg/wnja/data/1.0/wnjpn-def.tab.gz http://compling.hss.ntu.edu.sg/wnja/data/1.0/wnjpn-exe.tab.gz

I am not sure if you can see these outside of NTU; if you can't, I can send them to you (or move them somewhere else).

What do you think about using these?

Yours,


goodmami commented 3 years ago

I can see the files outside of NTU.

I'm not sure if it's best to use them, however. For a better, less buggy English wordnet, there's the OEWN, which presumably has incorporated these fixes already. I kinda think our wn30 and wn31 distributions should be as close to the original databases as possible, preserved as a historical record. That said, if it were just corrected definition/example splits, that might be useful, as that's how LMF models the data, and I'm currently looking at how to do this automatically.