moreymat / omw-graph

The Open Multilingual Wordnet in a graph database
MIT License

Import the original Wordnets #2

Open moreymat opened 10 years ago

moreymat commented 10 years ago

Each original Wordnet, for example the WOLF (Wordnet Libre du Français), contains its own language-specific structure. This structure is very valuable information that we want to import into the graph database.

As each Wordnet is distributed in its own format, we need one import function per Wordnet. The OMW team had the same need. They provide one script per Wordnet that retrieves the aligned data from the original files.

The idea is to transform each of the OMW import scripts into a function, expand each function to import more information (including structure), and wrap all the functions in a module.
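A rough sketch of what that module could look like (the function names and the `(synset, lemma, relations)` record shape below are hypothetical, not the actual OMW script names):

```python
# omw_import.py -- hypothetical layout: one import function per Wordnet.
# Each importer parses the original distribution and yields
# (synset, lemma, relations) records ready for insertion into the graph.

def import_wolf(path):
    """Parse the WOLF distribution and yield records
    (to be adapted from the OMW fre2tab script)."""
    raise NotImplementedError

def import_thai(path):
    """Parse tha-wn-1.0-lmf.xml and yield records
    (to be adapted from the OMW Thai script)."""
    raise NotImplementedError

# Registry mapping ISO 639-3 language codes to import functions.
IMPORTERS = {
    "fra": import_wolf,
    "tha": import_thai,
}

def import_wordnet(lang, path):
    """Dispatch to the language-specific importer."""
    return IMPORTERS[lang](path)
```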

moreymat commented 10 years ago

A first step would be to try this on the WOLF and see how it goes.

rhin0cer0s commented 10 years ago

So do we have to build another form of .tab file, which would go something like this?

`ID-TYPE \t LEMMA \t word \t synonym#synonym#... \t hyponym#hyponym#... \t etc.`

And our parser should be able to read it.

Or do we do it in several steps?
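For illustration, a first-pass parser for one line of that hypothetical format could look like this (the field meanings are only assumed from the example above):

```python
def parse_extended_tab_line(line):
    """Parse one line of the proposed extended .tab format:
    ID-TYPE \t LEMMA \t word \t synonym#... \t hyponym#... \t ...
    '#'-separated fields become lists; field meanings are assumed."""
    fields = line.rstrip("\n").split("\t")

    def split_multi(i):
        # optional trailing fields may be absent or empty
        return fields[i].split("#") if len(fields) > i and fields[i] else []

    return {
        "id_type": fields[0],
        "lemma": fields[1],
        "word": fields[2],
        "synonyms": split_multi(3),
        "hyponyms": split_multi(4),
    }
```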

moreymat commented 10 years ago

I am sorry I was not very clear about this issue in the first place.

TL;DR: this issue belongs to a future milestone.

Long version: We can build multilingual graphs in three different ways:

  1. get the nodes from OMW (tab files) and the edges from the Princeton Wordnet (via nltk),
  2. get the nodes from OMW and the edges from the original Wordnets,
  3. get the nodes and structure from the original Wordnets.

Solution (1) is our top priority at the moment: it should be quite cheap and lets us look at the data quickly.

This issue is about solutions (2) and (3), which are the next steps. These solutions are more costly but they produce a much richer graph than solution (1). The extra cost comes from having to parse the original Wordnet files. The scripts provided on the OMW page for each language already do that: they parse the original files, extract the lemmas, do some cleaning to ensure compatibility and output the result to .tab files. The idea for (2) and (3) is to expand these scripts to retrieve more information than the mere lemma (e.g. relations), do some cleaning and output the result to the db.
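For instance, reading the nodes back from one of those .tab files is only a few lines (a minimal sketch, assuming the first column holds the synset id and the last column the lemma; the exact layout varies per wordnet):

```python
import codecs

def read_tab(path):
    """Yield (synset, lemma) pairs from an OMW-style .tab file,
    skipping the '# name\tlang\turl\tlicense' header line."""
    with codecs.open(path, "r", "utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            parts = line.rstrip("\n").split("\t")
            yield parts[0], parts[-1]
```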

I will set up milestones to make the roadmap clearer :)

fcbond commented 10 years ago

G'day,

For (2) and (3): in the OMW we are strongly encouraging wordnet projects to output wordnet-LMF, so that we can have just one parser to input them all. Also, for some of the files I had to do some hand-cleaning, as the original files were not easy to parse.

In practice there will still be some issues: (i) dormant projects (like Hebrew and Albanian), for which we can make the LMF ourselves (I already do); (ii) there is more than one schema in use for wordnet-LMF; I hope we can encourage standardization by showing the benefit.

Francis

moreymat commented 10 years ago

G'day @fcbond

We could produce wordnet-LMF as part of the conversion+import process, if:

  1. using, adapting or writing conversion scripts is not too much work,
  2. the wordnet-LMF files can be efficiently processed for batch insertion into the database.

Could you provide conversion scripts to test this approach on one or two wordnets? I would gladly accept any pull request :-)

Mathieu

fcbond commented 10 years ago

G'day,

> We could produce wordnet-LMF as part of the conversion+import process, if:
> 1. using, adapting or writing conversion scripts is not too much work,

I attach the script I currently use to output LMF :-).

> 2. the wordnet-LMF files can be efficiently processed for batch insertion into the database.

> Could you provide conversion scripts to test this approach on one or two wordnets? I would gladly accept any pull request :-)

The Thai input is currently done from LMF, although not very generally:

```python
#!/usr/share/python
# -*- encoding: utf-8 -*-
#
# Extract synset-word pairs from the Thai Wordnet
#
import sys
import codecs
import re

wnname = "Thai"
wnlang = "tha"
wnurl = "http://th.asianwordnet.org/"
wnlicense = "wordnet"

#
# header
#
outfile = "wn-data-%s.tab" % wnlang
o = codecs.open(outfile, "w", "utf-8")

o.write("# %s\t%s\t%s\t%s \n" % (wnname, wnlang, wnurl, wnlicense))

#
# Data is in the file tha-wn-1.0-lmf.xml;
# exploit the fact that the synset id is the same as the wn3.0 offset.
#
f = codecs.open("tha-wn-1.0-lmf.xml", "r", "utf-8")

synset = str()
lemma = str()
for l in f:
    # remember the written form of the current lexical entry
    m = re.search(r"<Lemma writtenForm=\"([^\"]*)\" part", l)
    if m:
        lemma = m.group(1).strip()
    # on a sense line, pair the synset offset with that lemma
    m = re.search(r"synset=\"tha-07-(.*)\"", l)
    if m:
        synset = m.group(1)
        o.write("%s\t%s\n" % (synset, lemma))
        print "%s\t%s\n" % (synset, lemma)
```


Sorry to mail it, I am not so used to git yet :-)

Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

moreymat commented 10 years ago

G'day,

Thanks for the information. There is no attached file though :-)

The students have started to include relations from WOLF, and have accordingly included and extended fre2tab. Is it okay to have this extended version of your script in our GitHub project? If so, how do you want your authorship to be acknowledged? Options include adding you to the list of authors of each extended script, adding you to the global list of contributors to the project, an explicit mention of OMW as the basis, any combination or variant of these, or anything else you see fit.

Have a nice Sunday,

Mathieu


fcbond commented 10 years ago

G'day,

> Thanks for the information. There is no attached file though :-)

> The students have started to include relations from WOLF, and have accordingly included and extended fre2tab. Is it okay to have this extended version of your script in our GitHub project? If so, how do you want your authorship to be acknowledged? Options include adding you to the list of authors of each extended script, adding you to the global list of contributors to the project, an explicit mention of OMW as the basis, any combination or variant of these, or anything else you see fit.

Please add me to the global list of contributors to the project. You already link to the OMW page in the README, which is enough.

> Have a nice Sunday,

You too.


Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

moreymat commented 10 years ago

G'day Francis,

Adrien and Christophe noticed that you were now distributing XML files (in the LMF and lemon formats) on the OMW website at NTU.

Could you tell us how the LMF, lemon and tab files you provide compare content-wise? From a quick look at the English files, the LMF file contains relations between synsets, whereas the lemon file does not.

Is one of these formats (we are thinking of LMF) complete and mature enough for us to use your files as our only source of information to build the whole graph? FWIW, we could even stop depending on NLTK.

Mathieu

fcbond commented 10 years ago

G'day,

> Adrien and Christophe noticed that you were now distributing XML files (in the LMF and lemon formats) on the OMW website at NTU.

> Could you tell us how the LMF, lemon and tab files you provide compare content-wise? From a quick look at the English files, the LMF file contains relations between synsets, whereas the lemon file does not.

That's right. LEMON is just the TAB files in very verbose XML :-). The assumption is that the ontology (wordnet) is separate. LMF should be complete (although I don't guarantee it).

> Is one of these formats (we are thinking of LMF) complete and mature enough for us to use your files as our only source of information to build the whole graph?

In theory LMF should be; in practice I generally add information to the database first, and then to the LMF.

> FWIW, we could even stop depending on NLTK.

I think it is worth trying with LMF, which we hope will be the format of the future.

Wordnet-LMF (and LEMON) don't have anywhere to record frequency counts (the idea is that they are from a corpus), although in practice they are useful :-).

Wait just a little though, as I seem to have lost the English and Japanese definitions in my move to svn (although we gained Greek).

Yours,

Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

moreymat commented 10 years ago

@fcbond thanks a lot. It seems we can give it a try.

@zorgulle @rhin0cer0s could you provide a rough estimate of how much work it would be to use the LMF XML files instead? If it is reasonable, we might try to do this before 0.1.
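To make the estimate concrete, something along these lines could be a starting point (a minimal sketch; the element and attribute names are my assumptions from the wordnet-LMF schema and must be checked against the actual OMW files):

```python
import xml.etree.ElementTree as ET

def parse_wnlmf(path):
    """Extract (synset, lemma) pairs and (synset, reltype, target)
    triples from a wordnet-LMF file. Element and attribute names
    (LexicalEntry, Lemma/writtenForm, Sense/synset,
    SynsetRelation/relType+target) are assumptions to verify."""
    root = ET.parse(path).getroot()
    lemmas, relations = [], []
    for entry in root.iter("LexicalEntry"):
        lemma_el = entry.find("Lemma")
        if lemma_el is None:
            continue
        written = lemma_el.get("writtenForm")
        for sense in entry.iter("Sense"):
            lemmas.append((sense.get("synset"), written))
    for synset in root.iter("Synset"):
        sid = synset.get("id")
        for rel in synset.iter("SynsetRelation"):
            relations.append((sid, rel.get("relType"), rel.get("target")))
    return lemmas, relations
```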

fcbond commented 10 years ago

G'day,

I have (finally) restored the English definitions and examples, so it should be good to go. There are also definitions for Albanian, Greek and Japanese :-).


Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

moreymat commented 10 years ago

G'day Francis,

This is great, thank you! I just noticed that the OMW-LMF files provide exactly the information we wanted for milestone 0.1: aligned lemmas plus relations from the Princeton Wordnet. We will still have to scrape the original wordnets to retrieve their own structures for milestone 0.3, unless you plan to do that as well? :-)

fcbond commented 10 years ago

G'day,

> We will still have to scrape the original wordnets to retrieve their own structures for milestone 0.3, unless you plan to do that as well? :-)

Not in the very near future. The next priority for me is adding confidence scores (manually verified or not) and corpus frequencies.

Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

rhin0cer0s commented 10 years ago

Hi @fcbond and thank you for your involvement!

@moreymat We built a little parser over the weekend (we still have to fix some things before pushing it), so the LMF 'support' is nearly done.

moreymat commented 10 years ago

Splendid! It would be great if we could release 0.1 this week.


zorgulle commented 10 years ago

Hello, the LMF parser works: we can import words and relations. We tested it with English and French. We still have the index key length issue; we are working on this problem and it should be solved soon.

moreymat commented 10 years ago

OK great, I am looking forward to this.