moreymat opened this issue 10 years ago
A first step would be to try and do this on the WOLF and see how it goes.
So would we have to build another form of .tab file, which would likely look like this?
ID-TYPE \t LEMMA \t word \t synonym#synonym#... \t hyponym#hyponym#... \t etc.
And our parser should be able to read it.
Or do we do it in several steps?
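For concreteness, a minimal sketch of what parsing one line of the format proposed above could look like (the column layout and the `#` separator are assumptions taken from the proposal, not an existing spec):

```python
# Hypothetical reader for the proposed extended .tab line format:
# ID-TYPE \t LEMMA \t word \t synonym#... \t hyponym#... \t ...
def parse_extended_tab_line(line):
    fields = line.rstrip("\n").split("\t")
    id_type, lemma, word = fields[:3]
    synonyms = fields[3].split("#") if len(fields) > 3 and fields[3] else []
    hyponyms = fields[4].split("#") if len(fields) > 4 and fields[4] else []
    return {"id": id_type, "lemma": lemma, "word": word,
            "synonyms": synonyms, "hyponyms": hyponyms}

print(parse_extended_tab_line("00001740-n\tentity\tentity\tbeing#thing\t00001930-n"))
```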
I am sorry I was not very clear about this issue in the first place.
TL;DR: this issue belongs to a future milestone.
Long version: We can build multilingual graphs in three different ways:
Solution (1) is our top priority at the moment: it should be quite cheap and it lets us look into the data quickly.
This issue is about solutions (2) and (3), which are the next steps. These solutions are more costly but they produce a much richer graph than solution (1). The extra cost comes from having to parse the original Wordnet files. The scripts provided on the OMW page for each language already do that: they parse the original files, extract the lemmas, do some cleaning to ensure compatibility and output the result to .tab files. The idea for (2) and (3) is to expand these scripts to retrieve more information than the mere lemma (e.g. relations), do some cleaning and output the result to the db.
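To make that expansion concrete, here is a toy, self-contained sketch of the kind of record an extended script could emit; the record shape and field names are illustrative assumptions, not the OMW scripts' actual output:

```python
# Toy sketch: enrich bare (synset, lemma) pairs with per-synset
# relations, so the converter outputs more than the mere lemma.
# The record shape is an illustrative assumption.
def extended_records(pairs, relations):
    for synset_id, lemma in pairs:
        yield {
            "synset": synset_id,
            "lemma": lemma,
            "relations": relations.get(synset_id, []),
        }

# Made-up data for illustration:
pairs = [("00001740-n", "entity")]
relations = {"00001740-n": [("hyponym", "00001930-n")]}
for record in extended_records(pairs, relations):
    print(record)
```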
I will set up milestones to make the roadmap clearer :)
G'day,
For (2) and (3): in the OMW we are strongly encouraging wordnet projects to output wordnet-LMF, so that we can have just one parser to read it. Also, for some of the files I had to do some hand-cleaning, as the original files were not easy to parse.
In practice there will still be some issues: (i) dormant projects (like Hebrew and Albanian), for which we can make the LMF ourselves (I already do); (ii) more than one schema is in use for wordnet-LMF, but I hope we can encourage standardization by showing the benefits.
Francis
G'day @fcbond
We could produce wordnet-LMF as part of the conversion+import process, if:

- using, adapting or writing conversion scripts is not too much work,
- the wordnet-LMF files can be efficiently processed for batch insertion into the database.

Could you provide conversion scripts to test this approach on one or two wordnets? I would gladly accept any pull request :-)
Mathieu
G'day,

> We could produce wordnet-LMF as part of the conversion+import process, if:
> - using, adapting or writing conversion scripts is not too much work,

I attach the script I currently use to output LMF :-).

> - the wordnet-LMF files can be efficiently processed for batch insertion into the database.
>
> Could you provide conversion scripts to test this approach on one or two wordnets? I would gladly accept any pull request :-)

The Thai input is currently done from LMF, although not very generally:

    #!/usr/share/python
    # -*- encoding: utf-8 -*-
    #
    # Extract synset-word pairs from the Thai Wordnet
    #
    import sys
    import codecs
    import re

    wnname = "Thai"
    wnlang = "tha"
    wnurl = "http://th.asianwordnet.org/"
    wnlicense = "wordnet"

    #
    # header
    #
    outfile = "wn-data-%s.tab" % wnlang
    o = codecs.open(outfile, "w", "utf-8")
    o.write("# %s\t%s\t%s\t%s \n" % (wnname, wnlang, wnurl, wnlicense))

    #
    # Data is in the file tha-wn-1.0-lmf.xml;
    # exploit the fact that the synset is the same as the wn3.0 offset
    #
    f = codecs.open("tha-wn-1.0-lmf.xml", "r", "utf-8")

    synset = str()
    lemma = str()
    for l in f:
        m = re.search(r"<Lemma writtenForm=\"([^\"]*)\" part", l)
        if m:
            lemma = m.group(1).strip()
        m = re.search(r"synset=\"tha-07-(.*)\"", l)
        if m:
            synset = m.group(1)
            o.write("%s\t%s\n" % (synset, lemma))
            # print "%s\t%s\n" % (synset, lemma)  # debug output
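
Since the script above matches the XML line by line with regexes, a slightly more general sketch could use Python's standard xml.etree.ElementTree instead; the element and attribute names (LexicalEntry, Lemma, Sense, writtenForm, synset) are inferred from the regex patterns above and may differ in other LMF schemas:

```python
# More general sketch: parse the LMF file with a real XML parser
# instead of line-by-line regexes. Element and attribute names are
# assumptions inferred from the regex-based script above.
import codecs
import xml.etree.ElementTree as ET

root = ET.parse("tha-wn-1.0-lmf.xml").getroot()
with codecs.open("wn-data-tha.tab", "w", "utf-8") as o:
    for entry in root.iter("LexicalEntry"):
        lemma_el = entry.find("Lemma")
        if lemma_el is None:
            continue
        lemma = lemma_el.get("writtenForm", "").strip()
        for sense in entry.iter("Sense"):
            synset = sense.get("synset", "")
            # drop the language prefix, e.g. "tha-07-00001740-n"
            if synset.startswith("tha-07-"):
                synset = synset[len("tha-07-"):]
            o.write("%s\t%s\n" % (synset, lemma))
```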
Sorry to mail it, I am not so used to git yet :-)
Francis Bond
http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University
G'day,
Thanks for the information. There is no attached file though :-)
The students have started to include relations from the WOLF, and have therefore included and extended fre2tab. Is it okay to have this extended version of your script in our GitHub project? If so, how do you want your authorship to be acknowledged? Options include: adding you to the list of authors of each extended script, adding you to the global list of contributors to the project, explicitly mentioning OMW as the basis, any combination or variant of these, or anything else you see fit.
Have a nice Sunday,
Mathieu
G'day,
> Thanks for the information. There is no attached file though :-)
>
> [...] Is it okay to have this extended version of your script in our GitHub project? If so, how do you want your authorship to be acknowledged?
Please add me to the global list of contributors to the project. You already link to the OMW page in the Readme, which is enough.
> Have a nice Sunday,
You too.
Francis
G'day Francis,
Adrien and Christophe noticed that you were now distributing XML files (in the LMF and lemon formats) on the OMW website at NTU.
Could you tell us how the LMF, lemon and tab files you provide compare content-wise? From a quick look at the English files, the LMF file contains relations between synsets, whereas the lemon file does not.

Is one of these formats (we are thinking of LMF) complete and mature enough for us to use your files as our only source of information to build the whole graph? FWIW, we could even stop depending on NLTK.
Mathieu
G'day,

> Could you tell us how the LMF, lemon and tab files you provide compare content-wise? From a quick look at the English files, the LMF file contains relations between synsets, whereas the lemon file does not.
That's right. LEMON is just the TAB files in very verbose XML :-). The assumption is that the ontology (wordnet) is separate. LMF should be complete (although I don't guarantee it).
> Is one of these formats (we are thinking of LMF) complete and mature enough for us to use your files as our only source of information to build the whole graph?
In theory LMF should be; in practice I generally add information to the database first, and then to the LMF.
> FWIW, we could even stop depending on NLTK.
I think it is worth trying with LMF, which we hope to be the format of the future.
Wordnet-LMF (and LEMON) have nowhere to record frequency counts (the idea is that those come from a corpus), although in practice they are useful :-).
Wait just a little though, as I seem to have lost the English and Japanese definitions in my move to svn (although we gained Greek).
Yours,
Francis
@fcbond thanks a lot. It seems we can give it a try.
@zorgulle @rhin0cer0s could you provide a rough estimate of how much work it would be to use the LMF XML files instead? If it is reasonable, we might try and do this before 0.1.
G'day,
I have (finally) restored the English definitions and examples, so it should be good to go. There are also definitions for Albanian, Greek and Japanese :-).
Francis
G'day Francis,
This is great, thank you! I just noticed the OMW-LMF files provide exactly the information we wanted for milestone 0.1: aligned lemmas + relations from Princeton Wordnet. We will still have to scrape the original wordnets to retrieve their own structures for milestone 0.3, unless you plan to do that as well? :-)
G'day,
> We will still have to scrape the original wordnets to retrieve their own structures for milestone 0.3, unless you plan to do that as well? :-)
Not in the very near future. The next priority for me is adding confidence scores (manually verified or not) and corpus frequencies.
Francis
Hi @fcbond, and thank you for your involvement!
@moreymat We built a little parser over the weekend (we still have to fix some things before pushing it), so the LMF 'support' is nearly done.
Splendid! It would be great if we could release 0.1 this week.
Hello, the LMF parser works: we can import words and relations. We tested it with English and French. We still have the index key length issue and are working on it; it should be solved soon.
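On the batch-insertion side, a minimal, self-contained sketch of the usual pattern (sqlite3 and this tiny schema are illustrative stand-ins; the project's actual database and schema are not shown here):

```python
# Batch insertion sketch: executemany inside a single transaction
# is far faster than committing row by row. sqlite3 and the schema
# are illustrative assumptions only.
import sqlite3

conn = sqlite3.connect("wordnet.db")
conn.execute("CREATE TABLE IF NOT EXISTS senses (synset TEXT, lemma TEXT, lang TEXT)")

# Records as an LMF parser might yield them (toy data):
records = [("00001740-n", "entity", "eng"),
           ("00001740-n", "entité", "fra")]

conn.executemany("INSERT INTO senses VALUES (?, ?, ?)", records)
conn.commit()
conn.close()
```

If the index key length issue comes from indexing long lemma strings, a common workaround (depending on the database) is to index a fixed-length prefix or a hash of the column instead.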
OK great, I am looking forward to this.
Each original Wordnet, for example the WOLF (Wordnet Libre du Français), contains its own language-specific structure. This structure is very valuable information that we want to import into the graph database.
As each Wordnet is distributed in its own format, we need one import function per Wordnet. The OMW team had the same need. They provide one script per Wordnet that retrieves the aligned data from the original files.
The idea is to transform each of the OMW import scripts into a function, expand each function to import more information (including structure) and wrap all functions in a module.
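A sketch of what such a module could look like; every name here is hypothetical, since the actual OMW scripts are standalone files that would need adapting:

```python
# Hypothetical converters module: one import function per Wordnet,
# registered in a dispatch table. All names are illustrative.
def import_wolf(path):
    """Parse the WOLF files; yield (synset, lemma, relations) tuples."""
    raise NotImplementedError

def import_thai_lmf(path):
    """Parse the Thai LMF file; yield (synset, lemma, relations) tuples."""
    raise NotImplementedError

CONVERTERS = {
    "fra": import_wolf,
    "tha": import_thai_lmf,
}

def import_wordnet(lang, path):
    """Dispatch to the converter registered for a language code."""
    return CONVERTERS[lang](path)
```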