Closed zorgulle closed 10 years ago
I just finished it. It is a bit dirty, but currently only the functions supporting English data have been written, so I could only work on those.
Edit: it seems to make the batch importer skip some relations; I didn't have this warning before. Something more to look at...
@zorgulle could you make the write operation overwrite existing files? It is awful to run rm each time I test :(
@rhin0cer0s are we done with this issue?
No, I've got some duplicates now and import is impossible for English. It seems fine for French and Albanian.
OK, please ping me when you think the problem is solved.
Issue fixed.
FYI: duplicates were created because in the index I replaced "_" by "", so words that have two spellings collapsed into the same key.
e.g.:
09862845-n#musclebuilder#eng musclebuilder
09862845-n#musclebuilder#eng muscle builder
I left the replace function in, but now it is "" to ""; if we still need it later we will just change one char.
We kept the synset#word#lng syntax with the new LMF parser, but it is ugly:
We were thinking about building it around the sense id + language:
This id is supposed to be unique (since it is an XML id); with lng added, it is unique across languages.
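A minimal sketch of how the collision arises, assuming the index key is built as synset#word#lng and that spaces in a written form are stored as underscores. `index_key` and its `strip_underscores` flag are hypothetical names for illustration, not the project's actual code:

```python
def index_key(synset, written_form, lang, strip_underscores):
    """Build a synset#word#lng key (hypothetical helper, for illustration)."""
    word = written_form.replace(" ", "_")
    if strip_underscores:
        # The faulty normalisation: "_" replaced by "".
        word = word.replace("_", "")
    return f"{synset}#{word}#{lang}"


forms = ["musclebuilder", "muscle builder"]

# With the replacement, both written forms collapse to the same key.
collapsed = {index_key("09862845-n", f, "eng", True) for f in forms}

# Without it, each written form keeps its own key.
distinct = {index_key("09862845-n", f, "eng", False) for f in forms}

print(len(collapsed), len(distinct))
```

Keeping the underscore in the key is what prevents the two spellings from being indexed as one entry.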
I think we could benefit more from the graph structure with:
fra-10-05976065-n as key (as in the LMF files)
fra-10-w487890 as key (following the "language-version-XXX" pattern used in the LMF files for synsets), with lemma (written form and part of speech) as attribute
What do you think?
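One possible reading of this proposal, sketched as plain Python data structures. It assumes the first key identifies a synset node and the second a word node; the class names, the placeholder lemma values and the `sense_edges` set are all hypothetical, not taken from the project:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetNode:
    key: str  # LMF synset id, e.g. "fra-10-05976065-n"


@dataclass(frozen=True)
class WordNode:
    key: str             # "language-version-XXX" id, e.g. "fra-10-w487890"
    written_form: str    # lemma attribute
    part_of_speech: str  # lemma attribute


synset = SynsetNode("fra-10-05976065-n")
# The written form and part of speech below are placeholders, not real data.
word = WordNode("fra-10-w487890", written_form="<written form>", part_of_speech="n")

nodes = {synset.key: synset, word.key: word}
# One sense edge per (word, synset) pair: a word that belongs to several
# synsets is stored once and simply gets more edges.
sense_edges = {(word.key, synset.key)}
```

Compared with one synset#word#lng node per sense, this stores each word and each synset only once and moves the pairing into the edges.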
That's what I tried to explain during our last meeting ! I find it great because it will reduce our relation number. We will try to produce it.
Sorry for not getting your point back then!
It works, but there are some problems with the synsets. Some synsets used in relation declarations are not declared in any lexical entry, so some relations cannot be built in neo4j because there is no node with the right synset.
But we can find them in the SenseAxis declarations, so we can find the English equivalent, and we know its place thanks to the synset relations.
Could these be the translation holes we are looking for?
@rhin0cer0s : Yes, some of the missing synsets should be translation holes.
@fcbond : Does each LMF file contain all synsets and relations from PWN, or do you filter to keep a resource-dependent subset of synsets and relations that covers the resource (e.g. with a connected graph) ?
G'day,
We have all links, but only senses, definitions and examples from that language.
So some nodes have no information associated with them.
I am hoping that you will produce resource-dependent subsets for me.
Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University
@rhin0cer0s : can you provide an example here ?
@fcbond : let us open an issue on "producing resource-dependent subsets of relations". @rhin0cer0s @zorgulle do you think we could include this issue for milestone 0.2 ?
We built the list of orphan synsets (151.378 of them). We are even able to omit them if we want a clean import (they were just skipped before).
For example, take fra-10-11820323-n. If you grep wn-fra-lmf.xml for it:
<Synset id='fra-10-11820323-n' baseConcept='3'>
<SynsetRelation targets='fra-10-11820323-n' relType='hypo'/>
<Target ID='fra-10-11820323-n'/>
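A minimal sketch of how such an orphan list could be computed, assuming each lexical entry links to its synset through a `<Sense synset='...'>` element (an assumption about the LMF layout; only `Synset`, `SynsetRelation` and `Target` appear in the grep output above). The sample document, the id fra-10-00000001-n and the function name are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample; only fra-10-11820323-n comes from the real file.
SAMPLE = """
<LexicalResource>
  <Lexicon>
    <LexicalEntry id='w1'>
      <Lemma writtenForm='example' partOfSpeech='n'/>
      <Sense id='w1_s1' synset='fra-10-00000001-n'/>
    </LexicalEntry>
    <Synset id='fra-10-00000001-n'>
      <SynsetRelations>
        <SynsetRelation targets='fra-10-11820323-n' relType='hypo'/>
      </SynsetRelations>
    </Synset>
    <Synset id='fra-10-11820323-n' baseConcept='3'/>
  </Lexicon>
</LexicalResource>
""".strip()


def orphan_synsets(root):
    """Return synset ids used in relations but attached to no lexical entry."""
    with_entry = {sense.get("synset") for sense in root.iter("Sense")}
    referenced = set()
    for synset in root.iter("Synset"):
        relations = list(synset.iter("SynsetRelation"))
        if relations:
            # The source side of a relation also needs a node.
            referenced.add(synset.get("id"))
        referenced.update(rel.get("targets") for rel in relations)
    return referenced - with_entry


root = ET.fromstring(SAMPLE)
print(sorted(orphan_synsets(root)))
```

The returned set can either be skipped for a clean import or turned into empty synset nodes so that every relation has both endpoints.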
Regarding the milestone, I don't know how to handle it. Today we can build a db including different languages (we tested on fra, alb and jpn; the English LMF doesn't seem to be released).
@rhin0cer0s The English LMF is on the OMW site (under "Princeton WordNet").
Issues #13 and #14 for milestones 0.2 and 0.3 will follow up on this.
I think we can close this issue for now.
In order to avoid redundancy between different languages, we could add the language reference to the index key, with a separator character.
e.g.: 00000000-n_rain_eng
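A minimal sketch of such a key, assuming "_" as the separator and that neither the synset id nor the language code contains it. `make_key` and `split_key` are hypothetical helper names:

```python
def make_key(synset, word, lang, sep="_"):
    """Join synset id, word and language code into one index key."""
    return sep.join((synset, word, lang))


def split_key(key, sep="_"):
    """Recover (synset, word, lang); the word itself may contain the separator."""
    synset, rest = key.split(sep, 1)
    word, lang = rest.rsplit(sep, 1)
    return synset, word, lang


key = make_key("00000000-n", "rain", "eng")
print(key)
print(split_key(make_key("09862845-n", "muscle_builder", "eng")))
```

Splitting from both ends keeps multi-word lemmas such as muscle_builder intact, although a separator that never occurs in lemmas would make the key easier to parse.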