Add language to index key

zorgulle commented 10 years ago

To avoid redundancy between languages, we could add the language code to the index key, after a separator character.

e.g.: 00000000-n_rain_eng

rhin0cer0s commented 10 years ago

I just finished it. It is a quick and dirty job, and since only the functions supporting English data have been written so far, I could only work on those.

Edit: it seems to make the batch importer skip some relations; I didn't get this warning before. Something more to look at...

@zorgulle could you make the write operation overwrite existing files? It is awful to run rm each time I test :(

moreymat commented 10 years ago

@rhin0cer0s are we done with this issue?

rhin0cer0s commented 10 years ago

No, I now get some duplicates and the import is impossible for English. It seems fine for French and Albanian.

moreymat commented 10 years ago

OK, please ping me when you think the problem is solved.

rhin0cer0s commented 10 years ago

Issue fixed.

FYI: the duplicates were created because, in the index, I replaced "_" with "", so words that have two written forms ended up with the same key.

e.g.:

09862845-n#musclebuilder#eng    musclebuilder
09862845-n#musclebuilder#eng    muscle builder

I left the replace call in, but it now replaces "" with "", so if we need it again later we will only have to change one character.
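A minimal sketch of the collision described above, assuming the key is built as synset#lemma#lng and that the lemma is normalised with a plain string replace. The helper name `make_key` and the underscore form `muscle_builder` are assumptions for illustration, not the project's actual code or data.

```python
def make_key(synset, lemma, lang, old="_", new=""):
    """Build a synset#lemma#lng index key (hypothetical helper)."""
    return "#".join((synset, lemma.replace(old, new), lang))


# Two written forms of the same word (the underscore form is assumed).
lemmas = ["musclebuilder", "muscle_builder"]

# Stripping "_" maps both written forms to one key: a duplicate.
stripped = {make_key("09862845-n", lemma, "eng") for lemma in lemmas}

# A replace that changes nothing keeps the two keys distinct.
untouched = {make_key("09862845-n", lemma, "eng", old="_", new="_")
             for lemma in lemmas}
```

With the stripping variant both lemmas collapse into `09862845-n#musclebuilder#eng`, which is the duplicate the importer complained about.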

rhin0cer0s commented 10 years ago

We kept the synset#word#lng syntax with the new LMF parser, but it is ugly:

01840412-n#pic à bec ivoire|fy=ivoarsnaffelspjocht|nl=ivoorsnavelspecht#fra - yes this is just an index
02050442-n#grèbe jougris|ko=큰논병아리|nl=roodhalsfuut#fra - yay korean !

We were thinking about building it around the sense id + language :

w460232#01252918-n#fra

This id is supposed to be unique (since it is an XML id); with lng added, it is also unique across languages.
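A rough sketch of that key scheme, assuming "#" stays the separator and never occurs inside a sense id, synset offset or language code. The function names are hypothetical.

```python
def sense_key(sense_id, synset, lang, sep="#"):
    """Join sense id, synset offset and language code into one index key."""
    return sep.join((sense_id, synset, lang))


def split_sense_key(key, sep="#"):
    """Recover (sense_id, synset, lang) from a key built by sense_key."""
    sense_id, synset, lang = key.split(sep)
    return sense_id, synset, lang


key = sense_key("w460232", "01252918-n", "fra")

# The same sense id used by another language still gives a different key.
other = sense_key("w460232", "01252918-n", "jpn")
```

Because no written form is embedded in the key, translations and accented lemmas no longer leak into the index.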

moreymat commented 10 years ago

I think we could benefit more from the graph structure with:

one node per (resource-specific) synset: fra-10-05976065-n as key (as in the LMF files)
one node per (resource-specific) lexical entry: fra-10-w487890 as key (following the "language-version-XXX" pattern used in the LMF files for synsets), with lemma (written form and part of speech) as attribute,
the rest as relations, including the lexical entry - synset associations.

What do you think?
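To make the proposal concrete, here is a small in-memory sketch of that layout, independent of neo4j. The class, the relation type `has_sense` and the written form `exemple` are placeholders I made up; only the two keys come from the LMF pattern above.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)      # key -> attribute dict
    relations: list = field(default_factory=list)  # (source, type, target)

    def add_synset(self, key):
        self.nodes[key] = {"kind": "synset"}

    def add_lexical_entry(self, key, written_form, pos):
        # The lemma is stored as attributes on the lexical entry node.
        self.nodes[key] = {"kind": "lexical_entry",
                           "written_form": written_form, "pos": pos}

    def relate(self, source, rel_type, target):
        # Refuse relations whose endpoints were never declared as nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError(f"undeclared node in {source} -> {target}")
        self.relations.append((source, rel_type, target))


g = Graph()
g.add_synset("fra-10-05976065-n")
g.add_lexical_entry("fra-10-w487890", written_form="exemple", pos="n")
g.relate("fra-10-w487890", "has_sense", "fra-10-05976065-n")
```

Each lexical entry and each synset is stored once, and everything else, including the entry-to-synset association, is an edge.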

rhin0cer0s commented 10 years ago

That's what I tried to explain during our last meeting! I like it because it will reduce our number of relations. We will try to produce it.

moreymat commented 10 years ago

Sorry for not getting your point back then!

On 22 Apr 2014 18:06, "Christophe Guieu" wrote:

> That's what I tried to explain during our last meeting ! I find it great because it will reduce our relation number. We will try to produce it.


rhin0cer0s commented 10 years ago

It works, but there are some problems with the synsets. Some synsets used in the relation declarations are not declared in any lexical entry, so some relations cannot be built in neo4j because there is no node for the corresponding synset.

But we can find them in the SenseAxis declarations, so we can find the English equivalent, and we know its place thanks to the synset relations.

Could these be the translation holes that we are looking for?
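A sketch of how such synsets could be listed, assuming each Sense element carries a synset attribute pointing at its synset (an assumption about the LMF layout; the Synset and SynsetRelation attributes match what a grep on the file shows). The sample document, its ids and the lemma are made up and much smaller than a real LMF file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of an LMF file; real files are far larger.
SAMPLE = """
<LexicalResource>
  <Lexicon>
    <LexicalEntry id='fra-10-w1'>
      <Lemma writtenForm='chien' partOfSpeech='n'/>
      <Sense id='fra-10-w1-s1' synset='fra-10-00000001-n'/>
    </LexicalEntry>
    <Synset id='fra-10-00000001-n'>
      <SynsetRelations>
        <SynsetRelation targets='fra-10-00000002-n' relType='hypo'/>
      </SynsetRelations>
    </Synset>
    <Synset id='fra-10-00000002-n'/>
  </Lexicon>
</LexicalResource>
"""


def orphan_synsets(root):
    """Return synset ids used in relations but attached to no lexical entry."""
    lexicalised = {sense.get("synset") for sense in root.iter("Sense")}
    used = set()
    for synset in root.iter("Synset"):
        for rel in synset.iter("SynsetRelation"):
            # Both ends of a relation need a node in the graph.
            used.update((synset.get("id"), rel.get("targets")))
    return used - lexicalised


orphans = orphan_synsets(ET.fromstring(SAMPLE))
```

The resulting set can either be skipped at import time or turned into bare synset nodes so that the relations still have something to attach to.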

moreymat commented 10 years ago

@rhin0cer0s : Yes, some of the missing synsets should be translation holes.

@fcbond : Does each LMF file contain all synsets and relations from PWN, or do you filter to keep a resource-dependent subset of synsets and relations that covers the resource (e.g. with a connected graph) ?

fcbond commented 10 years ago

G'day,

> @fcbond : Does each LMF file contain all synsets and relations from PWN, or do you filter to keep a resource-dependent subset of synsets and relations that covers the resource (e.g. with a connected graph) ?

We have all links, but only senses, definitions and examples from that language.

So some nodes have no information associated with them.

I am hoping that you will produce resource-dependent subsets for me.

Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

moreymat commented 10 years ago

@rhin0cer0s : can you provide an example here ?

@fcbond : let us open an issue on "producing resource-dependent subsets of relations". @rhin0cer0s @zorgulle do you think we could include this issue for milestone 0.2 ?

rhin0cer0s commented 10 years ago

We built the list of orphan synsets (151,378 of them), and we can even omit them if we want a clean import (they were just skipped before).

For example fra-10-11820323-n: if you grep wn-fra-lmf.xml for it, you get:

<Synset id='fra-10-11820323-n' baseConcept='3'>
         <SynsetRelation targets='fra-10-11820323-n' relType='hypo'/>
      <Target ID='fra-10-11820323-n'/>

Regarding the milestone, I don't know how to handle it. Today we can build a db that includes several languages (we tested on fra, alb and jpn; the English LMF doesn't seem to have been released).

moreymat commented 10 years ago

@rhin0cer0s The English LMF is on the OMW site (under "Princeton WordNet").

Issues #13 and #14 for milestones 0.2 and 0.3 will follow up on this.

I think we can close this issue for now.

moreymat / omw-graph

Add language to index key #8