own-pt / openWordnet-PT

OpenWordnet-PT: an open access wordnet for Portuguese
http://openwordnet-pt.org

Re-import incorrectly deleted nomlexes from the original file #103

Closed fcbr closed 8 years ago

fcbr commented 8 years ago

[ Placeholder issue, needs more careful analysis before implementation! ]

We need to reimport the missing nomlexes that were incorrectly deleted (issue #102).

We need to make sure that this does not create conflict with existing words.

fcbr commented 8 years ago

I just imported the nomlex.rdf on a separate repository and counted 4238 nominalizations. Looking at our current SPARQL endpoint, I found 4239 nominalizations. So I can't recall where the information that we lost nomlexes came from.

fcbr commented 8 years ago

The only strange thing that I found was by executing the following SPARQL:

select *
{
  ?x a nomlex:Nominalization .
  ?x nomlex:verb ?v .
  ?x nomlex:noun ?n .
  filter not exists {
    ?v a wn30:Word .
  }
}

(and similarly for noun). I found only one verb that doesn't exist.
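For reference, the noun variant mentioned above could be written roughly as follows. This is only a sketch: the `nomlex:` namespace is the one quoted later in this thread, while the `wn30:` namespace shown here is an assumption and should be replaced by whatever the endpoint actually binds.

```sparql
PREFIX nomlex: <https://w3id.org/own-pt/nomlex/schema/>
# Assumed namespace for wn30: (not stated in this thread); adjust as needed.
PREFIX wn30: <https://w3id.org/own-pt/wn30/schema/>

SELECT ?x ?v ?n
WHERE {
  ?x a nomlex:Nominalization ;
     nomlex:verb ?v ;
     nomlex:noun ?n .
  # Keep only nominalizations whose noun is not typed as a wn30:Word.
  FILTER NOT EXISTS { ?n a wn30:Word . }
}
```

Swapping `?n` for `?v` inside the filter gives back the verb check.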

fcbr commented 8 years ago

More queries:

All nomlexes that point to verbs that are not involved in word senses:

select (count(*) as ?c)
{
  ?x a nomlex:Nominalization .
  ?x nomlex:verb ?v .
  ?x nomlex:noun ?n .
  filter not exists {
    ?ws wn30:word ?v .
    ?ws a wn30:WordSense .
  }
}

(total: 675)

All nomlexes that point to nouns that are not involved in word senses:

select (count(*) as ?c)
{
  ?x a nomlex:Nominalization .
  ?x nomlex:verb ?v .
  ?x nomlex:noun ?n .
  filter not exists {
    ?ws wn30:word ?n .
    ?ws a wn30:WordSense .
  }
}

(total: 1554)

fcbr commented 8 years ago

At @arademaker's suggestion, I imported the previous RDF and still got the same count as before (4238). The last check-in on that file was in 2014.

Still investigating to understand what we thought happened.

vcvpaiva commented 8 years ago

hmm, but when we counted the nominalizations earlier this year we only had some 3600 of them. maybe we counted wrong then, but the point was that we didn't have the 4238 that we were supposed to have. actually I found the old emails. on March 16th I asked:

"I ran the SPARQL query three times and the number of nomlexes I get (when I download it as CSV, open it in Excel and check the number of rows) is 3587, nowhere near 4200; that is why I am asking where this number comes from."

the sparql query that I was executing was http://wnpt.brlcloud.com:10035/repositories/wn30#query/r/all-nomlex, which is:

PREFIX nomlex: <https://w3id.org/own-pt/nomlex/schema/>

select ?w1 ?w2 ?prov
{
  ?nm a nomlex:Nominalization ;
      nomlex:verb ?w1 ;
      nomlex:noun ?w2 ;
      dc:provenance ?prov .
}

but right now (executing the query and downloading it again) I get a total of 3527 nominalizations. the difference seems to be the filter?

fcbr commented 8 years ago

Thank you for that query Valeria! I think I may have an idea of what may be going on.

With your query we indeed get 3587 nominalizations.

But if we remove the dc:provenance line, we get 4238. We have 718 nominalizations without provenance.

Now, on to investigate where these nominalizations without a provenance come from.
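The comparison above can be reproduced in a single query by grouping on whether a provenance triple exists. This is a sketch rather than the query actually run here, and the `dc:` binding (Dublin Core Terms) is an assumption about how the endpoint defines that prefix.

```sparql
PREFIX nomlex: <https://w3id.org/own-pt/nomlex/schema/>
# Assumed binding for dc: (the thread does not show it).
PREFIX dc: <http://purl.org/dc/terms/>

SELECT ?hasProvenance (COUNT(*) AS ?c)
WHERE {
  ?nm a nomlex:Nominalization ;
      nomlex:verb ?w1 ;
      nomlex:noun ?w2 .
  # true when at least one dc:provenance triple exists, false otherwise;
  # unlike a required triple pattern, this never drops a row.
  BIND (EXISTS { ?nm dc:provenance ?prov } AS ?hasProvenance)
}
GROUP BY ?hasProvenance
```

Because `dc:provenance` is only tested, not matched as a required pattern, nominalizations without provenance stay in the result and are counted in their own group.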

fcbr commented 8 years ago

OK, looking back at the original Nomlex RDF we also have entries without a provenance there. I counted 3532 with provenance and 709 without provenance, adding to 4238.

Given that over the years we have manually edited nomlexes to fix misspellings, etc., I think these numbers make sense.

So I guess we found the issue after all: it seems to be simply a bad SPARQL query that led us to think that we had removed nomlexes.

vcvpaiva commented 8 years ago

agreed! the problem must have been the 718 nominalizations without provenance. but I still have a problem. above you say

"All nomlexes that point to verbs that are not involved in word senses: ... (total: 675)"

but in the GitHub experiment we had (in March 2016) at http://wnpt.brlcloud.com/wn/prototypes/corpora#nomlexfloating:

Number of words in corpus: 699. In OWN-PT: 65. In suggestions: 224. Missing: 410.

I'm at work, where I cannot run the query, but how many floating verbs do we have now?

arademaker commented 8 years ago

@vcvpaiva can we close this issue and open another one to talk about floating nomlexes?

vcvpaiva commented 8 years ago

sure, the issue of missing nomlexes is sorted out. but I think only Chalub can close it?