histograph / core

Histograph Core: Graph management and inferencing
MIT License
1 stars 3 forks source link

Dataset delete does not work #57

Open mmmenno opened 8 years ago

mmmenno commented 8 years ago

Someone accidentally uploaded the rm dataset with wrong id's, i think. Suddenly there where id's like rm/8007.0 instead of the rm/8007 it should have been.

Deleting the entire dataset and upload it again with the proper ideas should do the trick, one imagines. Not so. While https://api.histograph.io/datasets/rm/pits gives you a "Dataset 'rm' not found", https://api.histograph.io/search?q=rm/8007.0 keeps returning pits.

Maybe deleting pits / relations from ES doesn't work? Very strange & a bit alarming. Any ideas, @ocataco ?

bertspaan commented 8 years ago

This is caused by a known bug graphmalizer/graphmalizer-core#18.

Graphmalizer's Cypher queries need to be changed, so that these VACANT nodes are deleted.

There is an easy fix which would probably also work:

Change https://github.com/histograph/neo4j-plugin and make sure it only returns non-VACANT nodes. This can be done here, probably: https://github.com/histograph/neo4j-plugin/blob/master/src/main/java/org/waag/histograph/plugins/ExpandConcepts.java#L33

wires commented 8 years ago

sorry guys, missed this issue! I'll schedule some time to dive into graphmalizer/graphmalizer-core#18 again

sbocconi commented 8 years ago

There seems to be a solution for this issue, as suggested by @bertspaan. Issue labeled as ready (to be solved) then.

tomdemeyer commented 8 years ago

VACANT nodes are created on delete? This seems not ok; how to fix? Or is this a separate issue??

import dataset pdc: match(n {dataset: "pdc" } ) return count(n) --> 5402 match(n:_VACANT {dataset: "pdc" } ) return count(n) --> 0 after delete dataset: match(n {dataset: "pdc" } ) return count(n) --> 1707 match(n:_VACANT {dataset: "pdc" } ) return count(n) --> 1707

wires commented 8 years ago

probably related, didn't have time to look at this, somewhere in queries.yml I forget the clear the vacant nodes)

sbocconi commented 8 years ago

VACANT: Betekent dit: als je relatie toevoegt tussen A en B, maar PIT’s A en B bestaan (nog) niet, wordt de relatie toegevoegd, en A en B ook, maar A en B zijn dan VACANT

There are several issues here, but can we say that adding a dataset and then removing it should leave the repository in an unchanged status? Or is this not true? This can be part of the tests

sbocconi commented 8 years ago

Another comment from @wires on Slack:

volgens mij is het de situatie [+] b -> c ; [+] a == b ; [+] c == d ; [-] b -> c die problemen oplevert dus met [+] … bedoel ik toevoegen

mmmenno commented 8 years ago

I only experienced this problem with the faulty rm dataset that had id's like rm/1641.0 - so maybe it had something to do with those strange id's. @tomdemeyer deleted the faulty set manually, so for now we got rid of the problem.

With the VACANT node diagnosis, we should've seen this problem more often and we should be able to reproduce the problem, right?

sbocconi commented 8 years ago

From a discussion with @bertspaan:

Drie problemen:

1. IO verwijdert ES-index niet, omdat dit problemen opleverde met ES-service van Amazon (als je index verwijdert die al verwijderd is maakt ES de index opnieuw aan, maar met verkeerde mapping. Of iets dergelijks)
2. neo4j-plugin grijpt ​*alle*​ PITs uit graaf, ook met label `VACANT`
3. Graphmalizer verwijdert `VACANT` niet (of niet altijd) op de goede manier

Number 3 is related to https://github.com/graphmalizer/graphmalizer-core/issues/18

wires commented 8 years ago

Regarding nr 3 above / issue 18 ; the guarantee graphmalizer gives you (when it's not bugged) is that

there will never appear a "dangling vacant node", that is, for all (n:_VACANT), degree(n) > 0.

So they should only appear when there is some edge referring to them.

What we notice in #18 is that there appears to arise a situation here a single _VACANT node remains connected to a 'concept node' (the representative node of an equivalence class; the collection of equivalent things).

In any case, I can run the tests again, so I will dive into this.

ps. It is a little bit subtle as I try to remain some "mathematical" properties of the program. For instance, we -kind of- have the following guarantee on concept nodes:

there will never appear "a dangling concept node", that is, for all (n:'='), degree(n) > 1

This can be either useful or annoying or both or neither :-) it also doesn't exclude the case ()--<>--() where you have two "dangling vacant nodes" connected to a concept node. This shouldn't be allowed, etc.

anyway, enough talk! COMPUTEREN!

tomdemeyer commented 8 years ago

What we notice in #18 is that there appears to arise a situation here a single _VACANT node remains connected to a 'concept node' (the representative node of an equivalence class; the collection of equivalent things).

This is exactly the case; 100% repeatable by just deleting a dataset. For me it is not clear if the connected concept node is ever also still connected to other nodes.

I am now manually (in neo4j) deleting the vacant nodes and relations (of a particular dataset), leaving the concept nodes alone. Will investigate further.

Tom Demeyer Waag Society http://waag.org

wires commented 8 years ago

Reproducible you say? I'm having trouble with reproducing it... @tomdemeyer @mmmenno could you send me this rm dataset? I started rewriting the cypher queries that make up graphmalizer and trying to speed everything up as well...

tomdemeyer commented 8 years ago

I was reproducing it with a TNL dataset a couple of times. Can send, but your schema’s & everything don’t match.

Tom Demeyer Waag Society http://waag.org

On 05 Feb 2016, at 10:59, Jelle Herold notifications@github.com wrote:

Reproducible you say? I'm having trouble with reproducing it... @tomdemeyer @mmmenno could you send me this rm dataset? I started rewriting the cypher queries that make up graphmalizer and trying to speed everything up as well...

— Reply to this email directly or view it on GitHub.

wires commented 8 years ago

If you have a small dataset that does the job, the smaller the better, I’m interested! Types and schema’s don’t matter; I will turn the dataset into a test

tomdemeyer commented 8 years ago

I just tested with a small dataset, deletes everything a-ok. pff.

except, of course, the ES index…

Tom Demeyer Waag Society http://waag.org

On 05 Feb 2016, at 11:16, Jelle Herold notifications@github.com wrote:

If you have a small dataset that does the job, the smaller the better, I’m interested! Types and schema’s don’t matter; I will turn the dataset into a test — Reply to this email directly or view it on GitHub.

mmmenno commented 8 years ago

@wires you could use (a small subset of) https://api.histograph.io/datasets/rm/pits. To really reproduce this issue, change the id's from id to id.0. O, and please leave the rm dataset on production as it is now, since I'll be giving a presentation at RCE - proud owners of rm - next wednesday!