trellis-ldp-archive / trellis-cassandra

Trellis LDP using Apache Cassandra for persistence

Duplication possible of the contains relationship #32

Closed gregjan closed 5 years ago

gregjan commented 5 years ago

I was running performance tests. What I found after two tests, between which I failed to reset the database, was that the root folder showed two identical ldp:contains relationships. Presumably there is only one contained resource, since the object of both triples was the same. I will look at the C* tables and report what I find there in a follow up comment.

ajs6f commented 5 years ago

Nope, I know exactly what this is. I forgot to make t-c* check whether there is a root container before creating one, so it's going to create a new one on every startup, and those triples are that becoming visible. Fix OTW immediately!
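The guard described here amounts to making root creation idempotent. A minimal sketch of that idea, assuming a hypothetical in-memory list in place of the Cassandra tables (the class and method names are illustrative, not the t-c* code):

```java
import java.util.ArrayList;
import java.util.List;

class RootInitSketch {
    static final String ROOT = "trellis:data/";

    // Unguarded startup: appends another root row every time it runs.
    static void createRoot(List<String> rows) {
        rows.add(ROOT);
    }

    // Guarded startup: creates the root only when no row for it exists yet.
    static void createRootIfAbsent(List<String> rows) {
        if (!rows.contains(ROOT)) {
            rows.add(ROOT);
        }
    }

    public static void main(String[] args) {
        List<String> unguarded = new ArrayList<>();
        List<String> guarded = new ArrayList<>();
        // Simulate three application startups against the same store.
        for (int startup = 0; startup < 3; startup++) {
            createRoot(unguarded);
            createRootIfAbsent(guarded);
        }
        System.out.println(unguarded.size() + " root rows vs " + guarded.size());
    }
}
```

Note that a check followed by an insert is only safe in a single process; several nodes starting at the same moment could each pass the check before any of them writes.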

ajs6f commented 5 years ago

@gregjan Just added a commit to master-- can you try that? Should fix this.

ajs6f commented 5 years ago

@gregjan When you have a chance, please confirm that this issue is fixed (or not!) and close this ticket as is appropriate. Thanks!

ajs6f commented 5 years ago

@gregjan Did you have a chance to check whether this delivered a fix?

ajs6f commented 5 years ago

@gregjan Just a ping on this ticket-- did the fix mentioned in https://github.com/trellis-ldp/trellis-cassandra/issues/32#issuecomment-442875457 do the job?

ajs6f commented 5 years ago

Based on our 'phone conversation the other day, I'm going to close this. Cool, @gregjan?

gregjan commented 5 years ago

I am seeing this issue again. I will copy my comment in here from the consistency feature, where it does exactly belong. In short, the folder test has raised this issue again.

http://ciber-vs1.umd.edu:10080/ http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains 
http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv ; http://www.w3.org/ns/ldp#contains http://ciber-vs1.umd.edu:10080/srv .

gregjan commented 5 years ago

So I think I have uncovered that duplication bug again, that may relate to consistency, in my testing. I am seeing multiple contains relationships to a child from the root node in the new folder test. You can see that root folder here: http://ciber-vs1.umd.edu:10080/

There are 46 contains relationships to the same subfolder, which was created via a POST with the slug of "srv". This resulted from a test where parallel workers would have tried to create the folder and all but the first one would expect a 409 response. I also find it interesting that there are only 46, while there were 11963 unique users recorded in ES. So we don't have 11963 contains relations, only 46.. I will post a follow up with what I see in the C* table.

gregjan commented 5 years ago

cqlsh:trellis> select * from basiccontainment;

 container     | identifier       | created
---------------+------------------+--------------------------------------
 trellis:data/ | trellis:data/srv | 3034b1b0-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 3033a040-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337934-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337933-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337932-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337931-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337930-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 302e4915-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4915-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4914-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4914-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4913-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4913-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4912-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4912-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4911-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4911-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4910-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4910-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302b3bd0-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c4-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c3-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c2-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c1-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c0-3f64-11e9-8f2d-7bb57da4be02

^^ 25 rows

gregjan commented 5 years ago

cqlsh:trellis> select * from immutabledata ;

 identifier       | created                         | quads
------------------+---------------------------------+-------
 trellis:data/srv | 2019-03-05 16:31:54.978000+0000 | null
 trellis:data/srv | 2019-03-05 16:31:54.977000+0000 | null
 trellis:data/srv | 2019-03-05 16:31:54.942000+0000 | null
 trellis:data/srv | 2019-03-05 16:31:54.940000+0000 | null
 trellis:data/srv | 2019-03-05 16:31:54.939000+0000 | null
 trellis:data/srv | 2019-03-05 16:31:54.909000+0000 | null
 trellis:data/srv | 2019-03-05 16:31:54.908000+0000 | null

gregjan commented 5 years ago

Lastly, in the mutabledata table, I see similar duplication for the trellis:data/srv objects.

25 rows like this:

trellis:data/srv | 3034b1b0-3f64-11e9-a0cd-75dc2f8c34a4 | null | trellis:data/ | 2019-03-05 16:31:54.000000+0000 | null | http://www.w3.org/ns/ldp#BasicContainer | null | 2019-03-05 16:31:54.955000+0000 | <trellis:data/srv> http://purl.org/dc/terms/title "srv" http://www.trellisldp.org/ns/trellis#PreferUserManaged .\n<trellis:data/srv> http://purl.org/dc/terms/extent "0" http://www.trellisldp.org/ns/trellis#PreferUserManaged .\n

However, I also see 4 rows representing the root node:

trellis:data/ | 2fead770-3f64-11e9-a0cd-75dc2f8c34a4 |             null |          null | 2019-03-05 16:31:54.000000+0000 |   null | http://www.w3.org/ns/ldp#BasicContainer |     null | 2019-03-05 16:31:55.061000+0000 |                                                                                                                                                  

The uuids are different in all cases within a table, from my spot tests.

gregjan commented 5 years ago

Note that the trellis keyspace on my C* cluster shows a replication factor of just 1. So additional consistency may not help. There is no collision between the various items, since the created timestamp Instant is included in the primary key. In fact each may or may not be in the same partition.

Note that the root node is created at webapp startup, which is happening on four nodes in this recent test. These node partition keys include the Instant, which varies slightly for each root.

I wonder if we need the instant in the partition key at all? Without it we can hash predict the C* nodes that have our data based on path alone. I also wonder if we are going to need a "lightweight transaction", adding "IF NOT EXISTS" to the InsertImmutable and possibly other queries. I can give this a shot locally.
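To make the two ideas in this comment concrete, here is a toy model, not driver code. A map keyed on identifier plus creation instant accepts every attempt as its own row, while a map keyed on identifier alone with a conditional put behaves roughly like an INSERT with IF NOT EXISTS. The key layouts and method names are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

class InsertIfNotExistsSketch {
    // Key includes the creation instant: each attempt lands in its own row.
    // Returns the number of rows in the table after the insert.
    static int insertWithInstantInKey(Map<String, String> table, String id, long createdMillis) {
        table.put(id + "@" + createdMillis, "row");
        return table.size();
    }

    // Key is the identifier only, inserted conditionally (analogous to IF NOT EXISTS).
    // Returns true when this call created the row, false when one already existed.
    static boolean insertIfNotExists(Map<String, String> table, String id) {
        return table.putIfAbsent(id, "row") == null;
    }

    public static void main(String[] args) {
        Map<String, String> keyedByInstant = new HashMap<>();
        insertWithInstantInKey(keyedByInstant, "trellis:data/srv", 1L);
        insertWithInstantInKey(keyedByInstant, "trellis:data/srv", 2L);
        System.out.println("rows with instant in key: " + keyedByInstant.size());

        Map<String, String> keyedById = new HashMap<>();
        System.out.println("first insert applied: " + insertIfNotExists(keyedById, "trellis:data/srv"));
        System.out.println("second insert applied: " + insertIfNotExists(keyedById, "trellis:data/srv"));
    }
}
```

In this model the second conditional insert reports that it was not applied, which is the signal a caller could turn into a conflict response.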

acoburn commented 5 years ago

The basiccontainment table is a materialized view that queries the mutabledata table. From what I can tell, both ResourceService::create and ResourceService::replace issue INSERT INTO ... requests to the mutabledata table, so any replace operation (e.g. PUT or PATCH) will lead to duplicate rows in that table for a given identifier (this may be intentional for storing mementos). Presumably, the DISTINCT keyword could be added to that materialized view query, but I'm not sure if that would consider the case where a child resource has been deleted and should be excluded from the results altogether.

gregjan commented 5 years ago

I'm not sure either, but I did realize that the Instant inclusion is probably due to the versioning requirements. So that makes more sense to me now. It does seem like a filter is missing here to return only the most recent contents.

gregjan commented 5 years ago

Given the schema, it seems like the CassandraResource.basicContainmentTriples() method should include some filtering, or that the underlying view should include that. I am wondering if the duplicate rows are part of memento support or not. Perhaps some, but not all.. If the triple between parent and child already exists, then do we need the new row in order to update the timestamp?

gregjan commented 5 years ago

Hmm, it strikes me that a POST or a PUT would simply update the object in question, creating a new mutable metadata record. The bug seems to in fact be that contains triples are being reported out redundantly for all revisions of the container, i.e. it is a GET error, rather than an error in the recorded data. (409s are for checksum mismatch or unsupported triples, not for Conflicting LDP paths, it seems.)

gregjan commented 5 years ago

As an aside, it seems like my test scenario should check if the folder exists before POSTing a new copy of it, unless we want to test that function exhaustively.

gregjan commented 5 years ago

Looks like the materialized view has this: .. WITH CLUSTERING ORDER BY (identifier ASC, created DESC) ..

So this means that the results should stream out in default order that way, with a series of mementos for each child in turn. The first one of each should be the latest one.
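Given that ordering, keeping only the first row seen for each identifier yields the latest revision of every child. A small sketch, assuming rows arrive as hypothetical {identifier, created} pairs already sorted by identifier ASC, created DESC (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LatestPerChildSketch {
    // rows: {identifier, created} pairs, ordered identifier ASC, created DESC.
    static List<String[]> latestPerChild(List<String[]> rows) {
        Map<String, String[]> latest = new LinkedHashMap<>();
        for (String[] row : rows) {
            // The first row seen for an identifier is its newest revision.
            latest.putIfAbsent(row[0], row);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"trellis:data/a", "3"},
                new String[] {"trellis:data/a", "1"},
                new String[] {"trellis:data/b", "2"});
        for (String[] row : latestPerChild(rows)) {
            System.out.println(row[0] + " created=" + row[1]);
        }
    }
}
```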

gregjan commented 5 years ago

I think one fix is to add .distinct() to the basicContainmentTriples() method's streaming pipeline. Adding "DISTINCT" to the query of the current view does not work, because C* restricts DISTINCT to partition key columns, and the partition key currently includes only "container", not "identifier". If we update the materialized view to include the identifier in the partition key (note the added inner parens): .. PRIMARY KEY ((container, identifier), created) .. then we can use DISTINCT container, identifier and throw away the extra column, getting one copy of each child relationship. (We would also have to ALLOW FILTERING in the query.)
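The first option, deduplicating in the stream, can be sketched as follows. This is a simplified model that uses plain strings where the real method presumably streams RDF objects; the class name, the method signature, and the string form of the triples are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DistinctContainmentSketch {
    static final String LDP_CONTAINS = "http://www.w3.org/ns/ldp#contains";

    // childRows holds one child identifier per stored revision row, so repeats are expected.
    static List<String> containmentTriples(String container, List<String> childRows) {
        return childRows.stream()
                .map(child -> "<" + container + "> <" + LDP_CONTAINS + "> <" + child + "> .")
                .distinct() // collapse one triple per revision into one triple per child
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "trellis:data/srv", "trellis:data/srv", "trellis:data/srv", "trellis:data/other");
        containmentTriples("trellis:data/", rows).forEach(System.out::println);
    }
}
```

Deduplicating in the stream leaves the schema untouched, at the cost of still reading every revision row from the view.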

gregjan commented 5 years ago

I have a test of this ready to go, whenever I can get back to some decent wifi for the builds.

gregjan commented 5 years ago

I will check the triples stored for the container for the contains relationship, since I still find one extra triple in the GET output.

gregjan commented 5 years ago

Here is the example of a container that shows duplicate relationships. This is after a distinct() was added to the basicContainmentTriples() method:

jansen@X1:~$ curl http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes
<http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes>
        <http://purl.org/dc/terms/extent>  "2" ;
        <http://purl.org/dc/terms/title>  "Transfer+Notes" ;
        <http://www.w3.org/ns/ldp#contains>  <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara1_vault10> ;
        <http://www.w3.org/ns/ldp#contains>  <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara1_vault6> ;
        <http://www.w3.org/ns/ldp#contains>  <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara2_vault10> ;
        <http://www.w3.org/ns/ldp#contains>  <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara1_vault10> ;
        <http://www.w3.org/ns/ldp#contains>  <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara1_vault6> ;
        <http://www.w3.org/ns/ldp#contains>  <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara2_vault10> .

When I look at the mutabledata table for this identifier, I see that the latest row has this quads text:

 <trellis:data/srv/ciber/Transfer+Notes> <http://purl.org/dc/terms/title> "Transfer+Notes" <http://www.trellisldp.org/ns/trellis#PreferUserManaged> .\n<trellis:data/srv/ciber/Transfer+Notes> <http://purl.org/dc/terms/extent> "2" <http://www.trellisldp.org/ns/trellis#PreferUserManaged> .\n

There are various timestamped records that are identical to this one and I do not see any contains relationships in the stored quads. Note that there are no quads stored for this identifier in the immutabledata table.

gregjan commented 5 years ago

Okay, I can confirm that adding distinct() in CassandraResource.basicContainmentTriples() results in exactly 2 copies of each containment triple in the output. It looks to me as if the logic of PreferContainment and PreferMembership in the calling code (CassandraResource#125) is not mutually exclusive.
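One way this observation could arise, sketched as a hypothetical model rather than the actual code in CassandraResource: if the branch for each requested graph contributes the same (already deduplicated) containment triples, a request covering both graphs gets every child twice. All names below are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PreferOverlapSketch {
    static final String CONTAINMENT = "http://www.w3.org/ns/ldp#PreferContainment";
    static final String MEMBERSHIP = "http://www.w3.org/ns/ldp#PreferMembership";

    // Hypothetical overlapping logic: both branches add the ldp:contains objects.
    static List<String> overlapping(Set<String> graphs, List<String> children) {
        List<String> out = new ArrayList<>();
        if (graphs.contains(CONTAINMENT)) {
            out.addAll(children);
        }
        if (graphs.contains(MEMBERSHIP)) {
            out.addAll(children);
        }
        return out;
    }

    // Hypothetical exclusive logic: only the containment graph carries ldp:contains.
    static List<String> exclusive(Set<String> graphs, List<String> children) {
        List<String> out = new ArrayList<>();
        if (graphs.contains(CONTAINMENT)) {
            out.addAll(children);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> both = new HashSet<>(Arrays.asList(CONTAINMENT, MEMBERSHIP));
        List<String> children = Arrays.asList("nara1_vault10", "nara1_vault6", "nara2_vault10");
        System.out.println("overlapping: " + overlapping(both, children).size());
        System.out.println("exclusive: " + exclusive(both, children).size());
    }
}
```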

ajs6f commented 5 years ago

Awesome, @gregjan, that's a really precise diagnostic. I'll get on a fix right away.

ajs6f commented 5 years ago

Hang on, before I do: @acoburn, take a look at @gregjan's comment above. If he's right, then is it a question of the backend emitting distinct quads and the HTTP layer creating triples by dropping the graph slot? That's not wrong, from the semantic POV (and the responses at HTTP that @gregjan is seeing aren't wrong, in that sense) but it does seem to be surprising. Do we need to add ::distinct up there, or should we make a claim about expectations? I tend to the former position.
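The scenario raised in this comment, distinct quads turning into duplicate triples once the graph slot is dropped, can be illustrated with a toy model. Quads are modeled here as hypothetical four-element string arrays, and the graph names in the example are placeholders; none of this is the Trellis HTTP layer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class QuadFlattenSketch {
    // A quad is modeled as {graph, subject, predicate, object}.
    static List<String> toTriples(List<String[]> quads, boolean dedupe) {
        Stream<String> triples = quads.stream()
                .map(q -> q[1] + " " + q[2] + " " + q[3]); // drop the graph slot
        return (dedupe ? triples.distinct() : triples).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two quads that differ only in their graph name.
        List<String[]> quads = Arrays.asList(
                new String[] {"graphA", "<parent>", "<ldp:contains>", "<child>"},
                new String[] {"graphB", "<parent>", "<ldp:contains>", "<child>"});
        System.out.println("without distinct: " + toTriples(quads, false).size());
        System.out.println("with distinct: " + toTriples(quads, true).size());
    }
}
```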

ajs6f commented 5 years ago

Or again, @gregjan, am I misunderstanding the problem: are the other Trellis backends you've seen showing this problem?

gregjan commented 5 years ago

Good question, I will start one and see.


gregjan commented 5 years ago

I just managed not to duplicate this on trellis external db (version 2.1-SNAPSHOT). Added three resources at the root. They were only reported once as "contains" objects. Then I verified the non-dupe at one level below root, inside a container.

gregjan commented 5 years ago

Just verified it on trellis-extdb:0.3.0-SNAPSHOT image too.

ajs6f commented 5 years ago

Interesting, so it's not what I suggested above, and what's more, it's not quite what @gregjan suggested. @gregjan, can you show me an example request that you're making to get this behavior? I don't think it could be a case of the triple being added to both graphs, because there's no way I can see them both being flattened and streamed to triples. But I'm not sure what your settings are. Are you sending any Prefer headers?

ajs6f commented 5 years ago

Also, not to beat the logging horse, but this line does give us the option to turn logging up to TRACE and see what containment triples are being built.

gregjan commented 5 years ago

Hmm, my recent test added containers to other containers. Perhaps I should stick to my original case, which was adding binaries to containers.. BRB

gregjan commented 5 years ago

Here is a link to my session that duplicates it using container objects this time. Didn't set any Prefer headers in my first GET, but I did in a second request (include PreferContainment). https://gist.github.com/gregjan/115c860921d259ea151c87703e5cccfd

gregjan commented 5 years ago

Now when I omit PreferContainment, I see only one contains relationship:

jansen@X1:~$ curl -H "Prefer: return=representation; omit=\"http://www.w3.org/ns/ldp#PreferContainment\"" http://ciber-vs1.umd.edu:10080/cbad3300-337f-40f9-aeaf-592425dc2c9d
<http://ciber-vs1.umd.edu:10080/cbad3300-337f-40f9-aeaf-592425dc2c9d>
        <http://www.w3.org/ns/ldp#contains>  <http://ciber-vs1.umd.edu:10080/cbad3300-337f-40f9-aeaf-592425dc2c9d/a6a6b759-c0fb-46e0-a1b2-4e13002ff0ec> .

gregjan commented 5 years ago

I apologize if this is the intended functionality w/regards to spec!

ajs6f commented 5 years ago

Aha! And just to be clear, is that the exact same sequence you used for the other backends, including the last request with PreferContainment omitted?

ajs6f commented 5 years ago

Right, that's what I'm trying to figure out. It's surprising no matter what, so we probably have some work to do, but I'm trying to figure out whether t-c* is doing something wrong or whether Trellis is doing everything right ("semantically" and to the spec) but could be a little more "ergonomic" by dropping duplicate triples. Or something else.

gregjan commented 5 years ago

Nope. In fact I never set Prefer headers in the test of the DB back end. That test only showed one contains relationship w/o Prefer header set, so I didn't see a need at that point.

ajs6f commented 5 years ago

Ok, are you saying that this sequence is a test that distinguishes t-c* from the other backends? Duplicate triples from t-c* but not from the others? I'm sorry to be so finicky about this, but it's to a good cause; I'm going to reproduce the behavior.

gregjan commented 5 years ago

NOTE: I just ran the same test on the DB back end. If I Prefer to omit containment representations there, then I don't see any contains relationships reported at all.

gregjan commented 5 years ago

It looks to me that the other back end (DB) does not consider the contains triples to be a part of the PreferMembership representation, only the PreferContainment representation.

ajs6f commented 5 years ago

Okay, so t-c* is emitting one containment triple when you omit PreferContainment and two when you do not omit it?

gregjan commented 5 years ago

Yes, indeed.

ajs6f commented 5 years ago

Ok, cool. Can you possibly run that sequence (set up, request with no prefers, request with omit PreferContainment) at logging levels DEBUG for org.trellisldp and TRACE for edu.si? That should enable me to see where Trellis is changing its calls to the backend for those different requests. Then I can figure what the backend is doing differently, if anything.

gregjan commented 5 years ago

Sure thing. Back in 5 mins or so.

gregjan commented 5 years ago

I thought this had trace turned on, but I don't see any TRACE lines. Anyway, here is my first attempt with both session and logs: https://gist.github.com/gregjan/ce621282223141e1e53ccb68843f17f7

gregjan commented 5 years ago

Turning logging back up and running again.. 10 mins

gregjan commented 5 years ago

Ouch, that build failed to start for some reason related to the config provider. I'll have to dig into that tomorrow. Hope that the debug session above helps somewhat in the meantime.

ajs6f commented 5 years ago

@acoburn, are you using Resource::stream(Collection<IRI>) and ::stream(IRI) to select for different Prefer headers?

ajs6f commented 5 years ago

Looks like no.