Nope, I know exactly what this is. I forgot to make t-c* check whether there is a root container before creating one, so it's going to create a new one on every startup, and those triples are what's becoming visible. Fix OTW immediately!
@gregjan Just added a commit to master-- can you try that? Should fix this.
@gregjan When you have a chance, please confirm that this issue is fixed (or not!) and close this ticket as is appropriate. Thanks!
@gregjan Just a ping on this ticket-- did the fix mentioned in https://github.com/trellis-ldp/trellis-cassandra/issues/32#issuecomment-442875457 do the job?
Based on our phone conversation the other day, I'm going to close this. Cool, @gregjan?
I am seeing this issue again. I will copy my comment in here from the consistency feature issue, where it exactly belongs. In short, the folder test has raised this issue again.
<http://ciber-vs1.umd.edu:10080/>
    <http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv> ;
    <http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv> ;
    <http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv> ;
    [... the same ldp:contains triple repeated, 46 in total ...]
    <http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv> .
So I think my testing has uncovered that duplication bug again, which may relate to consistency. I am seeing multiple contains relationships to a child from the root node in the new folder test. You can see that root folder here: http://ciber-vs1.umd.edu:10080/
There are 46 contains relationships to the same subfolder, which was created via a POST with the slug "srv". This resulted from a test in which parallel workers would all have tried to create the folder, and all but the first would expect a 409 response. I also find it interesting that there are only 46, while there were 11963 unique users recorded in ES; so we don't have 11963 contains relations, only 46. I will post a follow-up with what I see in the C* tables.
cqlsh:trellis> select * from basiccontainment;
 container     | identifier       | created
---------------+------------------+--------------------------------------
 trellis:data/ | trellis:data/srv | 3034b1b0-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 3033a040-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337934-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337933-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337932-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337931-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 30337930-3f64-11e9-a0cd-75dc2f8c34a4
 trellis:data/ | trellis:data/srv | 302e4915-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4915-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4914-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4914-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4913-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4913-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4912-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4912-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4911-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4911-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302e4910-3f64-11e9-9acd-f1fd5771304a
 trellis:data/ | trellis:data/srv | 302e4910-3f64-11e9-838d-1b40b670cc8f
 trellis:data/ | trellis:data/srv | 302b3bd0-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c4-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c3-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c2-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c1-3f64-11e9-8f2d-7bb57da4be02
 trellis:data/ | trellis:data/srv | 302b14c0-3f64-11e9-8f2d-7bb57da4be02
^^ 25 rows
cqlsh:trellis> select * from immutabledata ;
 identifier       | created                         | quads
------------------+---------------------------------+-------
 trellis:data/srv | 2019-03-05 16:31:54.978000+0000 |  null
 trellis:data/srv | 2019-03-05 16:31:54.977000+0000 |  null
 trellis:data/srv | 2019-03-05 16:31:54.942000+0000 |  null
 trellis:data/srv | 2019-03-05 16:31:54.940000+0000 |  null
 trellis:data/srv | 2019-03-05 16:31:54.939000+0000 |  null
 trellis:data/srv | 2019-03-05 16:31:54.909000+0000 |  null
 trellis:data/srv | 2019-03-05 16:31:54.908000+0000 |  null
Lastly, in the mutabledata table, I see similar duplication for the trellis:data/srv objects.
25 rows like this:
trellis:data/srv | 3034b1b0-3f64-11e9-a0cd-75dc2f8c34a4 | null | trellis:data/ | 2019-03-05 16:31:54.000000+0000 | null | http://www.w3.org/ns/ldp#BasicContainer | null | 2019-03-05 16:31:54.955000+0000 | <trellis:data/srv> <http://purl.org/dc/terms/title> "srv" <http://www.trellisldp.org/ns/trellis#PreferUserManaged> .\n<trellis:data/srv> <http://purl.org/dc/terms/extent> "0" <http://www.trellisldp.org/ns/trellis#PreferUserManaged> .\n
However, I also see 4 rows representing the root node:
trellis:data/ | 2fead770-3f64-11e9-a0cd-75dc2f8c34a4 | null | null | 2019-03-05 16:31:54.000000+0000 | null | http://www.w3.org/ns/ldp#BasicContainer | null | 2019-03-05 16:31:55.061000+0000 |
The UUIDs are different in all cases within a table, from my spot checks.
Note that the trellis keyspace on my C* cluster shows a replication factor of just 1, so additional consistency tuning may not help. There is no collision between the various rows, since the created timestamp (an Instant) is included in the primary key; in fact each row may or may not land in the same partition.
Note that the root node is created at webapp startup, which happened on four nodes in this recent test. The partition keys for these rows include the Instant, which varies slightly for each root.
I wonder if we need the Instant in the partition key at all? Without it, we could predict from the partition hash which C* nodes hold our data based on the path alone. I also wonder if we are going to need a lightweight transaction, adding IF NOT EXISTS to the InsertImmutable query and possibly others. I can give this a shot locally.
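For illustration, here is the shape of what I mean, as a minimal sketch with the DataStax Java driver; the table and column names are made up for the example, not the actual t-c* schema:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class LwtSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("trellis")) {
            // IF NOT EXISTS turns the INSERT into a lightweight transaction:
            // it runs a Paxos round and applies only when no row with this
            // primary key exists, so concurrent creators cannot all succeed.
            ResultSet rs = session.execute(
                "INSERT INTO containment (container, identifier) "
              + "VALUES ('trellis:data/', 'trellis:data/srv') IF NOT EXISTS");
            if (!rs.wasApplied()) {
                System.out.println("Row already present; another worker won the race.");
            }
        }
    }
}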
The basiccontainment table is a materialized view that queries the mutabledata table. From what I can tell, both ResourceService::create and ResourceService::replace issue INSERT INTO ... requests to the mutabledata table, so any replace operation (e.g. PUT or PATCH) will lead to duplicate rows in that table for a given identifier (this may be intentional for storing mementos). Presumably, the DISTINCT keyword could be added to that materialized view query, but I'm not sure if that would cover the case where a child resource has been deleted and should be excluded from the results altogether.
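To make that concrete, here is a sketch (via the Java driver) of the kind of view being described; the column list and WHERE clause are reconstructed guesses, not the actual t-c* DDL:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ViewSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("trellis")) {
            // Every INSERT into mutabledata (from create *or* replace) also
            // materializes a row here, which is where the duplicates come from.
            session.execute(
                "CREATE MATERIALIZED VIEW IF NOT EXISTS basiccontainment AS "
              + "SELECT container, identifier, created FROM mutabledata "
              + "WHERE container IS NOT NULL AND identifier IS NOT NULL "
              + "AND created IS NOT NULL "
              + "PRIMARY KEY (container, identifier, created) "
              + "WITH CLUSTERING ORDER BY (identifier ASC, created DESC)");
        }
    }
}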
I'm not sure either, but I did realize that the inclusion of the Instant is probably due to the versioning requirements, so that makes more sense to me now. It does seem like a filter is missing here, to return only the most recent contents.
Given the schema, it seems like the CassandraResource.basicContainmentTriples() method should include some filtering, or the underlying view should. I am wondering whether the duplicate rows are part of memento support or not. Perhaps some, but not all. If the triple between parent and child already exists, do we need the new row just to update the timestamp?
Hmm, it strikes me that a POST or a PUT would simply update the object in question, creating a new mutable metadata record. The bug seems in fact to be that contains triples are reported redundantly for all revisions of the container, i.e. it is a GET error rather than an error in the recorded data. (409s are for checksum mismatches or unsupported triples, not for conflicting LDP paths, it seems.)
As an aside, it seems like my test scenario should check if the folder exists before POSTing a new copy of it, unless we want to test that function exhaustively.
Looks like the materialized view has this: .. WITH CLUSTERING ORDER BY (identifier ASC, created DESC) ..
So the results should stream out in that default order, with a series of mementos for each child in turn; the first of each should be the latest one.
I think one fix is to add .distinct() to the basicContainmentTriples() method's streaming pipeline. Adding DISTINCT to the current view's query does not work, because C* restricts DISTINCT to partition-key columns, and the partition key currently includes only "container", not "identifier". If we update the materialized view to include the identifier in the partition key:
.. PRIMARY KEY ((container, identifier), created) ..
(added inner parens)
Then we can use SELECT DISTINCT container, identifier and throw away the extra column, getting one copy of each child relationship. (We'd also have to ALLOW FILTERING in the query.)
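As a sketch of the .distinct() option, using the Commons RDF types Trellis builds on (the method body and the row-to-triple mapping are paraphrased for illustration, not copied from t-c*):

import java.util.stream.Stream;
import org.apache.commons.rdf.api.IRI;
import org.apache.commons.rdf.api.RDF;
import org.apache.commons.rdf.api.Triple;
import org.apache.commons.rdf.simple.SimpleRDF;
import org.trellisldp.vocabulary.LDP;

public class ContainmentSketch {
    private static final RDF rdf = new SimpleRDF();

    // One triple per containment row: rows for older revisions of the same
    // child map to equal triples, which distinct() collapses, since Commons
    // RDF triples define equals/hashCode by term equality.
    static Stream<Triple> basicContainmentTriples(IRI container, Stream<IRI> childRows) {
        return childRows
                .map(child -> rdf.createTriple(container, LDP.contains, child))
                .distinct();
    }
}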
I have a test of this ready to go, whenever I can get back to some decent wifi for the builds.
I will check the triples stored for the container for the contains relationship, since I still find one extra triple in the GET output.
Here is the example of a container that shows duplicate relationships. This is after a distinct() was added to the basicContainmentTriples() method:
jansen@X1:~$ curl http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes
<http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes>
<http://purl.org/dc/terms/extent> "2" ;
<http://purl.org/dc/terms/title> "Transfer+Notes" ;
<http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara1_vault10> ;
<http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara1_vault6> ;
<http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara2_vault10> ;
<http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara1_vault10> ;
<http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara1_vault6> ;
<http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/srv/ciber/Transfer+Notes/nara2_vault10> .
When I look at the mutabledata table for this identifier, I see that the latest row has this quads text:
<trellis:data/srv/ciber/Transfer+Notes> <http://purl.org/dc/terms/title> "Transfer+Notes" <http://www.trellisldp.org/ns/trellis#PreferUserManaged> .\n<trellis:data/srv/ciber/Transfer+Notes> <http://purl.org/dc/terms/extent> "2" <http://www.trellisldp.org/ns/trellis#PreferUserManaged> .\n
There are various timestamped records that are identical to this one, and I do not see any contains relationships in the stored quads. Note that there are no quads stored for this identifier in the immutabledata table.
Okay, I can confirm that adding distinct() in CassandraResource.basicContainmentTriples() results in exactly 2 copies of each containment triple in the output. It looks to me as if the logic of PreferContainment and PreferMembership in the calling code (CassandraResource#125) is not mutually exclusive.
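A toy illustration of that hypothesis (not the t-c* code): if the two graph selections are not mutually exclusive, concatenating their streams emits each shared triple twice unless something downstream deduplicates.

import java.util.List;
import java.util.stream.Stream;

public class OverlapSketch {
    public static void main(String[] args) {
        // Pretend both graph selections matched the same containment triple.
        List<String> containment = List.of("<parent> <ldp:contains> <child> .");
        List<String> membership  = List.of("<parent> <ldp:contains> <child> .");
        // Prints the triple twice; a .distinct() here would collapse it.
        Stream.concat(containment.stream(), membership.stream())
              .forEach(System.out::println);
    }
}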
Awesome, @gregjan, that's a really precise diagnostic. I'll get on a fix right away.
Hang on, before I do: @acoburn, take a look at @gregjan's comment above. If he's right, then is it a question of the backend emitting distinct quads and the HTTP layer creating triples by dropping the graph slot? That's not wrong from the semantic POV (and in that sense the HTTP responses @gregjan is seeing aren't wrong either), but it does seem surprising. Do we need to add ::distinct up there, or should we make a claim about expectations? I tend to the former position.
Or again, @gregjan, am I misunderstanding the problem: are the other Trellis backends you've seen showing this problem?
Good question, I will start one and see.
I just tried and failed to reproduce this on trellis external db (version 2.1-SNAPSHOT). I added three resources at the root; they were each reported only once as "contains" objects. Then I verified the non-duplication one level below root, inside a container.
Just verified it on trellis-extdb:0.3.0-SNAPSHOT image too.
Interesting, so it's not what I suggested above, and what's more, it's not quite what @gregjan suggested. @gregjan, can you show me an example request that you're making to get this behavior? I don't think it could be a case of the triple being added to both graphs, because there's no way I can see them both being flattened and streamed to triples. But I'm not sure what your settings are. Are you sending any Prefer headers?
Also, not to beat the logging horse, but this line does give us the option to turn logging up to TRACE and see what containment triples are being built.
Hmm, my recent test added containers to other containers. Perhaps I should stick to my original case, which was adding binaries to containers. BRB
Here is a link to my session that reproduces it, using container objects this time. I didn't set any Prefer headers in my first GET, but I did in a second request (include PreferContainment). https://gist.github.com/gregjan/115c860921d259ea151c87703e5cccfd
Now when I omit PreferContainment, I see only one contains relationship:
jansen@X1:~$ curl -H "Prefer: return=representation; omit=\"http://www.w3.org/ns/ldp#PreferContainment\"" http://ciber-vs1.umd.edu:10080/cbad3300-337f-40f9-aeaf-592425dc2c9d
<http://ciber-vs1.umd.edu:10080/cbad3300-337f-40f9-aeaf-592425dc2c9d>
<http://www.w3.org/ns/ldp#contains> <http://ciber-vs1.umd.edu:10080/cbad3300-337f-40f9-aeaf-592425dc2c9d/a6a6b759-c0fb-46e0-a1b2-4e13002ff0ec> .
I apologize if this is the intended functionality w/regards to spec!
Aha! And just to be clear, is that the exact same sequence you used for the other backends, including the last request with PreferContainment omitted?
Right, that's what I'm trying to figure out. It's surprising no matter what, so we probably have some work to do, but I'm trying to figure out whether t-c* is doing something wrong or whether Trellis is doing everything right ("semantically" and to the spec) but could be a little more "ergonomic" by dropping duplicate triples. Or something else.
Nope. In fact I never set Prefer headers in the test of the DB back end. That test showed only one contains relationship without a Prefer header set, so I didn't see a need at that point.
Ok, are you saying that this sequence is a test that distinguishes t-c* from the other backends? Duplicate triples from t-c* but not from the others? I'm sorry to be so finicky about this, but it's for a good cause: I'm going to reproduce the behavior.
NOTE: I just ran the same test on the DB back end. If I Prefer to omit containment representations there, then I don't see any contains relationships reported at all.
It looks to me that the other back end (DB) does not consider the contains triples to be a part of the PreferMembership representation, only the PreferContainment representation.
Okay, so t-c* is emitting one containment triple when you omit PreferContainment and two when you do not omit it?
Yes, indeed.
Ok, cool. Can you possibly run that sequence (set up, request with no prefers, request with omit PreferContainment) at logging levels DEBUG for org.trellisldp and TRACE for edu.si? That should enable me to see where Trellis is changing its calls to the backend for those different requests. Then I can figure out what the backend is doing differently, if anything.
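For reference, assuming a Logback binding, those levels can be set in logback.xml or programmatically; this is a guess at an equivalent setup, not project-provided config:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class LoggingSketch {
    public static void main(String[] args) {
        // Equivalent to <logger name="org.trellisldp" level="DEBUG"/> and
        // <logger name="edu.si" level="TRACE"/> in logback.xml; the casts
        // assume SLF4J is bound to Logback at runtime.
        ((Logger) LoggerFactory.getLogger("org.trellisldp")).setLevel(Level.DEBUG);
        ((Logger) LoggerFactory.getLogger("edu.si")).setLevel(Level.TRACE);
    }
}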
Sure thing. Back in 5 mins or so.
I thought this had trace turned on, but I don't see any TRACE lines. Anyway, here is my first attempt with both session and logs: https://gist.github.com/gregjan/ce621282223141e1e53ccb68843f17f7
Turning logging back up and running again.. 10 mins
Ouch, that build failed to start for some reason related to the config provider. I'll have to dig into that tomorrow. Hope that the debug session above helps somewhat in the meantime.
@acoburn, are you using Resource::stream(Collection<IRI>) and ::stream(IRI) to select for different Prefer headers?
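For context, a minimal sketch of the kind of selection I mean, assuming the graph-filtered stream methods on the Trellis Resource API (the method here is illustrative, not actual Trellis code):

import java.util.stream.Stream;
import org.apache.commons.rdf.api.Quad;
import org.trellisldp.api.Resource;
import org.trellisldp.vocabulary.LDP;

public class PreferSketch {
    // Containment quads only, i.e. what an omit=PreferContainment handler
    // would skip; relies on Resource.stream filtering by graph name.
    static Stream<? extends Quad> containmentQuads(Resource resource) {
        return resource.stream(LDP.PreferContainment);
    }
}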
I was running performance tests. What I found after two tests, between which I failed to reset the database, was that the root folder showed two identical ldp:contains relationships. Presumably there is only one contained resource, since the object of both triples was the same. I will look at the C* tables and report what I find there in a follow-up comment.