Closed: brianjlowe closed this issue 1 year ago
The delay is suspiciously similar to the delay incurred when listing the named graphs in use in a TDB instance.
Indeed, grabbing thread dumps during the delays consistently shows calls to getGraphURIs(), which is very slow on TDB. Not sure yet why this is happening on login/logout and not on other requests.
Login example:
    "ajp-nio-127.0.0.1-8009-exec-9" #50 daemon prio=5 os_prio=0 cpu=83850.94ms elapsed=610488.84s tid=0x00007fc1d8be4800 nid=0x3460a5 runnable [0x00007fbfe95e5000]
       java.lang.Thread.State: RUNNABLE
            at java.util.HashMap.hash(java.base@11.0.18/HashMap.java:340)
            at java.util.HashMap.put(java.base@11.0.18/HashMap.java:608)
            at java.util.HashSet.add(java.base@11.0.18/HashSet.java:220)
            at org.apache.jena.atlas.iterator.FilterUnique.test(FilterUnique.java:32)
            at org.apache.jena.atlas.iterator.Iter$1.hasNext(Iter.java:228)
            at org.apache.jena.atlas.iterator.Iter.hasNext(Iter.java:1067)
            at org.apache.jena.atlas.iterator.Iter$2.hasNext(Iter.java:347)
            at org.apache.jena.util.iterator.WrappedIterator.hasNext(WrappedIterator.java:90)
            at org.apache.jena.util.iterator.MapFilterIterator.hasNext(MapFilterIterator.java:54)
Logout example:

    "ajp-nio-127.0.0.1-8009-exec-2" #43 daemon prio=5 os_prio=0 cpu=92412.14ms elapsed=610693.55s tid=0x00007fc1d86f1000 nid=0x34609e runnable [0x00007fbfe9cec000]
       java.lang.Thread.State: RUNNABLE
            at org.apache.jena.tdb.base.buffer.RecordBuffer._get(RecordBuffer.java:107)
            at org.apache.jena.tdb.base.buffer.RecordBuffer.get(RecordBuffer.java:53)
            at org.apache.jena.tdb.base.recordbuffer.RecordRangeIterator.hasNext(RecordRangeIterator.java:119)
            at org.apache.jena.atlas.iterator.Iter$2.hasNext(Iter.java:347)
            at org.apache.jena.tdb.sys.DatasetControlMRSW$IteratorCheckNotConcurrent.hasNext(DatasetControlMRSW.java:111)
            at org.apache.jena.atlas.iterator.Iter$2.hasNext(Iter.java:347)
            at org.apache.jena.atlas.iterator.Iter$1.hasNext(Iter.java:223)
            at org.apache.jena.atlas.iterator.Iter.hasNext(Iter.java:1067)
            at org.apache.jena.atlas.iterator.Iter$2.hasNext(Iter.java:347)
            at org.apache.jena.util.iterator.WrappedIterator.hasNext(WrappedIterator.java:90)
            at org.apache.jena.util.iterator.MapFilterIterator.hasNext(MapFilterIterator.java:54)
Maybe UF sample data can be used to reproduce the issue - https://github.com/vivo-project/sample-data/pull/9
The fact that this is not a problem on all requests may be related to the graph URI cache in RDFServiceJena. The changeSetUpdate() method in RDFServiceTDB sets a dirty flag on this cache whenever a change set is written to the triple store. A change is written whenever someone logs in (not sure about logout), because we record the login count and the timestamp of the last login.
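The pattern described above can be sketched as follows. This is a minimal, simplified illustration of the dirty-flag behavior, not the actual RDFServiceJena/RDFServiceTDB code; the class and method names besides changeSetUpdate() and getGraphURIs() are assumptions, and the supplier stands in for the slow TDB graph listing.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch of the dirty-flag cache behavior described above.
public class DirtyFlagCache {
    private Set<String> cached = null;
    private boolean dirty = true;
    private int rebuilds = 0; // counts how often the slow listing runs

    private final Supplier<Set<String>> listAllGraphs; // stands in for the slow TDB scan

    public DirtyFlagCache(Supplier<Set<String>> listAllGraphs) {
        this.listAllGraphs = listAllGraphs;
    }

    // Any committed change set (e.g. recording a login timestamp) dirties the cache.
    public void changeSetUpdate() {
        dirty = true;
    }

    // The next request after any write pays the full re-listing cost.
    public Set<String> getGraphURIs() {
        if (dirty) {
            cached = listAllGraphs.get();
            rebuilds++;
            dirty = false;
        }
        return cached;
    }

    public int getRebuildCount() {
        return rebuilds;
    }
}
```

Under this model, a login writes a change set, dirties the cache, and the very next call to getGraphURIs() on that request pays the full re-listing cost, which would explain why logins in particular are slow.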
It would initially seem that any request immediately following an update or edit should be similarly slow, not just logins.
It seems that we could at least make that cache invalidation a bit smarter, because the change set offers easy access to the named graphs it's modifying. If a change set only adds triples to named graphs, we can add those graph URIs to the cache directly. Only if a named graph is involved in a triple removal do we need to invalidate the cache, and even then it would probably be faster to test whether that graph is empty with something like "SELECT ?s ?p ?o WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 1" or even "ASK { GRAPH ?g { ?s ?p ?o } }" rather than re-listing all the graphs in the store.
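The smarter invalidation could look something like the sketch below. This is illustrative only: GraphURICache and onChangeSet() are hypothetical names, and the Predicate stands in for an ASK { GRAPH ?g { ?s ?p ?o } } probe against the store.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of per-graph invalidation instead of a global dirty flag.
public class GraphURICache {
    private final Set<String> graphURIs = new HashSet<>();
    private final Predicate<String> graphIsNonEmpty; // stands in for an ASK query

    public GraphURICache(Set<String> initial, Predicate<String> graphIsNonEmpty) {
        this.graphURIs.addAll(initial);
        this.graphIsNonEmpty = graphIsNonEmpty;
    }

    // Called for each committed change set, using the graph URIs it exposes.
    public void onChangeSet(Set<String> graphsWithAdditions, Set<String> graphsWithRemovals) {
        // Additions can never make a graph disappear, so just record them.
        graphURIs.addAll(graphsWithAdditions);
        // A removal can only empty the graph it touched; probe just that graph
        // rather than re-listing every graph in the store.
        for (String g : graphsWithRemovals) {
            if (!graphIsNonEmpty.test(g)) {
                graphURIs.remove(g);
            }
        }
    }

    public Set<String> getGraphURIs() {
        return new HashSet<>(graphURIs);
    }
}
```

The key point is that the expensive full listing never runs on the request path; the cache only does constant work per graph mentioned in the change set.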
Another option might be to rebuild the cache in the background when update activity is quiescent the way the search indexer works. I don't think there are any situations where a request will care that the graph URI list is slightly stale; in older versions of VIVO I remember freezing the graph URI list after startup and the only ill effect was that visiting the Manage Jena Models page didn't show new graphs that had appeared on ingest. Normal requests were unaffected.
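A background rebuild might be sketched roughly like this. Again this is an assumption-laden illustration, not VIVO code: request threads read an immutable snapshot of the graph URI list, and a dirty cache triggers an asynchronous rebuild rather than blocking the next request. The Callable stands in for the slow TDB listing.

```java
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: serve a possibly-stale graph URI list and refresh it
// off the request thread, the way the search indexer defers its work.
public class BackgroundGraphURICache {
    private final AtomicReference<Set<String>> snapshot;
    private final ExecutorService rebuilder = Executors.newSingleThreadExecutor();
    private final Callable<Set<String>> listAllGraphs; // stands in for the slow TDB scan

    public BackgroundGraphURICache(Set<String> initial, Callable<Set<String>> listAllGraphs) {
        this.snapshot = new AtomicReference<>(initial);
        this.listAllGraphs = listAllGraphs;
    }

    // Request threads read the current snapshot; this never blocks on a rebuild.
    public Set<String> getGraphURIs() {
        return snapshot.get();
    }

    // Called when a change set dirties the cache; the rebuild runs asynchronously.
    public Future<?> markDirty() {
        return rebuilder.submit(() -> {
            try {
                snapshot.set(listAllGraphs.call());
            } catch (Exception e) {
                // On failure, keep serving the stale snapshot.
            }
        });
    }

    public void shutdown() {
        rebuilder.shutdown();
    }
}
```

Requests between the write and the rebuild's completion would see a slightly stale list, which, per the observation above about freezing the list after startup, appears harmless for normal requests.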
It would be nice not to block incoming requests at all on these cache rebuilds.
**Describe the bug**
On VIVO 1.13, after submitting email and password, login takes many seconds to complete and redirect on a populated instance: approximately 5 seconds on an instance with 10 million triples. Thanks to Michael Bentz for originally reporting this issue on Slack.

**To Reproduce**

**Environment (please complete the following information):**