thinkaurelius / titan

Distributed Graph Database
http://titandb.io
Apache License 2.0
5.25k stars 1.02k forks source link

Titan-Lucene returns edges that don't meet LESS_THAN predicate #788

Open rsteppac opened 9 years ago

rsteppac commented 9 years ago

I have a strange issue with titan-lucene. Under a certain condition titan-lucene returns edges that do not meet the predicate requirement LESS_THAN. The purpose of my little application is to sync an external source (LDAP) with a Titan graph. A full synchronization

  1. sends all entries in the LDAP
  2. I "touch" all vertices and edges that come past me during the synchronization process
  3. delete all graph elements that have an update timestamp from before when the synchronization process started (were not touched)

The following is reproducible with titan-lucene 0.5.0 and 0.5.1.

During development I noticed the following behavior. No changes are made to the LDAP contents during this procedure.

  1. delete the Lucene index folder
  2. start full synchronization; at the end of the process no vertices or edges are deleted, as expected
  3. kill java process with SIGKILL (Eclipse standard behavior)
  4. start full synchronization; at the end of the process all edges are deleted, not expected
  5. kill java process with SIGKILL
  6. start full synchronization; at the end of the process no vertices or edges are deleted, as expected

Repeating steps 5 and 6 will always yield the expected result. Only the second run after the index has been created from scratch exhibits the faulty behavior.

The index is created as follows:

PropertyKey propUpdatedAt = tm.makePropertyKey(PropertyKeys.UPDATED_AT.name()).cardinality(Cardinality.SINGLE).dataType(Long.class).make();
tm.buildIndex(PropertyKeys.UPDATEDATIDXMIXEDEDGE.name(), Edge.class).addKey(propUpdatedAt).buildMixedIndex("locallucene");

The index is queried as follows:

// cutoffTimestamp is a long
query.has(PropertyKeys.UPDATED_AT.name(), Compare.LESS_THAN, cutoffTimestamp).edges();

In step 4 the above line returns an iterator that will deliver all edges in the graph. Inverting the predicate to GREATER_THAN does not invert the behavior. It always deletes all edges (that have been touched during synchronization). I have debugged the code and in step 4 the returned edges definitely have a modification timestamp > cutt-off timestamp. My unit tests that insert vertices and edges with predefined modification timestamps and then test the deletion of outdated elements are all green. At the end of the tests the index and DB directory are wiped out. So the tests are equivalent to step 2. The DB configuration is

storage.backend=berkeleyje
storage.directory=/tmp/berkeleydb
storage.transactions=true
schema.default=none
index.locallucene.backend=lucene
index.locallucene.directory=/tmp/searchindex
rsteppac commented 9 years ago

In the meantime I found another scenario that causes the opposite to happen, i.e. no instead edges are matched by the LESS_THEN predicate:

  1. insert several vertices and connect with edges
  2. add some property to edge
  3. commit
  4. load edge and update property
  5. commit
  6. Iterate over all edges (graph.query().edges()) in graph and log cut-off timestamp property to verify there are edges that meet LESS_THEN criteria
  7. select edges by cut-off timestamp -> no edges are selected

As soon as step 4 is removed from the process the expected number of edges gets selected. I am loading the edge in step 4 via a unique-indexed property (that is not the property I am updating), don't know if that makes a difference.

rsteppac commented 9 years ago

You can find a small sample project with a unit test that demonstrates the problem described in my last comment here: https://github.com/rsteppac/titan788 With the stripped down test I was able to see that loading and updating an edge property is not what causes the issue, only the commit after the action.

dalaro commented 9 years ago

Thanks for the test case. I'm going to describe what I saw. Maybe we can compare notes.

When fetching edges LESS_THAN the timestamp, Titan finds three hits in Lucene. Two are expected hits, and the third a third "ghost" document that should have been deleted between tests. The ghost doesn't appear in TermQuery results and can't be deleted by its indexed docid Term, but it does still appear in MatchAllDocsQuery and apparently searches on its UPDATED_AT property. No assertions have failed yet, but this is where the trouble starts in terms of faulty state/behavior. This behavior seems to be dependent on updating a field on the document before attempting to delete it.

The ResultSetIterator of LESS_THAN edges presented to the test code in the failing test method appears empty. It actually contains three elements. But RSI treats the first null edge in its encapsulated list (really an iterator) as the end of the list. In the test, the ghost edge produces the first member of RSI's encapsulated list. Since it exists only in Lucene, not in BDB, it's null. The remaining two edges are sitting in memory in RSI's encapsulated list just behind the null corresponding to the ghost edge, but they're never visible to the test code due to the way RSI works.

The same test works with ES when I insert a sleep after the data writes to let ES's reindex interval elapse.

I'm not sure what Titan's doing wrong with the Lucene API. I made a test case that just uses Lucene types, without any Titan dependencies, in an attempt to boil this down even further: https://github.com/dalaro/lucene-doc-update-test. I don't understand why Lucene behaves the way it does in this test case. I may have misunderstood a method contract or config option somewhere.

rsteppac commented 9 years ago

I have never used the Lucene API myself. But looking at your test and and the API documentation and some posts on blog.mikemccandless.com my understanding is the same as yours, i.e. the test should return the updated document after the second commit. The NRT readers are specifically promoted for the purpose of immediate visibility. You should not even have to commit the writer. I tried with your test and varied the Lucene version and the API use a little, but always with the same effect: The document does not show in the search results after it is updated.

rsteppac commented 9 years ago

Could you please advise on the status of this ticket? Does asigining the ticket to "consideration" mean it won't be resolved any time soon?

mbroecheler commented 9 years ago

We were able to reproduce this in Lucene natively and now are a little bit stumped as to how to resolve this, hence the tag.