openannotation / annotator-store

A backend store for the Annotator
http://annotateit.org/
MIT License
177 stars 66 forks source link

110 multiple document equivalences #112

Closed gergely-ujvari closed 9 years ago

gergely-ujvari commented 9 years ago

This PR offers a fix for the faulty document plugin equivalence mechanism.

A few words about the old way of working Every time an annotation was saved with document metadata we checked if there were any previously existing document which has the same url.

If there isn't any, we create a new document with the metadata and save it Otherwise, we merge the links into the first document (and only the first) and discard the rest of the metadata.

Changes The method used for retrieving documents called get_all_by_uris() was not helpful for discovering equivalences. For example if documentA had linkA and linkB (meaning that the content on those URI is equivalent) and documentB had linkB and linkC, then if I'd call this get_all_by_uris() with the URI linkA then it'd only retrieve documentA even though here we also need to get documentB back too.

So, instead of this, the get_all_recursive_for_uris() is used which builds a Kleene-star of documents based on the given URIs as seed URIs, with that we have every document we have to work with. (This operation would rarely take long, as this PR offers an aggressive solution to merge equivalent documents into one)

And now saving the document happens inside save_document_data()

The rest of the data is discarded, so as it was before, so that hasn't changed.

Treora commented 9 years ago

Cool stuff. I'll have a look at this next week.

gergely-ujvari commented 9 years ago

Thanks @Treora

tilgovi commented 9 years ago

Very nice.

I would suggest:

These last two bullet points are, taken together, something like, "Make the public interface simple and have it just do the right thing rather than making the public interface larger".

gergely-ujvari commented 9 years ago

Okay, I think I've addressed all your comments. Thank you @Treora and @tilgovi for improving this PR.

gergely-ujvari commented 9 years ago

I've removed all private function test_cases.

What we currently have is a test case for all public functions. @tilgovi: Do you want to remove them too?

tilgovi commented 9 years ago

Definitely not! The test cases for the public functions are exactly what we do want!

gergely-ujvari commented 9 years ago

I've a question. It seems that my changes are sometimes breaking the python 3.3 and 3.4 travis builds.

Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.3.5/lib/python3.3/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/build/openannotation/annotator-store/tests/test_document.py", line 280, in test_save_merge_documents
    assert d1 is None
nose.proxy.AssertionError: 

(https://github.com/openannotation/annotator-store/blob/110-multiple-document-equivalences/tests/test_document.py#L280)

A simple refresh then builds them, should I maybe add some timing mechanism into the test case? Maybe I'm fetching the doc to fast, and ES didn't have a chance to delete it?

tilgovi commented 9 years ago

If we're testing:

That seems great.

tilgovi commented 9 years ago

A simple refresh then builds them, should I maybe add some timing mechanism into the test case? Maybe I'm fetching the doc to fast, and ES didn't have a chance to delete it?

Definitely don't do anything time based because that's bound to be fragile. Use the explicit refresh feature of ES to trigger a refresh of the index from your tests if you need to. However, the save function defaults to refresh=True. See where else you might need something similar. Also, see where you might avoid a refresh if you're doing multiple operations in a row.

gergely-ujvari commented 9 years ago

@tilgovi: Thanks I've added a forgotten refresh for the bulk update

About the test cases:

tilgovi commented 9 years ago

This is probably fine. Two questions:

tilgovi commented 9 years ago

Finally, I'm wondering if it's better to merge other documents into self while saving rather than merging self and others into the first result. I think the way it is now a race condition could result in losing data (two processes merge into different docs and delete the others) whereas the other way the worst case is we have duplicates (that get cleaned up at the next save).

gergely-ujvari commented 9 years ago

Okay, now I always merge into self, and introduced the maximum iteration number.

gergely-ujvari commented 9 years ago

Any other comments?

Treora commented 9 years ago

Any other comments?

Just a note for the longer term: I think that the approach in general was and is a bit too naive; any annotation can add new equivalences and we don't track which are added when or by who, so undoing a change is impossible.

My preferred long term solution would be to switch to a design where the equivalence between documents can be stated as a 'semantic annotation', and such information can thus also be exchanged and federated. A data structure like we have now would then still be desired, but would just serve as an index for quick lookup, not as the main database.

tilgovi commented 9 years ago

Fully agreed. I've thought about submitting all this information as annotations.

User X annotated U with the body "link rel=canonical C".

In LD-speak, the body serves as a named graph to keep the statements user-attributed and separate from one another.

tilgovi commented 9 years ago

But for now I think we should do this. Only question remaining is whether we should whittle the API down even more. Should we have a public get_by_uri or should we just be using the search method?

gergely-ujvari commented 9 years ago

@Treora: Yes, I agree. This PR is merely for fixing the errors of the current implementation.

@tilgovi: I like this annotation-graph idea. :+1:

About get_by_uri : I think it is good to leave this function as part of the public interface. Yes, currently its implementation is just a search, but as we want to implement a way different method to get a doc for uri in the future (like giving back a component of a graph) that cannot be replaced with a single search, IMO it is good to use this public method as an interface and the others do not have to care what we do inside the class.

gergely-ujvari commented 9 years ago

Okay, so any other changes? For the record: I'm voting for letting get_by_uri() be part of the public interface.

tilgovi commented 9 years ago

good stuff