Closed gergely-ujvari closed 9 years ago
Cool stuff. I'll have a look at this next week.
Thanks @Treora
Very nice.
I would suggest:
get_all_by_uris
the only public interface. Or better yet, get_by_uri
since it's expected to only ever return one..save()
so that it's just doing the right thing.These last two bullet points are, taken together, something like, "Make the public interface simple and have it just do the right thing rather than making the public interface larger".
Okay, I think I've addressed all your comments. Thank you @Treora and @tilgovi for improving this PR.
I've removed all private function test_cases.
What we currently have is a test case for all public functions. @tilgovi: Do you want to remove them too?
Definitely not! The test cases for the public functions are exactly what we do want!
I've a question. It seems that my changes are sometimes breaking the python 3.3 and 3.4 travis builds.
Traceback (most recent call last):
File "/home/travis/virtualenv/python3.3.5/lib/python3.3/site-packages/nose/case.py", line 198, in runTest
self.test(*self.arg)
File "/home/travis/build/openannotation/annotator-store/tests/test_document.py", line 280, in test_save_merge_documents
assert d1 is None
nose.proxy.AssertionError:
A simple refresh then builds them, should I maybe add some timing mechanism into the test case? Maybe I'm fetching the doc to fast, and ES didn't have a chance to delete it?
If we're testing:
That seems great.
A simple refresh then builds them, should I maybe add some timing mechanism into the test case? Maybe I'm fetching the doc to fast, and ES didn't have a chance to delete it?
Definitely don't do anything time based because that's bound to be fragile. Use the explicit refresh feature of ES to trigger a refresh of the index from your tests if you need to. However, the save function defaults to refresh=True. See where else you might need something similar. Also, see where you might avoid a refresh if you're doing multiple operations in a row.
@tilgovi: Thanks I've added a forgotten refresh for the bulk update
About the test cases:
This is probably fine. Two questions:
Finally, I'm wondering if it's better to merge other documents into self while saving rather than merging self and others into the first result. I think the way it is now a race condition could result in losing data (two processes merge into different docs and delete the others) whereas the other way the worst case is we have duplicates (that get cleaned up at the next save).
Okay, now I always merge into self, and introduced the maximum iteration number.
Any other comments?
Any other comments?
Just a note for the longer term: I think that the approach in general was and is a bit too naive; any annotation can add new equivalences and we don't track which are added when or by who, so undoing a change is impossible.
My preferred long term solution would be to switch to a design where the equivalence between documents can be stated as a 'semantic annotation', and such information can thus also be exchanged and federated. A data structure like we have now would then still be desired, but would just serve as an index for quick lookup, not as the main database.
Fully agreed. I've thought about submitting all this information as annotations.
User X annotated U with the body "link rel=canonical C".
In LD-speak, the body serves as a named graph to keep the statements user-attributed and separate from one another.
But for now I think we should do this. Only question remaining is whether we should whittle the API down even more. Should we have a public get_by_uri
or should we just be using the search
method?
@Treora: Yes, I agree. This PR is merely for fixing the errors of the current implementation.
@tilgovi: I like this annotation-graph idea. :+1:
About get_by_uri
: I think it is good to leave this function as part of the public interface. Yes, currently its implementation is just a search, but as we want to implement a way different method to get a doc for uri in the future (like giving back a component of a graph) that cannot be replaced with a single search, IMO it is good to use this public method as an interface and the others do not have to care what we do inside the class.
Okay, so any other changes? For the record: I'm voting for letting get_by_uri()
be part of the public interface.
good stuff
This PR offers a fix for the faulty document plugin equivalence mechanism.
A few words about the old way of working Every time an annotation was saved with document metadata we checked if there were any previously existing document which has the same url.
If there isn't any, we create a new document with the metadata and save it Otherwise, we merge the links into the first document (and only the first) and discard the rest of the metadata.
Changes The method used for retrieving documents called get_all_by_uris() was not helpful for discovering equivalences. For example if
documentA
hadlinkA
andlinkB
(meaning that the content on those URI is equivalent) anddocumentB
hadlinkB
andlinkC
, then if I'd call thisget_all_by_uris()
with the URIlinkA
then it'd only retrievedocumentA
even though here we also need to getdocumentB
back too.So, instead of this, the get_all_recursive_for_uris() is used which builds a Kleene-star of documents based on the given URIs as seed URIs, with that we have every document we have to work with. (This operation would rarely take long, as this PR offers an aggressive solution to merge equivalent documents into one)
And now saving the document happens inside save_document_data()
The rest of the data is discarded, so as it was before, so that hasn't changed.