Closed kaplun closed 7 years ago
@mihaibivol, @jacquerie what do you think WRT performance?
Ugh, hard to tell. Performance depends on how the record versions are stored, which is something I'm not sure about. Assuming it's a list of JSONPatches, you have to walk them in order of creation to determine for each field the last patch that affected it. In order for this to work, each patch should have metadata about the author of the patch (which I don't think we are currently storing).
If we are going to use the merger that I am writing then it should also preserve the blame for merged list entries. I have to give it a more decent thought but for now it doesn't seem trivial.
So, invenio-records
natively stores the full json as history. Nothing prevents us to archive the diff as well if that can allow for better performance.
The inveniosoftware crew is kickstarting OAIS components such as submission information package support (which is not necessary here) and audit log (which in case we can easily implement as well, if we are able to capture who the user is).
I believe it's rather more complex to implement than having a performance problem. The worst that we could do is store a blame having the size of the actual record. e.g.
{
"titles": ["mihaib", "HARVESTER"],
"authors": [
{"full_name": "mihaib"},
{"full_name": "HARVESTER"}
]
}
This would be the least tricky thing to implement in my current project so that blames are kept even after a merge. This would also mean almost doubling the size of each record.
The alternative would be to compress any object that has a single source without going to the leaf nodes.
e.g.
{
"titles": "HARVESTER",
"authors": ["HARVESTER", "mihaib"]
}
This should not be such a huge overhead but would be super tricky to implement inside the merger and differ.
Assuming that we have a title list and some user adds a new title:
OLD_ONE
{
"titles": [{"value": "Preprint Title", "source": "arXiv"}]
}
NEW_ONE_V1
{
"titles": [{"value": "Cool UserTitle", "source": "cooluser"}, {"value": "Preprint Title", "source": "arXiv"}]
}
NEW_ONE_V2
"titles": [{"value": "Preprint Title", "source": "arXiv"}, {"value": "Cool UserTitle", "source": "cooluser"}]
In this case dictdiffer alone would produce for V1 a blame that looks like this
{
"titles": ["cooluser", "cooluser"]
}
because it won't align the list entities but rather see that cooluser added element 1
and edited element 2
On the other hand, for V2 the blame would look like this
{
"titles": ["HARVESTER_ARXIV", "cooluser"]
}
because we just appended a new thing and didn't touch titles[0]
json-merger
which is still a WIP and adding this use case to its design from now would push even further a first working prototype. Also we don't have any benchmarks to see how long does it take to add a correct put_blame
receiver for each of the records there.dictdiffer
output and the way it produces output for lists then basically it won't be a huge overhead for adding a put_blame
receiver as it would just walk the versions in parallel once.
On another thought, we could do something like:
def put_blame(old_rec, new_rec, source):
diff = dictdiffer.diff(old_rec, new_rec)
store_diff(diff, source)
all_diffs = get_all_diffs_for_record(new_rec)
blame = compute_blame(all_diffs)
# to be fast when displaying also store this
store_blame(blame)
def compute_blame(record, all_diffs):
blame = {}
for diff, source in all_diffs:
# instead of (ADD, 'key1.key2', 'value') put
# (ADD, 'key1.key2', source)
blame_diff = put_source_instead_of_body(diff, source)
rec = dictdiffer.patch(rec, diff)
blame = dictidffer.patch(blame, blame_diff)
return blame
For catalogers and power-users wishing to check the history of a record it would be great to support a sort of git blame of the record:
This could be based on a YAML representation of the record and on
dict-differ
with an audit log. I wonder though about the performance impact.