Git blame view for record

kaplun commented 8 years ago

For catalogers and power users who want to check the history of a record, it would be great to support a sort of git blame view of the record.

This could be based on a YAML representation of the record and on dictdiffer, combined with an audit log. I wonder about the performance impact, though.

kaplun commented 8 years ago

@mihaibivol, @jacquerie what do you think WRT performance?

jacquerie commented 8 years ago

Ugh, hard to tell. Performance depends on how the record versions are stored, which is something I'm not sure about. Assuming it's a list of JSONPatches, you have to walk them in order of creation to determine for each field the last patch that affected it. In order for this to work, each patch should have metadata about the author of the patch (which I don't think we are currently storing).
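To make the walk concrete, here is a minimal sketch under the assumption above (history stored as an ordered list of JSON Patch documents). The `author` and `ops` keys and the `last_author_per_path` name are hypothetical, since we don't currently store author metadata. It also ignores the index shifts caused by inserting into or removing from arrays, so treat it as an illustration only.

```python
def last_author_per_path(patches):
    """Walk patches oldest first; a later patch overwrites the author of a path."""
    blame = {}
    for patch in patches:
        for op in patch["ops"]:
            path = op["path"]
            if op["op"] == "remove":
                # Forget the removed path and everything stored beneath it.
                doomed = [p for p in blame if p == path or p.startswith(path + "/")]
                for known in doomed:
                    del blame[known]
            else:
                # add / replace: this author is now the last one to touch the path.
                blame[path] = patch["author"]
    return blame


# Hypothetical history: each entry pairs JSON Patch operations with an author.
history = [
    {"author": "HARVESTER", "ops": [
        {"op": "add", "path": "/titles/0", "value": {"title": "Preprint Title"}},
        {"op": "add", "path": "/abstract", "value": "Some abstract"},
    ]},
    {"author": "mihaib", "ops": [
        {"op": "replace", "path": "/titles/0", "value": {"title": "Better Title"}},
    ]},
]

blame = last_author_per_path(history)
```

The cost is linear in the total number of operations in the history, which is why caching the resulting blame next to the record is probably worth considering.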

mihaibivol commented 8 years ago

If we are going to use the merger that I am writing, then it should also preserve the blame for merged list entries. I have to give it more thought, but for now it doesn't seem trivial.

kaplun commented 8 years ago

So, invenio-records natively stores the full JSON as history. Nothing prevents us from archiving the diff as well if that allows for better performance.

The inveniosoftware crew is kickstarting OAIS components such as submission information package support (which is not necessary here) and an audit log (which, if needed, we can easily implement as well, provided we are able to capture who the user is).

mihaibivol commented 8 years ago

Point 1

I believe the problem is more the complexity of the implementation than performance. The worst we could do is store a blame that has the same size as the actual record, e.g.

{
  "titles": ["mihaib", "HARVESTER"],
  "authors": [
     {"full_name": "mihaib"},
     {"full_name": "HARVESTER"}
   ]
}

This would be the least tricky thing to implement in my current project so that blames are kept even after a merge. This would also mean almost doubling the size of each record.

Point 2

The alternative would be to compress any object that has a single source without going to the leaf nodes.

e.g.

{
  "titles": "HARVESTER",
  "authors": ["HARVESTER", "mihaib"]
}

This should not be such a huge overhead but would be super tricky to implement inside the merger and differ.
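As a rough sketch of the compression in Point 2 (not how the merger or differ would do it), the function below collapses any subtree of a full, leaf-level blame whose leaves all share one source. The `compress` name and the sample `full_blame` are made up for illustration.

```python
def compress(blame):
    """Collapse every subtree whose leaves all come from a single source."""
    if isinstance(blame, dict):
        children = {key: compress(value) for key, value in blame.items()}
        values = list(children.values())
    elif isinstance(blame, list):
        children = [compress(value) for value in blame]
        values = children
    else:
        return blame  # a leaf is already a source name

    # If every child collapsed to the same source name, so does this node.
    if values and all(isinstance(v, str) for v in values) and len(set(values)) == 1:
        return values[0]
    return children


# Hypothetical leaf-level blame in the style of Point 1.
full_blame = {
    "titles": [{"title": "HARVESTER", "source": "HARVESTER"}],
    "authors": [{"full_name": "HARVESTER"}, {"full_name": "mihaib"}],
}

compressed = compress(full_blame)
```

Doing this as a post-processing step is the easy part; the tricky part mentioned above is keeping the compressed form correct while the merger and differ are still changing the record.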

Point 3

Assuming that we have a title list and some user adds a new title:

OLD_ONE

{
  "titles": [{"value": "Preprint Title", "source": "arXiv"}]
}

NEW_ONE_V1

{
  "titles": [{"value": "Cool UserTitle", "source": "cooluser"}, {"value": "Preprint Title", "source": "arXiv"}]
}

NEW_ONE_V2

{
  "titles": [{"value": "Preprint Title", "source": "arXiv"}, {"value": "Cool UserTitle", "source": "cooluser"}]
}

In this case dictdiffer alone would produce for V1 a blame that looks like this

{
   "titles": ["cooluser", "cooluser"]
}

because it won't align the list entries; instead it sees that cooluser added element 1 and edited element 2.

On the other hand, for V2 the blame would look like this

{
   "titles": ["HARVESTER_ARXIV", "cooluser"]
}

because we just appended a new thing and didn't touch titles[0]
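The difference between the two cases can be reproduced with a simplified stand-in for the index-by-index comparison. This is not dictdiffer itself, only a sketch of the behaviour described above; `positional_blame` and `aligned_blame` are hypothetical names, and the aligned variant matches entries by plain equality, which a real merger would need to do more carefully.

```python
def positional_blame(old, new, old_blame, source):
    """Compare lists index by index: any changed or new position is blamed on source."""
    blame = []
    for i, item in enumerate(new):
        if i < len(old) and old[i] == item:
            blame.append(old_blame[i])  # untouched position keeps its blame
        else:
            blame.append(source)        # edited or appended position
    return blame


def aligned_blame(old, new, old_blame, source):
    """Match entries by equality first, so moved entries keep their blame."""
    return [old_blame[old.index(item)] if item in old else source for item in new]


arxiv_title = {"value": "Preprint Title", "source": "arXiv"}
user_title = {"value": "Cool UserTitle", "source": "cooluser"}

old_one = [arxiv_title]
old_blame = ["HARVESTER_ARXIV"]
new_one_v1 = [user_title, arxiv_title]  # new title prepended
new_one_v2 = [arxiv_title, user_title]  # new title appended

blame_v1 = positional_blame(old_one, new_one_v1, old_blame, "cooluser")
blame_v2 = positional_blame(old_one, new_one_v2, old_blame, "cooluser")
blame_v1_aligned = aligned_blame(old_one, new_one_v1, old_blame, "cooluser")
```

With the positional comparison V1 blames both entries on cooluser, while the aligned variant keeps HARVESTER_ARXIV on the preprint title regardless of where the new title was inserted.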

Last Points :)

We could try to use the json-merger, which is still a WIP, but adding this use case to its design now would push a first working prototype even further away. Also, we don't have any benchmarks showing how long it takes to add a correct put_blame receiver for each of the records there.
If we are OK with dictdiffer's output and the way it handles lists, then adding a put_blame receiver won't be a huge overhead, as it would just walk the versions in parallel once.
- But we would still have a storage overhead that could be up to twice the current size for each edit.
- We need to think about merging blames done by the merger (this shouldn't be a performance bottleneck for the merger itself).
If we want to build a record as a set of revisions, we could run into a huge number of problems, one of them being schema changes.

mihaibivol commented 8 years ago

On second thought, we could do something like:

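import dictdiffer

# store_diff, get_all_diffs_for_record, store_blame and
# put_source_instead_of_body are helpers that still have to be written.

def put_blame(old_rec, new_rec, source):
    # dictdiffer.diff returns a lazy generator, so materialize it before storing
    diff = list(dictdiffer.diff(old_rec, new_rec))
    store_diff(diff, source)
    all_diffs = get_all_diffs_for_record(new_rec)
    blame = compute_blame(all_diffs)
    # to be fast when displaying also store this
    store_blame(blame)

def compute_blame(all_diffs):
    # replay every stored diff, oldest first, on an initially empty blame
    blame = {}
    for diff, source in all_diffs:
        # instead of (ADD, 'key1.key2', 'value') put
        # (ADD, 'key1.key2', source)
        blame_diff = put_source_instead_of_body(diff, source)
        # dictdiffer.patch takes the diff first and the destination second
        blame = dictdiffer.patch(blame_diff, blame)
    return blame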
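The same replay idea can be tried without dictdiffer. The sketch below is a deliberately simplified stand-in: it flattens nested dicts into dotted paths, treats lists as opaque values, and walks the versions once, oldest first. `flatten`, `blame_from_versions` and the sample versions are all hypothetical.

```python
def flatten(record, prefix=""):
    """Map dotted paths to leaf values; lists are treated as opaque leaves."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def blame_from_versions(versions):
    """versions is a list of (record, source) pairs, oldest first."""
    blame = {}
    previous = {}
    for record, source in versions:
        current = flatten(record)
        for path, value in current.items():
            # New or changed leaf: the source of this version gets the blame.
            if path not in previous or previous[path] != value:
                blame[path] = source
        # Leaves that disappeared no longer need a blame entry.
        for path in set(previous) - set(current):
            blame.pop(path, None)
        previous = current
    return blame


versions = [
    ({"titles": {"main": "Preprint Title"}, "pages": 10}, "HARVESTER"),
    ({"titles": {"main": "Preprint Title", "sub": "A Subtitle"}, "pages": 12}, "cooluser"),
]

blame = blame_from_versions(versions)
```

Because lists are opaque here, any change inside a list blames the whole list on the latest source, which is exactly the alignment problem described in Point 3.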

StellaCh commented 7 years ago

Resolved in https://github.com/inspirehep/record-editor/issues/185

inspirehep / inspire-next