inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0
59 stars 69 forks source link

Git blame view for record #1091

Closed kaplun closed 6 years ago

kaplun commented 8 years ago

For catalogers and power-users wishing to check the history of a record it would be great to support a sort of git blame of the record: sso1

This could be based on a YAML representation of the record and on dict-differ with an audit log. I wonder though about the performance impact.

kaplun commented 8 years ago

@mihaibivol, @jacquerie what do you think WRT performance?

jacquerie commented 8 years ago

Ugh, hard to tell. Performance depends on how the record versions are stored, which is something I'm not sure about. Assuming it's a list of JSONPatches, you have to walk them in order of creation to determine for each field the last patch that affected it. In order for this to work, each patch should have metadata about the author of the patch (which I don't think we are currently storing).

mihaibivol commented 8 years ago

If we are going to use the merger that I am writing then it should also preserve the blame for merged list entries. I have to give it a more decent thought but for now it doesn't seem trivial.

kaplun commented 8 years ago

So, invenio-records natively stores the full json as history. Nothing prevents us to archive the diff as well if that can allow for better performance.

The inveniosoftware crew is kickstarting OAIS components such as submission information package support (which is not necessary here) and audit log (which in case we can easily implement as well, if we are able to capture who the user is).

mihaibivol commented 8 years ago

Point 1

I believe it's rather more complex to implement than having a performance problem. The worst that we could do is store a blame having the size of the actual record. e.g.

{
  "titles": ["mihaib", "HARVESTER"],
  "authors": [
     {"full_name": "mihaib"},
     {"full_name": "HARVESTER"}
   ]
}

This would be the least tricky thing to implement in my current project so that blames are kept even after a merge. This would also mean almost doubling the size of each record.

Point 2

The alternative would be to compress any object that has a single source without going to the leaf nodes.

e.g.

{
  "titles": "HARVESTER",
  "authors": ["HARVESTER", "mihaib"]
}

This should not be such a huge overhead but would be super tricky to implement inside the merger and differ.

Point 3

Assuming that we have a title list and some user adds a new title:

OLD_ONE

{
  "titles": [{"value": "Preprint Title", "source": "arXiv"}]
}

NEW_ONE_V1

{
  "titles": [{"value": "Cool UserTitle", "source": "cooluser"}, {"value": "Preprint Title", "source": "arXiv"}]
}

NEW_ONE_V2

"titles": [{"value": "Preprint Title", "source": "arXiv"}, {"value": "Cool UserTitle", "source": "cooluser"}]

In this case dictdiffer alone would produce for V1 a blame that looks like this

{
   "titles": ["cooluser", "cooluser"]
}

because it won't align the list entities but rather see that cooluser added element 1 and edited element 2

On the other hand, for V2 the blame would look like this

{
   "titles": ["HARVESTER_ARXIV", "cooluser"]
}

because we just appended a new thing and didn't touch titles[0]

Last Points :)

mihaibivol commented 8 years ago

On another thought, we could do something like:

def put_blame(old_rec, new_rec, source):
    diff = dictdiffer.diff(old_rec, new_rec)
    store_diff(diff, source)
    all_diffs = get_all_diffs_for_record(new_rec)
    blame = compute_blame(all_diffs)
    # to be fast when displaying also store this
    store_blame(blame)

def compute_blame(record, all_diffs):
    blame = {}
    for diff, source in all_diffs:
       # instead of (ADD, 'key1.key2', 'value') put
       # (ADD, 'key1.key2', source)
       blame_diff = put_source_instead_of_body(diff, source)
       rec = dictdiffer.patch(rec, diff)
       blame = dictidffer.patch(blame, blame_diff)
   return blame
StellaCh commented 6 years ago

Resolved in https://github.com/inspirehep/record-editor/issues/185