monterail / ghcr

[Project discontinued] GitHub Code Review Extension
https://github.com/monterail/ghcr
27 stars 1 forks source link

Solve problem with multiple occurrences of same commit #22

Closed Ostrzy closed 8 years ago

Ostrzy commented 11 years ago

Commits can be duplicated - after rebase or cherry picks we end up with multiple instances of same commit. Idea is to make hash from list of changed files, message and probably something else and check also by this hash if commit is already in db.

chytreg commented 10 years ago

How about to check pending commits only in master branch? https://github.com/monterail/mm4-contracts/commit/a3d2bca0ba502a5e852d423eb7ef0fbf3b8de58f vs https://github.com/monterail/mm4-contracts/commit/6ecaced4089746570c27906b976fb489050263af

@Ostrzy @teamon ^

teamon commented 10 years ago

It could be a solution, but sometimes serious errors are found in feature branches before it goes to master and that is z good thing to have

Tymon Tobolski CTO Monterail

On 23 Jan 2014, at 14:04, Dariusz Gertych notifications@github.com wrote:

How about to check pending commits only in master branch? monterail/mm4-contracts@a3d2bca vs monterail/mm4-contracts@6ecaced

@Ostrzy @teamon ^

— Reply to this email directly or view it on GitHub.

chytreg commented 10 years ago

Yes, but take a look at those two commits. How can we check similarity of them?

Ostrzy commented 10 years ago

I guess most problems come from rebases or cherry picks. I have that idea to concatenate commmit message, author data and changed/added/removed files, hash this and save in database. We can than check similiarity by it.

I have also just found that during rebase and cherry pick, Date is not changed, only Commit Date is added. So if github gives us timestamp based on Date and not Commit Date (and I suspect do, because both commits given by @chytreg have exact same timestamps) we can just save timestamp and check timestamp-author uniqueness.

I guess that commits should not appear on same second for anybody who is not using some kind of automation and in this case it should not be reviewed and therefor commits should be appropriately tagged.

teamon commented 10 years ago

Unless commit have been changed during rebase (because of e.g. merge conflicts)

Tymon Tobolski CTO Monterail

On 23 Jan 2014, at 15:03, Michał notifications@github.com wrote:

I guess most problems come from rebases or cherry picks. I have that idea to concatenate commmit message, author data and changed/added/removed files, hash this and save in database. We can than check similiarity by it.

I have also just found that during rebase and cherry pick, Date is not changed, only Commit Date is added. So if github gives us timestamp based on Date and not Commit Date (and I suspect do, because both commits given by @chytreg have exact same timestamps) we can just save timestamp and check timestamp-author uniqueness.

I guess that commits should not appear on same second for anybody who is not using some kind of automation and in this case it should not be reviewed and therefor commits should be appropriately tagged.

— Reply to this email directly or view it on GitHub.

Ostrzy commented 10 years ago

Wasn't there this discussion some time ago already? It was exact same reasoning to ignore merge commits. Conflict can break something in code, but this is not responsibility of reviewer to check this, it's yours. If commit was already accepted, than solving merge conflict should not break it's style and even if it does, we can live with it in better future where you don't have to review same commit multiple times.

teamon commented 10 years ago

I think this is valid reasoning and we should go that way :+1:

Tymon Tobolski CTO Monterail

On 23 Jan 2014, at 15:12, Michał notifications@github.com wrote:

Wasn't there this discussion some time ago already? It was exact same reasoning to ignore merge commits. Conflict can break something in code, but this is not responsibility of reviewer to check this, it's yours. If commit was already accepted, than solving merge conflict should not break it's style and even if it does, we can live with it in better future where you don't have to review same commit multiple times.

— Reply to this email directly or view it on GitHub.

chytreg commented 10 years ago

I making some test here: https://github.com/chytreg/test/commits/payloads What I found out problem occurs only when rebase branch X into master (X is pushed to remote).

I checked timestamps of thoses commits. They are different!!! @Ostrzy your argument is invalid. monterail/mm4-contracts@a3d2bca vs monterail/mm4-contracts@6ecaced

Next take a look for this example:

Is this our duplicated commit? Yes and NO.

Yes because I'm a human and I know that there same.

NO because:

In the commit payload you don't get the diff what has been changed.

chytreg commented 10 years ago

We could try to analyze stats of the similar commits, but we won't be 100% sure that commits are same http://developer.github.com/v3/repos/commits/#get-a-single-commit

chytreg commented 10 years ago

Or wait maybe patch attribute will save us:"patch": "@@ -29,7 +29,7 @@\n....."

szajbus commented 10 years ago

@chytreg I can't reproduce the situation when commit's timestamp changes. Can you describe the steps?

chytreg commented 10 years ago

Master add few commits New branch add few commits Master add few commits Rebase new branch into master

chytreg commented 10 years ago

I have used http://requestb.in to analyze messages from github paylods checkout https://github.com/chytreg/test/commits/payloads

Ostrzy commented 10 years ago

@chytreg is right, in webhook we get date of the commit, not date of autorship. It sucks, but we can get necessary data from api. We can call compare for before and after that we get from webhook. There is all necessary data we need - author email and authorship date. We can combine these two with repo id to create unique string. We should probably put github call to the background process and give commits some state before pending.

szajbus commented 10 years ago

@Ostrzy Good research. This could work.

sheerun commented 10 years ago

I think we can save both in raw form on backend for filtering purposes.

Ostrzy commented 10 years ago

There is small problem with the fact, that we have to access api through user, but it can be resolved by remembering the user that added repository to ghcr and access through his token (we are sure about access to repo). I have no idea what to do when e. g. user stops using app and deauthorize it, as we lose access to api for this repo. Unfortunately I am not able to take care about this issue before June.