treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0
4.48k stars 359 forks source link

Merge response contains zero values #3978

Closed itaiad200 closed 1 year ago

itaiad200 commented 2 years ago

We don't calculate the statistics of the merge operation and return a zero list regardless of the real result:

{'summary': {'added': 0, 'changed': 0, 'conflict': 0, 'removed': 0}}

If we want to continue supporting that response body we ought to return the real numbers.

arielshaqed commented 2 years ago

AFAIR this has always been a bit suspect. (I think) I initially added it in some misguided attempt at feature parity with Git. Later it became too hard to support and we removed it; I think one issue is that whole-range operations become tough (does the size of a range appear on the metarange?), counting double deletes is surprisingly hard to do correctly, staging deletes are (or used to be) hard to count, etc.

I think it should be safe to get rid of it. It's been all-zeroes for a long while now, and I have seen ZERO complaints. Adding the dreaded "1.0" label as this is technically a breaking change :-)

ozkatz commented 2 years ago

@arielshaqed I have seen > ZERO complaints :) I don't think there's a technical reason for this not to work (meta-ranges contain key counts for each range). Staging is less relevant since we're talking about a merge operation..

arielshaqed commented 2 years ago

@arielshaqed I have seen > ZERO complaints :)

I don't think there's a technical reason for this not to work (meta-ranges contain key counts for each range). Staging is less relevant since we're talking about a merge operation..

You're right -- I was worried that we were not filling them in, but actually they are! The only issue that remains is that sometimes (hopefully usually) we replace an entire range, and then we don't know how many items it changed. We could overestimate the counts, but this would be wildly inaccurate.

Perhaps make it an option in the API...

nopcoder commented 2 years ago

I think we can suggest calling 'diff' after successful merge to get a merge results and removing the results from the merge API. This way we can keep the current merge optimizations.

arielshaqed commented 2 years ago

I think we can suggest calling 'diff' after successful merge to get a merge results and removing the results from the merge API. This way we can keep the current merge optimizations.

Thinking about it over the weekend, I think that this is exactly right. There is no significant advantage to returning an exact count. Meanwhile, for many (most?) programmatic application you would want to diff in order to do any of the following efficiently:

(If need be, even add a separate counting-diff operation.)

nopcoder commented 1 year ago

lakeFS v0.91.0 release will return summary, but the API was updated to have this field (and members) as optional. Separate PR will be merged later when we would like to drop the summary information from API.

nopcoder commented 1 year ago

https://github.com/treeverse/airflow-provider-lakeFS/pull/46 Upgrade AirFlow integration

itaiad200 commented 1 year ago

@nopcoder please add a reminder for x weeks to delete it from the API.

ozkatz commented 1 year ago

@nopcoder @itaiad200 when do we plan on introducing this breaking change?

itaiad200 commented 1 year ago

Waiting for more clients to upgrade to a version where the Summary field is Optional. It will probably take months.