treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

[Bug]: Duplicate uploaded image included in diff #6075

Closed dsgibbons closed 1 year ago

dsgibbons commented 1 year ago

What happened?

Current Behavior:

Suppose an image my_image.jpeg already exists in a lakeFS repo, and I upload the same image again via lakectl. When I look at the diff in the UI, it reports my_image.jpeg as modified, even though the contents are identical. I get the impression that once I commit, there are two copies of the file in the underlying S3 storage. This seems like unnecessary duplication.

Steps to Reproduce:

  1. Upload an image to a repo
  2. Commit repo changes
  3. Reupload the image
  4. Check the diff and observe redundant changes

Expected Behavior

When working with Delta Lake, I can upload a new version of a Delta table and only the new Parquet files appear in the diff. I'd expect the same behavior for all file types.

lakeFS Version

0.100.0

Deployment

local

Affected Clients

lakectl 0.101.0

Relevant logs output

No response

Contact Details

No response

N-o-Z commented 1 year ago

@dsgibbons Thank you for reporting the issue. To better understand it, can you please supply the exact flow and commands to reproduce it?

itaiad200 commented 1 year ago

Hey @dsgibbons, the lakeFS deduplication mechanism (the part that identifies duplicates online) was disabled a while ago. While it had its benefits, it became too risky to manage without causing data corruption while still maintaining lakeFS ACID guarantees. To delete stale data, we recommend that users run garbage collection (GC). In your specific case, with the proper configuration, GC would have deleted the previous version of the object.
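
With online deduplication disabled, one way to avoid the redundant diff entry in the first place is to skip re-uploading files whose contents have not changed. The following is only a rough client-side sketch, not a lakeFS or lakectl feature: `file_digest`, `needs_upload`, and the `known_digests` mapping are hypothetical names, and it assumes you keep your own record of the digest of each object you last uploaded.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_upload(path: Path, known_digests: Dict[str, str], key: str) -> bool:
    """Return True if the local file differs from the digest recorded for `key`.

    `known_digests` is a hypothetical, locally maintained mapping from
    object key to the digest of the content that was last uploaded.
    A key with no recorded digest always needs an upload.
    """
    return known_digests.get(key) != file_digest(path)
```

A caller would run the upload only when `needs_upload(...)` returns True, then store the new digest under the same key. This does not remove copies that are already stored; that remains the job of GC.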