Closed dsgibbons closed 1 year ago
@dsgibbons Thank you for reporting the issue. To better understand it, can you please supply the exact flow and commands to reproduce it?
Hey @dsgibbons, the lakeFS deduplication mechanism (the part that identified duplicates online) was disabled a while ago. While it had its benefits, it became too risky to manage without causing data corruption or compromising lakeFS's ACID guarantees. To delete stale data, we recommend that users run garbage collection (GC). In your specific case, with the proper configuration, it would have deleted the previous version of the object.
What happened?
Current Behavior:
Let's assume there exists an image `my_image.jpeg` in a lakeFS repo. Let's say I upload the image again via `lakectl`. When I look at the diff in the UI, it says that `my_image.jpeg` has been modified (even if the contents are identical). I get the impression that when I commit, I have two copies of the file in the underlying S3. This seems like unnecessary duplication.
Steps to Reproduce:
Expected Behavior
When working with delta lakes, I can upload a new version of the delta lake and only the new parquet files will appear in the diff. I'd expect the same behavior for all file types.
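Since online deduplication is disabled, one possible workaround is to skip unchanged files on the client before uploading. The following is only a minimal sketch of that idea, not part of lakeFS or `lakectl`: the helper names `content_digest` and `files_to_upload` are hypothetical, and it assumes you keep your own record of the SHA-256 digest of each object you previously uploaded.

```python
import hashlib
from pathlib import Path


def content_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_upload(local_files: dict[str, Path],
                    known_digests: dict[str, str]) -> list[str]:
    """Return the object keys whose local content differs from the recorded digest.

    `known_digests` is a hypothetical, user-maintained mapping from object key
    to the digest recorded at the last upload; it is not a lakeFS API.
    """
    changed = []
    for key, path in sorted(local_files.items()):
        if known_digests.get(key) != content_digest(path):
            changed.append(key)
    return changed
```

Only the keys returned by `files_to_upload` would then be passed to the upload step, so an identical `my_image.jpeg` would never be re-uploaded and would not appear as modified in the diff.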
lakeFS Version
0.100.0
Deployment
local
Affected Clients
lakectl 0.101.0
Relevant logs output
No response
Contact Details
No response