treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

[Bug]: Import data from an object storage to a repo in LakeFS (keys must be added in strictly increasing order) #5751

Closed: ChiShiang closed this issue 1 year ago

ChiShiang commented 1 year ago

What happened?

Current Behavior:

Hi there,

I am trying to import data from an object storage to a repository in lakeFS. The data in the object storage consists of more than 10,000 images. However, I am encountering a 500 internal server error when trying to import the data.

The error message is as follows:

writing range from entry iterator: writing record: setting key and value: pebble: keys must be added in strictly increasing order: data/[name_of_file]#0,SET, data/#0,SET

I have tried renaming all of the files using numeric ordering or random strings (e.g. SSIDs), but the error still occurred.

I have also noticed that importing fewer than 999 images works fine; however, the error occurs when I try to import the entire dataset of more than 999 images.
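(For context: the "keys must be added in strictly increasing order" error comes from pebble, the library lakeFS uses to write its metadata range files, which requires keys in sorted order. S3-style list operations return at most 1,000 keys per page, so a failure threshold near 1,000 objects suggests the listing order breaks once results span multiple pages. A quick way to check whether a store returns keys in sorted order; the bucket and prefix below are the ones from this thread:)

aws s3api list-objects-v2 --bucket lakefs-imported --prefix project-A/images/ \
    --query 'Contents[].Key' --output text | tr '\t' '\n' > keys.txt
# LC_ALL=C makes the check bytewise, matching S3 key ordering; sort -c
# reports the first out-of-order key, or nothing if the listing is sorted.
LC_ALL=C sort -c keys.txt && echo "keys are sorted"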

Steps to Reproduce:

  1. Upload all of the images to an object storage bucket
  2. Import the data into the repo using lakectl or the web UI (see the example below)
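(For reference, a sketch of the import step, assuming lakectl of roughly this era, where this flow was exposed as lakectl ingest; newer releases expose it as lakectl import. The repo and branch names are placeholders:)

lakectl ingest \
    --from s3://lakefs-imported/project-A/images/ \
    --to lakefs://example-repo/main/images/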

Expected Behavior

All of the data (images) stored in the S3 bucket should be imported into the specified repo in lakeFS.

lakeFS Version

0.96.1

Deployment

Other Cloud Storage (Supports S3 Protocol)

Affected Clients

lakectl 0.99.0

Relevant log output

writing range from entry iterator: writing record: setting key and value: pebble: keys must be added in strictly increasing order: data/[name_of_file]#0,SET, data/#0,SET

Contact Details

hyaline0317@gmail.com

itaiad200 commented 1 year ago

Hey @ChiShiang, thanks for reporting this issue! I wasn't able to reproduce it using the latest lakeFS release; can you try that version and see if it is resolved?

More questions:

  1. You tried both with lakectl (of a higher version?) and the UI, right?
  2. We've seen similar bugs recently with Azure blob storage where funky empty dirs have the same name as some files. It should still work, but it might point us in the right direction: can you share more information on the structure of the imported location? e.g. s3://bucket/prefix/images/ where all images sit flat under that location, or anything else.

ChiShiang commented 1 year ago

Hi @itaiad200, thanks for replying :)

> I wasn't able to reproduce it using the latest lakeFS release; can you try that version and see if it is resolved?

I'll try it and reply with the results after the update finishes. :)

> More questions:
>
>   1. You tried both with lakectl (of a higher version?) and the UI, right?
>   2. We've seen similar bugs recently with Azure blob storage where funky empty dirs have the same name as some files. It should still work, but it might point us in the right direction: can you share more information on the structure of the imported location? e.g. s3://bucket/prefix/images/ where all images sit flat under that location, or anything else.

1. Yes, I got the same error message when I tried it with both lakectl and the UI.
2. Sure, the structure of the imported data looks like s3://bucket/prefix/images/*.jpg.

For example:

-----lakefs-imported [bucket]
               |-----project-A [prefix]
                             |-----images
                                         |-----000000.jpg
                                         |-----000001.jpg
                                         |-----000002.jpg
                                         |-----000003.jpg
                                         |-----000004.jpg
                                         ...
                                         |-----239331.jpg

N-o-Z commented 1 year ago

@ChiShiang can you please tell us which backing object store you are using?

itaiad200 commented 1 year ago

Hey @ChiShiang, we need some more input in order to identify the root cause. We suspect the S3-compatible object storage in question has some caveats, such as the order of objects returned while listing. I suggest running the following commands using the aws cli pointed at that backing storage. Please run the listing once with and once without the sort and compare the outputs.

aws s3 ls s3://lakefs-imported/project-A/images/ | awk '{print $4}'  > raw.txt
aws s3 ls s3://lakefs-imported/project-A/images/ | awk '{print $4}' | sort > sorted.txt
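The two outputs can then be compared directly; no diff output means the raw listing was already in sorted order, while any output pinpoints the keys that came back out of order:

diff raw.txt sorted.txt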

ChiShiang commented 1 year ago

Hi @N-o-Z, @itaiad200

> can you try that version and see if it is resolved?

Unfortunately, the same issue occurred in the latest lakeFS version :\

> can you please tell us which backing object store you are using?

I am currently using the object storage provided by TWCC, the Taiwan Computation Cloud service offered by the National Center for High Performance Computing in Taiwan. I suspect there may be compatibility issues with this object storage, so I am trying to run the same scenario with a MinIO backend.
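(For anyone reproducing that setup, a minimal sketch assuming Docker, default MinIO credentials, and lakeFS's LAKEFS_* environment-variable config mapping; all values here are local-only placeholders:)

# Run a local MinIO instance (default root credentials minioadmin/minioadmin):
docker run -d --name minio -p 9000:9000 minio/minio server /data

# Point lakeFS at it; lakeFS maps config keys to LAKEFS_* environment variables:
export LAKEFS_BLOCKSTORE_TYPE=s3
export LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://localhost:9000
export LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
export LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=minioadmin
export LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=minioadmin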

> I suggest running the following commands using the aws cli pointed at that backing storage. Please run the listing once with and once without the sort and compare the outputs.

Okay, I'll share the results after trying these commands.

N-o-Z commented 1 year ago

@ChiShiang thank you for sharing the information with us. Please update us if you encounter the same issue with MinIO (it should not occur). If you must use the TWCC storage service, we can provide you with a workaround to make the import work.
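(The specific workaround isn't spelled out in this thread. One generic option, purely an assumption here, is to mirror the data onto an S3 implementation that does list keys lexicographically and import from there; the endpoints and bucket names below are placeholders:)

# Download from the TWCC-backed bucket, then upload to a MinIO bucket:
aws s3 sync s3://lakefs-imported/project-A/ ./project-A/ --endpoint-url https://twcc-s3.example.com
aws s3 sync ./project-A/ s3://minio-bucket/project-A/ --endpoint-url http://localhost:9000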

ChiShiang commented 1 year ago

Hi @N-o-Z @itaiad200

Sorry for the late reply. Fortunately, the issue was resolved by running lakeFS with a MinIO backend.

Thank you for your kind help with this issue.

N-o-Z commented 1 year ago

Closing this issue. Please note that https://github.com/treeverse/lakeFS/pull/5840 will also resolve this for S3 implementations that do not list objects lexicographically.