treeverse / lakeFS


[Bug]: lakectl fs upload Causes OOM Error During Large File Uploads #8088

Open andrijdavid opened 2 months ago

andrijdavid commented 2 months ago

What happened?

The lakectl fs upload command causes an out-of-memory (OOM) error during the upload of large files, resulting in the process being killed by the kernel or the OS freezing.

Environment:

Steps to Reproduce:

Expected behavior

File uploaded successfully

lakeFS version

1.31.1

How lakeFS is installed

GCP

Affected clients

All

Relevant log output

22352 Killed lakectl fs upload --source . --recursive "lakefs://${LAKEFS_REPO_NAME}/${DEFAULT_BRANCH}/" --pre-sign -p 8

Contact details

No response

andrijdavid commented 2 months ago

For both pre-signed URL uploads and direct uploads, the client buffers the data in memory, which is not ideal for large files and triggers an OOM when uploading big files.

https://github.com/treeverse/lakeFS/blob/08fbdf21794ce61f4615a4e8f53248b1014d51fe/cmd/lakectl/cmd/fs_upload.go#L98

https://github.com/treeverse/lakeFS/blob/08fbdf21794ce61f4615a4e8f53248b1014d51fe/pkg/api/helpers/upload.go#L40
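
To illustrate the difference, here is a minimal Go sketch (hypothetical names, not the actual lakectl code): the first function holds the whole file in memory before sending it, while the second passes the open file as the request body so it is streamed from disk.

// Sketch only: contrasts buffering a file in memory with streaming it.
package uploadsketch

import (
	"bytes"
	"net/http"
	"os"
)

// uploadBuffered reads the entire file into memory before sending it,
// so peak RAM grows with file size; this is the pattern behind the OOM above.
func uploadBuffered(url, path string) (*http.Response, error) {
	data, err := os.ReadFile(path) // whole file held in memory
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	return http.DefaultClient.Do(req)
}

// uploadStreaming hands the open file to the HTTP client as the request body,
// so it is read from disk while being sent and memory use stays roughly constant.
func uploadStreaming(url, path string) (*http.Response, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, url, f)
	if err != nil {
		return nil, err
	}
	req.ContentLength = info.Size() // length is known, so no buffering is needed
	return http.DefaultClient.Do(req)
}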

dvnicolasdh commented 1 month ago

We also got "lakectl" killed by the local host's kernel because it tried to use more memory than was available (we were not using the "-p" option).

On a computer with 32GB of RAM (with 15GB already taken by other processes), we were finally able to commit 7.45GB binary files with "-p 1".

We think we will not be able to ingest larger binary files.

It seems that for binary files of about 7GB, lakectl needs a little more than 2x the size of the large binary file in memory per concurrent process requested (if "-p" is not specified, the default seems to be 25).

e.g.: if p=8 and the folder contains only 10GB binary files, we should expect lakectl to require 8 x 10 x 2 = 160GB of RAM to avoid being killed when trying to upload (commit) the folder.
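
Put as a rough Go sketch (this is only our observed heuristic, not a documented formula):

// estimatedPeakRAMGB applies the heuristic above: concurrency (-p) times the
// largest file size times the ~2x overhead we observed. Observation only.
func estimatedPeakRAMGB(p int, maxFileGB float64) float64 {
	return float64(p) * maxFileGB * 2
}

// estimatedPeakRAMGB(8, 10) == 160, matching the example above.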

Is that right? Are there options, or plans, to allow ingestion of binary files larger than the computer's RAM?

idanovo commented 5 days ago

Hi @andrijdavid,

A couple of questions:

1) What's the max size of each object in the directories you are trying to upload?
2) Do you get the same error when uploading a single file?
3) Do you get the same error when running with --pre-sign=false?
4) What OS do you use?

Sorry to bother you, but I want to understand the exact issue you faced, as there are many options.

idanovo commented 3 days ago

@andrijdavid @dvnicolasdh Thanks for reporting this issue. I think we found the cause; it's related to a bug in the go-retryablehttp package we use, which reads files into memory instead of streaming them.

As a temporary workaround, until we release a new version with a fix, you can set lakectl not to use the retryable client by: 1) Adding this to your lakectl.yaml file:

server: 
  retries: 
    enabled: false

Or 2) Running lakectl with the environment variable LAKECTL_SERVER_RETRIES_ENABLED=false
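
For example, reusing the command from the log output above, the second workaround would look like:

LAKECTL_SERVER_RETRIES_ENABLED=false lakectl fs upload --source . --recursive "lakefs://${LAKEFS_REPO_NAME}/${DEFAULT_BRANCH}/" --pre-sign -p 8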

Can you please try this and let me know if it solved your issue?