iterative / dvc-s3

AWS S3 plugin for dvc

Pushing large files error #99

Open ermolaev94 opened 2 months ago

ermolaev94 commented 2 months ago

Overview

Pushing large files to an S3 bucket leads to the following error:

Argument partNumber must be an integer between 1 and 10000.: An error occurred (InvalidArgument) when calling the UploadPart operation: Argument partNumber must be an integer between 1 and 10000.

I've tried to fix the situation by setting the chunk size according to the AWS documentation:

# ~/.aws/config
[default]
s3 =
    multipart_chunksize = 512

It does not help. I've tried to debug dvc-s3 and checked that the argument is read, but it's not clear how it is used. I've noticed that the "s3" config stayed empty, while "self._transfer_config" was updated.

The problem starts with files of around 800 GB.
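
For context, a quick back-of-the-envelope check (assuming the standard S3 multipart limit of at most 10,000 parts per upload, which S3-compatible endpoints generally mirror):

# Minimum viable chunk size for an object of a given size,
# given the hard cap of 10,000 parts per multipart upload.
size = 800 * 1000**3             # failures reportedly start near 800 GB
min_chunk = size / 10_000        # -> 80,000,000 bytes per part
print(min_chunk / 2**20)         # ~76.3 MiB

If uploads start failing around 800 GB, the effective chunk size in the failing code path is presumably somewhere below ~80 MB.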

shcheklein commented 2 months ago

@ermolaev94 is it S3-compatible storage? (yandex cloud or something)? Just curious if it is something specific about them ....

dberenbaum commented 2 months ago

According to the aws docs, it looks like multipart_chunksize takes either the size in bytes or else requires a size suffix, so could it be as simple as needing to set multipart_chunksize = 512MB?

ermolaev94 commented 2 months ago

> @ermolaev94 is it S3-compatible storage? (yandex cloud or something)? Just curious if it is something specific about them ....

It's Yandex S3; the single-file limit is 5 TB.

> According to the aws docs, it looks like multipart_chunksize takes either the size in bytes or else requires a size suffix, so could it be as simple as needing to set multipart_chunksize = 512MB?

Hm, thanks, I'll rerun with this setting. I will return with an update ASAP.

ermolaev94 commented 2 months ago

> According to the aws docs, it looks like multipart_chunksize takes either the size in bytes or else requires a size suffix, so could it be as simple as needing to set multipart_chunksize = 512MB?

I've tried your suggestion and the error is still the same.

My config file for AWS is the following:

[default]
region = ru-central1
s3 =
    multipart_chunksize = 512MB

I've generated a huge file with the following command:

$ dd if=/dev/urandom of=large_file.bin bs=1M count=1228800

The file is ~1.2 TiB (1,228,800 MiB); with a chunk size of 512 MiB, the part count should be about 2,400.
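
(A sanity check of that estimate, under the same 10,000-part assumption:)

# Part count for a 1,228,800 MiB file uploaded in 512 MiB chunks.
print(1228800 / 512)   # -> 2400.0 parts, well under the 10,000 cap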

Then I've run dvc add & push:

$ dvc add large_file.bin
$ dvc push large_file.bin.dvc
...
Argument partNumber must be an integer between 1 and 10000.: An error occurred (InvalidArgument) when calling the UploadPart operation: Argument partNumber must be an integer between 1 and 10000.

and got the same issue.

Then I've tried to push via the AWS CLI:

$ aws --endpoint-url=https://storage.yandexcloud.net/ s3 cp large_file.bin s3://<bucket-name>/large_file.bin

and it works fine.


I suppose aws s3 cp doesn't work the same way as dvc push does, but I couldn't find the exact call in the dvc-s3 package to reproduce it. Anyway, it looks like there is a bug.
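
For what it's worth, the AWS CLI's transfer layer (s3transfer) is known to grow the part size automatically when a file would otherwise exceed 10,000 parts, which would explain why aws s3 cp succeeds where a fixed chunk size fails. Below is a minimal boto3 sketch of an upload that pins a large enough part size; the endpoint and bucket placeholders come from this thread, and this is not necessarily how dvc-s3 wires its transfer config internally:

import boto3
from boto3.s3.transfer import TransferConfig

# 512 MiB parts keep a ~1.2 TiB object at ~2,400 parts, under the 10,000 cap.
config = TransferConfig(multipart_chunksize=512 * 1024 * 1024)

s3 = boto3.client("s3", endpoint_url="https://storage.yandexcloud.net")
s3.upload_file("large_file.bin", "<bucket-name>", "large_file.bin", Config=config)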