minio / minio-py

MinIO Client SDK for Python
https://docs.min.io/docs/python-client-quickstart-guide.html
Apache License 2.0

Data corruption when copying big file through python API copy_object #1415

Closed mat-gas closed 6 months ago

mat-gas commented 6 months ago

We have ISO files stored on our minio server that were uploaded years ago

Those are heavily chunked (8 MiB parts)

When copying one of those ISOs to another bucket, the new file is corrupted (the size differs, and the 2nd chunk is a copy of the start of the first one, just shorter)

example with Win10_22H2_English_x64.iso (size 6115186688 , md5 68c70d7ade5e9ab8510876c1f4bee58a)

$ mc cat s3/vms/iso/Win10_22H2_English_x64.iso > /tmp/Win10_22H2_English_x64.iso

$ ls -al /tmp/Win10_22H2_English_x64.iso
-rw-rw-r-- 1 mat mat 6115186688 avril 30 13:04 /tmp/Win10_22H2_English_x64.iso

$ md5sum /tmp/Win10_22H2_English_x64.iso
68c70d7ade5e9ab8510876c1f4bee58a  /tmp/Win10_22H2_English_x64.iso

copy with python minio API

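from minio import Minio
from minio.commonconfig import CopySource

client = Minio("xxxxxx",
                "xxxxxxxxx",
                "xxxxxxxx",
                secure=False,
                )
# server-side copy of an object larger than 5 GiB
client.copy_object("temp", "copy-win10", CopySource("vms", "iso/Win10_22H2_English_x64.iso"))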

download again, file is corrupted:

mc cat s3/temp/copy-win10 > /tmp/copy-win10

ll /tmp/copy-win10
-rw-rw-r-- 1 mat mat 6115186690 avril 30 13:09 /tmp/copy-win10

md5sum /tmp/copy-win10
5cc45c2871d7946360c630bc22702f9a  /tmp/copy-win10
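The same comparison can be scripted when many objects need checking. The following is a minimal sketch, assuming both files have already been downloaded locally; `file_md5` and `same_content` are helper names introduced here, not part of minio-py.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True when both files have the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)
```

For the files above, `same_content("/tmp/Win10_22H2_English_x64.iso", "/tmp/copy-win10")` would be expected to return `False`, since the two md5sum values differ.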

In the original file, at offset 0x8000, we can see \x01CD001 (the ISO 9660 volume descriptor signature):

hd /tmp/Win10_22H2_English_x64.iso |head
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00008000  01 43 44 30 30 31 01 00  20 20 20 20 20 20 20 20  |.CD001..        |
00008010  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00008020  20 20 20 20 20 20 20 20  43 43 43 4f 4d 41 5f 58  |        CCCOMA_X|
00008030  36 34 46 52 45 5f 45 4e  2d 55 53 5f 44 56 39 20  |64FRE_EN-US_DV9 |
00008040  20 20 20 20 20 20 20 20  00 00 00 00 00 00 00 00  |        ........|
00008050  cb 8f 2d 00 00 2d 8f cb  00 00 00 00 00 00 00 00  |..-..-..........|
00008060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00008070  00 00 00 00 00 00 00 00  01 00 00 01 01 00 00 01  |................|

In the corrupted file, the second chunk starts at offset 0x140000001 (5368709121, the size of the first part); at offset 0x8000 after that, we can see the same data again.
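This check can also be done without a hex dump. The sketch below assumes a standard ISO 9660 image, where the primary volume descriptor sits at offset 0x8000 and carries the identifier `CD001` one byte in; `has_iso_signature` and its `base_offset` parameter are hypothetical names used only for illustration.

```python
ISO_PVD_OFFSET = 0x8000  # sector 16 with 2048-byte sectors
ISO_MAGIC = b"CD001"     # standard identifier, stored after a 1-byte type field


def has_iso_signature(path, base_offset=0):
    """Return True if the ISO 9660 identifier is found relative to base_offset."""
    with open(path, "rb") as handle:
        handle.seek(base_offset + ISO_PVD_OFFSET + 1)
        return handle.read(len(ISO_MAGIC)) == ISO_MAGIC
```

On a healthy image only `base_offset=0` should normally match. If `has_iso_signature("/tmp/copy-win10", base_offset=5368709121)` also returns `True`, the second part most likely repeats the beginning of the source.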

When using mc cp, the problem does not appear (file is still OK).

mc cp s3/vms/iso/Win10_22H2_English_x64.iso s3/temp/copy-win10-2

When looking at the xl.meta file, we can see that the object is chunked differently: many parts of roughly 500 MiB each, instead of 2 parts (1 very big and 1 small).

xl.meta of original file (Win10_22H2_English_x64.iso)

{
  "Versions": [
    {
      "Header": {
        "Flags": 2,
        "ModTime": "2023-03-31T15:52:48.262030574Z",
        "Signature": "0af826f5",
        "Type": 1,
        "VersionID": "00000000000000000000000000000000"
      },
      "Idx": 0,
      "Metadata": {
        "Type": 1,
        "V2Obj": {
          "CSumAlgo": 1,
          "DDir": "m58zKSb4TlK2t+GBv2ZjEQ==",
          "EcAlgo": 1,
          "EcBSize": 1048576,
          "EcDist": [
            8,
            9,
            10,
            11,
            12,
            13,
            14,
            15,
            16,
            1,
            2,
            3,
            4,
            5,
            6,
            7
          ],
          "EcIndex": 4,
          "EcM": 12,
          "EcN": 4,
          "ID": "AAAAAAAAAAAAAAAAAAAAAA==",
          "MTime": 1680277968262030574,
          "MetaSys": {
            "X-Minio-Internal-actual-size": "NjExNTE4NjY4OA=="
          },
          "MetaUsr": {
            "content-type": "application/octet-stream",
            "etag": "f0bfc2ed484b4753671d7f198fc4ab8a-729"
          },
          "PartASizes": [
            8388608,
            8388608,
            8388608,
            8388608,
    ....
            8388608,
            8388608,
            8388608,
            8388608,
            8388608,
            8388608,
            8280064
          ],
          "PartETags": null,
          "PartNums": [
            1,
            2,
            3,
            4,
            5,
     ....
            724,
            725,
            726,
            727,
            728,
            729
          ],
          "PartSizes": [
            8388608,
            8388608,
            8388608,
            8388608,
       ....
            8388608,
            8388608,
            8388608,
            8388608,
            8388608,
            8280064
          ],
          "Size": 6115186688
        },
        "v": 1670873247
      }
    }
  ]
}

xl.meta of copy file through python package

{
  "Versions": [
    {
      "Header": {
        "Flags": 2,
        "ModTime": "2024-04-30T11:07:19.285158502Z",
        "Signature": "116281b7",
        "Type": 1,
        "VersionID": "00000000000000000000000000000000"
      },
      "Idx": 0,
      "Metadata": {
        "Type": 1,
        "V2Obj": {
          "CSumAlgo": 1,
          "DDir": "n9/l3oUlSaelyWQuHBjcvA==",
          "EcAlgo": 1,
          "EcBSize": 1048576,
          "EcDist": [
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12,
            13,
            14,
            15,
            16,
            1,
            2
          ],
          "EcIndex": 4,
          "EcM": 11,
          "EcN": 5,
          "ID": "AAAAAAAAAAAAAAAAAAAAAA==",
          "MTime": 1714475239285158502,
          "MetaSys": {
            "X-Minio-Internal-actual-size": "NjExNTE4NjY5MA==",
            "x-minio-internal-erasure-upgraded": "NC0+NQ=="
          },
          "MetaUsr": {
            "content-type": "application/octet-stream",
            "etag": "fa1a218abd93d72edc528a24b29631bd-2"
          },
          "PartASizes": [
            5368709121,
            746477569
          ],
          "PartETags": null,
          "PartNums": [
            1,
            2
          ],
          "PartSizes": [
            5368709121,
            746477569
          ],
          "Size": 6115186690
        },
        "v": 1711791716
      }
    }
  ]
}

xl.meta from "mc cp" (size is good, file is OK)

{
  "Versions": [
    {
      "Header": {
        "Flags": 2,
        "ModTime": "2024-04-30T11:44:50.810659541Z",
        "Signature": "212c471f",
        "Type": 1,
        "VersionID": "00000000000000000000000000000000"
      },
      "Idx": 0,
      "Metadata": {
        "Type": 1,
        "V2Obj": {
          "CSumAlgo": 1,
          "DDir": "iq/P9//wTZmse8HaQxRnlw==",
          "EcAlgo": 1,
          "EcBSize": 1048576,
          "EcDist": [
            15,
            16,
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12,
            13,
            14
          ],
          "EcIndex": 12,
          "EcM": 11,
          "EcN": 5,
          "ID": "AAAAAAAAAAAAAAAAAAAAAA==",
          "MTime": 1714477490810659541,
          "MetaSys": {
            "X-Minio-Internal-actual-size": "NjExNTE4NjY4OA==",
            "x-minio-internal-erasure-upgraded": "NC0+NQ=="
          },
          "MetaUsr": {
            "content-type": "application/octet-stream",
            "etag": "5ecf8f43cc2b14d6c0aced8814362747-12"
          },
          "PartASizes": [
            509598891,
            509598891,
            509598891,
            509598891,
            509598891,
            509598891,
            509598891,
            509598891,
            509598890,
            509598890,
            509598890,
            509598890
          ],
          "PartETags": null,
          "PartNums": [
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12
          ],
          "PartSizes": [
            509598891,
            509598891,
            509598891,
            509598891,
            509598891,
            509598891,
            509598891,
            509598891,
            509598890,
            509598890,
            509598890,
            509598890
          ],
          "Size": 6115186688
        },
        "v": 1711791716
      }
    }
  ]
}
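The `MetaSys` values in the three xl.meta dumps are base64-encoded strings. Decoding them makes the 2-byte size difference visible directly. This is a small sketch; `decode_meta` is a helper name introduced here.

```python
import base64


def decode_meta(value):
    """Decode a base64-encoded MetaSys value into text."""
    return base64.b64decode(value).decode("utf-8")


# Values copied from the xl.meta dumps above
meta_values = {
    "original actual-size": "NjExNTE4NjY4OA==",
    "python copy actual-size": "NjExNTE4NjY5MA==",
    "erasure-upgraded": "NC0+NQ==",
}

for label, encoded in meta_values.items():
    print(label, "=", decode_meta(encoded))
```

The original decodes to 6115186688 and the Python copy to 6115186690, which matches the `Size` fields; the erasure-upgraded value decodes to `4->5`.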

Your Environment

minio RELEASE.2024-03-30T09-41-56Z

python minio package 7.2.5

mat-gas commented 6 months ago

OK, it might be related to the python API not raising any error/exception actually

compose_object should handle it transparently? (in here https://github.com/minio/minio-py/blob/e10196f5b6dd5910722b52d184986cb2de6a89cd/minio/api.py#L1362 )

client.copy_object("temp", "test-upload2", CopySource("temp", "test-upload"))

or should a hard exception be raised instead of copying data and corrupting the copy?

https://github.com/minio/minio-py/blob/e10196f5b6dd5910722b52d184986cb2de6a89cd/minio/api.py#L1268

or is it a bug in compose_object, where start_bytes is always equal to offset (here 0) at line 1620, whereas the remaining size is still updated at line 1621?
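For comparison, here is a minimal sketch of how the byte ranges of a multipart copy would be expected to advance. It is not the library's code: `copy_ranges` and `MAX_PART_SIZE` are names chosen for illustration, and the only assumption is the S3 limit of 5 GiB per copied part.

```python
MAX_PART_SIZE = 5 * 1024 ** 3  # 5 GiB, the S3 limit for a single copied part


def copy_ranges(object_size, offset=0, max_part_size=MAX_PART_SIZE):
    """Return inclusive (start, end) byte ranges covering object_size bytes."""
    ranges = []
    start = offset
    remaining = object_size
    while remaining > 0:
        length = min(remaining, max_part_size)
        # HTTP ranges are inclusive, so the end is start + length - 1
        ranges.append((start, start + length - 1))
        start += length        # the start must advance with every part
        remaining -= length
    return ranges
```

The key point is that `start` moves forward together with `remaining`; if only `remaining` were updated, every part would begin at `offset` again.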

harshavardhana commented 6 months ago

In a multipart upload, what matters is the final parts, since the server doesn't know the object as a whole.

Can you share a proper reproducer?

balamurugana commented 6 months ago

@mat-gas Enable Minio.trace_on() and share the output.

mat-gas commented 6 months ago

dd if=/dev/zero of=/tmp/zero bs=1G count=6
echo "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" > /tmp/test-upload3
cat /tmp/zero >> /tmp/test-upload3

md5sum /tmp/test-upload3
e3d297afb6078748ce597e88bcab5316  /tmp/test-upload3

ls -al /tmp/test-upload3
-rw-rw-r-- 1 mat mat 6442450982 avril 30 11:52 /tmp/test-upload3

- xl.meta on test-upload3 (source file)

{
  "Versions": [
    {
      "Header": {
        "Flags": 2,
        "ModTime": "2024-04-30T15:27:46.484578361Z",
        "Signature": "1b3500c5",
        "Type": 1,
        "VersionID": "00000000000000000000000000000000"
      },
      "Idx": 0,
      "Metadata": {
        "Type": 1,
        "V2Obj": {
          "CSumAlgo": 1,
          "DDir": "/8hFEFpcQleRWuS3ACrRvg==",
          "EcAlgo": 1,
          "EcBSize": 1048576,
          "EcDist": [
            10,
            11,
            12,
            13,
            14,
            15,
            16,
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9
          ],
          "EcIndex": 7,
          "EcM": 11,
          "EcN": 5,
          "ID": "AAAAAAAAAAAAAAAAAAAAAA==",
          "MTime": 1714490866484578361,
          "MetaSys": {
            "x-minio-internal-erasure-upgraded": "NC0+NQ=="
          },
          "MetaUsr": {
            "content-type": "application/octet-stream",
            "etag": "e3d297afb6078748ce597e88bcab5316"
          },
          "PartASizes": [
            6442450982
          ],
          "PartETags": null,
          "PartNums": [
            1
          ],
          "PartSizes": [
            6442450982
          ],
          "Size": 6442450982
        },
        "v": 1711791716
      }
    }
  ]
}

- copy file : `client.copy_object("temp", "test-upload4", CopySource("temp", "test-upload3"))`

- stat on new file (note the 2 additional bytes): 

client.stat_object("temp", "test-upload4").size
6442450984


- xl.meta on test-upload4 (copied file)

{
  "Versions": [
    {
      "Header": {
        "Flags": 2,
        "ModTime": "2024-04-30T15:39:03.075700792Z",
        "Signature": "5e1c9f8c",
        "Type": 1,
        "VersionID": "00000000000000000000000000000000"
      },
      "Idx": 0,
      "Metadata": {
        "Type": 1,
        "V2Obj": {
          "CSumAlgo": 1,
          "DDir": "Z1VPwIFAT+6+u/SR+S5vkg==",
          "EcAlgo": 1,
          "EcBSize": 1048576,
          "EcDist": [
            13,
            14,
            15,
            16,
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9,
            10,
            11,
            12
          ],
          "EcIndex": 10,
          "EcM": 11,
          "EcN": 5,
          "ID": "AAAAAAAAAAAAAAAAAAAAAA==",
          "MTime": 1714491543075700792,
          "MetaSys": {
            "X-Minio-Internal-actual-size": "NjQ0MjQ1MDk4NA==",
            "x-minio-internal-erasure-upgraded": "NC0+NQ=="
          },
          "MetaUsr": {
            "content-type": "application/octet-stream",
            "etag": "b70e57ea6e8dd07a73ebf717c8b09e3b-2"
          },
          "PartASizes": [
            5368709121,
            1073741863
          ],
          "PartETags": null,
          "PartNums": [
            1,
            2
          ],
          "PartSizes": [
            5368709121,
            1073741863
          ],
          "Size": 6442450984
        },
        "v": 1711791716
      }
    }
  ]
}


logs from `trace_on`:

The problem probably comes from the byte range of part 2 (and later parts), which starts again at 0 instead of continuing from 5 GiB:

**PUT /temp/test-upload4?partNumber=2..**

**X-Amz-Copy-Source-Range: bytes=0-1073741862**

PUT /temp/test-upload4?partNumber=2&uploadId=MjViYjlkZjktMDFjMi00MjBhLWIxZDUtOTdkZDFlZTNmOWExLmVlZmY4ZmQ2LThiZWMtNGI1ZS05MjVlLTk2NWE0Yzk5ZmEzMg HTTP/1.1
X-Amz-Copy-Source: /temp/test-upload3
X-Amz-Copy-Source-If-Match: e3d297afb6078748ce597e88bcab5316
X-Amz-Copy-Source-Range: bytes=0-1073741862

---------START-HTTP---------
HEAD /temp/test-upload3 HTTP/1.1
Host: xxxxxxx:9000
User-Agent: MinIO (Linux; x86_64) minio-py/7.2.5
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20240430T153839Z
Authorization: AWS4-HMAC-SHA256 Credential=REDACTED/20240430/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=REDACTED

HTTP/1.1 200
Accept-Ranges: bytes
Content-Length: 6442450982
Content-Type: application/octet-stream
ETag: "e3d297afb6078748ce597e88bcab5316"
Last-Modified: Tue, 30 Apr 2024 15:27:46 GMT
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 6030d01c843caef4c6622bc183cd7868c4457173df9b0e60c3980e7c86b7b0b4
X-Amz-Request-Id: 17CB18F3FFD63302
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Date: Tue, 30 Apr 2024 15:38:39 GMT

----------END-HTTP----------
---------START-HTTP---------
HEAD /temp/test-upload3 HTTP/1.1
Host: xxxxxxx:9000
User-Agent: MinIO (Linux; x86_64) minio-py/7.2.5
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20240430T153839Z
Authorization: AWS4-HMAC-SHA256 Credential=REDACTED/20240430/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=REDACTED

HTTP/1.1 200
Accept-Ranges: bytes
Content-Length: 6442450982
Content-Type: application/octet-stream
ETag: "e3d297afb6078748ce597e88bcab5316"
Last-Modified: Tue, 30 Apr 2024 15:27:46 GMT
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 6030d01c843caef4c6622bc183cd7868c4457173df9b0e60c3980e7c86b7b0b4
X-Amz-Request-Id: 17CB18F4003531B0
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Date: Tue, 30 Apr 2024 15:38:39 GMT

----------END-HTTP----------
---------START-HTTP---------
POST /temp/test-upload4?uploads= HTTP/1.1
Content-Type: application/octet-stream
Host: xxxxxxx:9000
User-Agent: MinIO (Linux; x86_64) minio-py/7.2.5
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20240430T153839Z
Authorization: AWS4-HMAC-SHA256 Credential=REDACTED/20240430/us-east-1/s3/aws4_request, SignedHeaders=content-type;host;x-amz-content-sha256;x-amz-date, Signature=REDACTED

HTTP/1.1 200
Accept-Ranges: bytes
Content-Length: 332
Content-Type: application/xml
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 6030d01c843caef4c6622bc183cd7868c4457173df9b0e60c3980e7c86b7b0b4
X-Amz-Request-Id: 17CB18F400D1180F
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Date: Tue, 30 Apr 2024 15:38:39 GMT

<?xml version="1.0" encoding="UTF-8"?>

temptest-upload4MjViYjlkZjktMDFjMi00MjBhLWIxZDUtOTdkZDFlZTNmOWExLmVlZmY4ZmQ2LThiZWMtNGI1ZS05MjVlLTk2NWE0Yzk5ZmEzMg

----------END-HTTP----------
---------START-HTTP---------
PUT /temp/test-upload4?partNumber=1&uploadId=MjViYjlkZjktMDFjMi00MjBhLWIxZDUtOTdkZDFlZTNmOWExLmVlZmY4ZmQ2LThiZWMtNGI1ZS05MjVlLTk2NWE0Yzk5ZmEzMg HTTP/1.1
X-Amz-Copy-Source: /temp/test-upload3
X-Amz-Copy-Source-If-Match: e3d297afb6078748ce597e88bcab5316
X-Amz-Copy-Source-Range: bytes=0-5368709120
Host: xxxxxxx:9000
User-Agent: MinIO (Linux; x86_64) minio-py/7.2.5
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20240430T153839Z
Authorization: AWS4-HMAC-SHA256 Credential=REDACTED/20240430/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-copy-source;x-amz-copy-source-if-match;x-amz-copy-source-range;x-amz-date, Signature=REDACTED

HTTP/1.1 200
Accept-Ranges: bytes
Content-Length: 228
Content-Type: application/xml
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 6030d01c843caef4c6622bc183cd7868c4457173df9b0e60c3980e7c86b7b0b4
X-Amz-Request-Id: 17CB18F401288A60
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Date: Tue, 30 Apr 2024 15:38:59 GMT

<?xml version="1.0" encoding="UTF-8"?>

2024-04-30T15:38:59.237Z"34dc525f81892828fdc1b4e815415a34"

----------END-HTTP----------
---------START-HTTP---------
PUT /temp/test-upload4?partNumber=2&uploadId=MjViYjlkZjktMDFjMi00MjBhLWIxZDUtOTdkZDFlZTNmOWExLmVlZmY4ZmQ2LThiZWMtNGI1ZS05MjVlLTk2NWE0Yzk5ZmEzMg HTTP/1.1
X-Amz-Copy-Source: /temp/test-upload3
X-Amz-Copy-Source-If-Match: e3d297afb6078748ce597e88bcab5316
X-Amz-Copy-Source-Range: bytes=0-1073741862
Host: xxxxxxx:9000
User-Agent: MinIO (Linux; x86_64) minio-py/7.2.5
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20240430T153859Z
Authorization: AWS4-HMAC-SHA256 Credential=REDACTED/20240430/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-copy-source;x-amz-copy-source-if-match;x-amz-copy-source-range;x-amz-date, Signature=REDACTED

HTTP/1.1 200
Accept-Ranges: bytes
Content-Length: 228
Content-Type: application/xml
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 6030d01c843caef4c6622bc183cd7868c4457173df9b0e60c3980e7c86b7b0b4
X-Amz-Request-Id: 17CB18F89F391B01
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Date: Tue, 30 Apr 2024 15:39:03 GMT

<?xml version="1.0" encoding="UTF-8"?>

2024-04-30T15:39:03.063Z"29bee6adaa6905269145eb6235e70ef8"

----------END-HTTP----------
---------START-HTTP---------
POST /temp/test-upload4?uploadId=MjViYjlkZjktMDFjMi00MjBhLWIxZDUtOTdkZDFlZTNmOWExLmVlZmY4ZmQ2LThiZWMtNGI1ZS05MjVlLTk2NWE0Yzk5ZmEzMg HTTP/1.1
Content-Type: application/xml
Content-Md5: 88HRuTmwrl0BTBvrjeUSag==
Host: xxxxxxx:9000
User-Agent: MinIO (Linux; x86_64) minio-py/7.2.5
Content-Length: 271
X-Amz-Content-Sha256: ba0b1583e910db231e8a87b1d2e658900ccce1ecca1faeccfbabaf83e82d9667
X-Amz-Date: 20240430T153903Z
Authorization: AWS4-HMAC-SHA256 Credential=REDACTED/20240430/us-east-1/s3/aws4_request, SignedHeaders=content-length;content-md5;content-type;host;x-amz-content-sha256;x-amz-date, Signature=REDACTED

1"34dc525f81892828fdc1b4e815415a34"2"29bee6adaa6905269145eb6235e70ef8"

HTTP/1.1 200
Accept-Ranges: bytes
Content-Length: 351
Content-Type: application/xml
ETag: "b70e57ea6e8dd07a73ebf717c8b09e3b-2"
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 6030d01c843caef4c6622bc183cd7868c4457173df9b0e60c3980e7c86b7b0b4
X-Amz-Request-Id: 17CB18F9834B0910
X-Content-Type-Options: nosniff
X-Xss-Protection: 1; mode=block
Date: Tue, 30 Apr 2024 15:39:03 GMT

<?xml version="1.0" encoding="UTF-8"?>

http://xxxxxxx:9000/temp/test-upload4temptest-upload4"b70e57ea6e8dd07a73ebf717c8b09e3b-2"

----------END-HTTP----------
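
The two `X-Amz-Copy-Source-Range` values in the trace can be checked mechanically. The sketch below is only an illustration of that arithmetic: `parse_range` and `check_ranges` are hypothetical helpers, and the range strings and source size are copied from the trace.

```python
import re

RANGE_RE = re.compile(r"bytes=(\d+)-(\d+)")


def parse_range(header_value):
    """Parse 'bytes=START-END' into an inclusive (start, end) tuple."""
    match = RANGE_RE.fullmatch(header_value.strip())
    if match is None:
        raise ValueError(f"unsupported range: {header_value!r}")
    return int(match.group(1)), int(match.group(2))


def check_ranges(header_values, source_size):
    """Return a list of problems found in a sequence of part ranges."""
    problems = []
    expected_start = 0
    total = 0
    for number, value in enumerate(header_values, start=1):
        start, end = parse_range(value)
        if start != expected_start:
            problems.append(
                f"part {number} starts at {start}, expected {expected_start}"
            )
        total += end - start + 1  # ranges are inclusive
        expected_start = end + 1
    if total != source_size:
        problems.append(f"parts cover {total} bytes, source has {source_size}")
    return problems


observed = ["bytes=0-5368709120", "bytes=0-1073741862"]
for problem in check_ranges(observed, 6442450982):
    print(problem)
```

Both problems show up: part 2 starts at 0 instead of 5368709121, and the parts cover 6442450984 bytes, 2 more than the source, because each inclusive range is one byte longer than intended.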

balamurugana commented 6 months ago

@mat-gas Please check whether PR https://github.com/minio/minio-py/pull/1416 fixes the issue.

mat-gas commented 6 months ago

@balamurugana confirmed the fix solves the issue

could you release a new version of the package on pypi with this fix please? we've had serious data corruption due to this and would need the fix to continue working with our python scripts

anyway, thanks for the quick fix!

minio-trusted commented 6 months ago

v7.2.7 is released.