minio / mc

Unix like utilities for object store
https://min.io/download
GNU Affero General Public License v3.0
2.84k stars 546 forks source link

`mc cp --recursive bucket/directory/prefix- .` not working as expected #4452

Closed WesselAtWork closed 11 months ago

WesselAtWork commented 1 year ago

Expected behaviour

I expected mc cp --recursive bucket/directory/prefix- . to recursively copy all the files /directory/prefix-* into the local directory.

Actual behaviour

Instead mc assumed that I meant to add a leading / to the copy query ( mc cp --recursive bucket/directory/prefix-/) and added it for me.

I could not find a flag or documentation on do a prefix copy and noticed that most everything that refers to a PATH is effected ( mv for instance )

I was surprised because listing behaved differently: mc ls bucket/directory/prefix- prints the /directory/prefix-* files correctly.

Steps to reproduce the behaviour

Attempt to refer to a prefix instead of a directory when using mc cp or mc mv

Did I misunderstand prefixes?

mc --version

mc version RELEASE.2023-01-11T03-14-16Z (commit-id=14c2e506fa78b53fb6db88bcf87d8f6d3fb6989e)
Runtime: go1.19.4 linux/amd64
Copyright (c) 2015-2023 MinIO, Inc.

System information

Using the container image

Workaround

Maybe actually the intended way but I am unsure of how heavy this request is. Image you have 10 000 objects in the bucket, does it list the entire bucket before it finds?

mc find bucket/directory --name "prefix-*" --exec "mc cp {} ."

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

WesselAtWork commented 1 year ago

Further investigation

It must be somewhere stat-ing happens. The exact error is mc: <ERROR> Unable to validate source alias/bucket/folder/object-pre: Object does not exist

I think I found the actual place it's breaking on : https://github.com/minio/mc/blob/ca1d00741b954210c264effb576ef1ed8567722c/cmd/cp-url-syntax.go#L60-L76

The Input Validation step. We also see the Unable to validate source in the error block.

Interestingly, I think cp should be fine with prefixes! This is the listing block:

https://github.com/minio/mc/blob/ca1d00741b954210c264effb576ef1ed8567722c/cmd/cp-url.go#L176-L186

It's the only place in the copy code that I could find that is List-ing. It's passing the parameters + the recursive flag into the function so I would expect it to behave exactly the same as mc ls -r.

When it breaks into the listing code this looks to also be able to handle it: https://github.com/minio/mc/blob/ca1d00741b954210c264effb576ef1ed8567722c/cmd/client-fs.go#L735-L737


I think I misunderstood what was "wrong" with the process. It doesn't look like mc is adding a leading delimiter, but that it is instead breaking on "invalid" input.

I'll try to find some time to test locally what happens if it doesn't break on the validation step.

WesselAtWork commented 1 year ago

It works! 🥇

..\mc.exe cp --recursive alias/bucket/junk/xza .
mc.exe: WARNING: Copy stat verification disabled
mc.exe: WARNING: CopySyntaxTypeC stat verification disabled
.../bucket/junk/xzazz: 5.28 MiB / 5.28 MiB [=========================================================================================================] 514.43 KiB/s 10s 

Checked file lists with diff. No problems :)


Accomplished via:

cp-url-syntax.go

    // Verify if source(s) exists.
    console.Infoln("WARNING: Copy stat verification disabled")
    //for _, srcURL := range srcURLs {
    //  var err *probe.Error
    //  if !isRecursive {
    //      _, _, err = url2Stat(ctx, srcURL, versionID, false, encKeyDB, timeRef, isZip)
    //  } else {
    //      _, _, err = firstURL2Stat(ctx, srcURL, timeRef, isZip)
    //  }
    //  if err != nil {
    //      msg := "Unable to validate source `" + srcURL + "`"
    //      if versionID != "" {
    //          msg += " (" + versionID + ")"
    //      }
    //      msg += ": " + err.ToGoError().Error()
    //      console.Fatalln(msg)
    //  }
    //}

cp-url-syntax

    console.Infoln("WARNING: CopySyntaxTypeC stat verification disabled")
    //for _, srcURL := range srcURLs {
    //  c, srcContent, err := url2Stat(ctx, srcURL, "", false, keys, timeRef, isZip)
    //  fatalIf(err.Trace(srcURL), "Unable to stat source `"+srcURL+"`.")
    //
    //  if srcContent.Type.IsDir() {
    //      // Require --recursive flag if we are copying a directory
    //      if !isRecursive {
    //          operation := "copy"
    //          if isMvCmd {
    //              operation = "move"
    //          }
    //          fatalIf(errInvalidArgument().Trace(srcURL), fmt.Sprintf("To %v a folder requires --recursive flag.", operation))
    //      }
    //
    //      // Check if we are going to copy a directory into itself
    //      if isURLContains(srcURL, tgtURL, string(c.GetURL().Separator)) {
    //          operation := "Copying"
    //          if isMvCmd {
    //              operation = "Moving"
    //          }
    //          fatalIf(errInvalidArgument().Trace(), fmt.Sprintf("%v a folder into itself is not allowed.", operation))
    //      }
    //  }
    //}

This was just a quick test to see if the copy code could cope if I disabled the stat checks.

Obviously some work is required but at least it is confirmed: copy can do prefixes

Side note:

There was a noticeable speed up in the execution of mc. Normally it hangs for a few seconds and then it starts running. I think the typeC cheeking block is slow. It needs to iterate through the entire source list (potentially 1000s of objects) before execution can continue, and this doesn't look like it's set up to be parallel.

The typeC block only checks one item with a HEAD and a LIST so it might just have been confirmation bias from my side

WesselAtWork commented 1 year ago

Core Issue

You don't need to disable the first block Copy Stat verification it's not the issue

The problem is in the second block.

Specifically in the url2Stat function which calls the s3 client Stat function which contains this line

https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1620

It looks like mc is force checking delimited prefixes.

The correct thing to do would be to remove this line

The Problem

Can you help me define what these terms mean in minio?

They way I understand it in minio/s3 systems the object name is the entire path starting at the root of the bucket. There is nothing special about the /, it is just a character that we like to use because it reminds us of how files systems work.

The reason I bring this up is because there are 2 stat2url function that behave subtly differently and confusing comments that seems to refer to conflicting things.

url2stat / Stat

What is the Stat() function supposed to do?

The way I interpret the code:

The problem is that most of the comments use the terms directories and prefixes interchangeably

https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1564-L1566 https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1596-L1602 https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1604-L1605 https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1619-L1620 https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1633-L1636

But the INTENT of the code is only to determine if it is a Object Key OR Object Directory AND NOT Object Prefix.

To make this code do what the comments imply is actually quite simple.

All you need to do is remove one line: https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1620

Now this code will work with prefixes, directories, and objects.

Prefixes are now dirs

There is only one issue: You can't tell a folder/ apart from a pre

These will both return as isDir objects.

To me this is correct:

If I have this object structure

bucket/something.txt
bucket/something.png
bucket/something
bucket/something/esle.txt
bucket/something/else.txt

I can query it in multiple ways:

Copying now becomes:

The prefix bucket/something acts like a "Directory" holding the other files. It's just not a traditional fs directory.

I feel like this is a valid way of talking about prefixes.

If you want to talk about the /directory you need to be explicit.

But It won't work

The tests pass so why can't I remove that line?

It seems the code is aware of this subtle difference of how Stat and List behave.

Observe: https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/cp-url-syntax.go#L63-L67

As you can see, when isRecursive is flagged, we use a different function. Why? The description of the functions are more-or-less the same?

The description of url2stat seem to imply that it is a valid way to use it? https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-url.go#L188

It's not so different from firsturl2stat https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-url.go#L204

The answer is that it miss uses of the "prefix" term.

The way the cmd is set up in general (you can find this isRecursive check everywhere) it uses url2stat to refer to fs constructs (files<->objects and folders<->directories) and it uses firsturl2stat to infer to minio constructs (prefixes)

So how fix?

Well the original issue was with the Type C Validity check.

https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/cp-url-syntax.go#L170-L208

Following the crumbs back we get to this strategy method

https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/cp-url.go#L54-L90

Which uses the isRecursive / firstURL2Stat check.

The problem is then the Type C Validity check in cp-url-syntax doesn't use the same validity check that thecheckCopySyntax and the guessCopyURLType are using leading to it erroneously disregarding valid input.

My PR will target this function.

Workaround

This leads to a pretty funny workaround

Because the code can deal with prefixes and the only thing stopping mc is the checkCopySyntaxTypeC validity check

if your-prefix+delimiter is a valid prefix, mc will continue


.
├── my-prefix-a.txt
├── my-prefix-b.txt
├── my-prefix-c.txt
└── my-prefix-d.txt

If you try to mc cp --recursive bucket/my-prefix . it will error with <ERROR> Unable to stat source 'bucket/my-prefix' Object does not exist.

but if you make my-prefix a "valid" target you can trick mc into downloading the files

.
├── my-prefix-a.txt
├── my-prefix-b.txt
├── my-prefix-c.txt
├── my-prefix-d
└── my-prefix/
    └── anything.txt

Now mc cp --recursive bucket/my-prefix . will succeed and you will get your files :^)
As long as mc can validate the bucket/my-prefix/ path it will work

Workaround shell script (untested but gets the idea across)

mc_force_prefix() {
    echo "" | mc cp - "$1/tmp" &&
    mc cp --recursive "$1" "$2" &&
    mc rm "$1/tmp" &&
    rm "$2/tmp"
}
zveinn commented 11 months ago

@WesselAtWork any chance you can try the latest release and see if this is still a problem ? I recently refactored these methods in a major way and since it's still fresh in my head I would be more then willing to have another look.

WesselAtWork commented 11 months ago

Using master (b82f6c60b21306577864c08097bd3983bf28fda2)

mc cp --recursive alias/bucket/junk/xza .
.../bucket/junk/xzazz: 5.28 MiB / 5.28 MiB ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 445.26 KiB/s 12s

Phenomenal! It works fantastic. 🥇

Testing

Testing it with a "difficult edge case key list"

object name -- data content # comments

bucket/something/prefix         -- "lorem ipsom"
bucket/something/prefixjunk     -- "lorem ipsom"
bucket/something/prefix/        -- "lorem ipsom"   # '/' is arbitrary, this is a valid object name that contains data
bucket/something/prefix/junk1   -- "lorem ipsom"
bucket/something/prefix/junk2   -- "lorem ipsom"
bucket/something/prefix-junk1   -- "lorem ipsom"
bucket/something/prefix-junk2   -- "lorem ipsom"
bucket/something/prefix-1/junk  -- "lorem ipsom"   # note: a 0B prefix-1/ was not manually created
bucket/something/prefix-2/junk  -- "lorem ipsom"   # note: a 0B prefix-2/ was not manually created

Listing this results in:

mc ls alias/bucket/something/prefix
[TZ]    12B STANDARD prefix
[TZ]    12B STANDARD prefix-junk1
[TZ]    12B STANDARD prefix-junk2
[TZ]    12B STANDARD prefixjunk
[TZ]     0B prefix-1/
[TZ]     0B prefix-2/
[TZ]     0B prefix/                 # this is wrong but is reported correctly below
mc ls --recursive alias/bucket/something/prefix
[TZ]    12B STANDARD prefix
[TZ]    12B STANDARD prefix-junk1
[TZ]    12B STANDARD prefix-junk2
[TZ]    12B STANDARD prefix-1/junk
[TZ]    12B STANDARD prefix-2/junk
[TZ]    12B STANDARD prefix/        # lists size correctly now
[TZ]    12B STANDARD prefix/junk1
[TZ]    12B STANDARD prefix/junk2
[TZ]    12B STANDARD prefixjunk

Doing a copy correctly breaks with:

mc cp --recursive alias/bucket/something/prefix .
# ....
mc: <ERROR> Failed to copy `https://host/bucket/something/prefix`. rename /tmp/prefix.part.minio /tmp/prefix: file exists

This should fail because on a File System you can't have a folder and a file named the same thing... but all the other targets are valid and they are downloaded correctly!

ls -R /tmp
/tmp/:
prefix/
prefix-1/
prefix-2/
prefix-junk1
prefix-junk2
prefixjunk

/tmp/prefix:
junk1
junk2

/tmp/prefix-1:
junk

/tmp/prefix-2:
junk

👍 👍 👍


Note:

It might be a race condition where we win (the download happens before the move and one move failing doesn't result in the rest failing) but that might change so could use a test.

You might want to downgrade the <ERROR> to a <WARNING> if there were other successful file moves, so only <ERROR> if they all fail. but this is only semantics and as long as the valid files are downloaded it doesn't really matter.


BUT

mc cp alias/bucket/something/prefix/ ./file.txt
mc: <ERROR> Unable to prepare URL for copying. To copy or move 'https://host/bucket/something/prefix/' the --recursive flag is required.

This is wrong, the something/prefix/ key is valid and contains data I want: "lorem ipsom"

Notably: aws s3 cp deals with this edge case correctly

BUTBUT:

mc actually does do this perfectly, it just has a different interface.

mc cat alias/bucket/something/prefix/
lorem ipsom

This is fantastic. I think this is a WAY better way of dealing with the edge case of a object key that ends with the separator.
If you try to deal with it in the copy interface you need to deal with uncomfortable questions like:
"what does a --recursive on this object mean? Do we include it, do we exclude it? Now how do you create a file for it!? do we rename it? do we substitute the separator?"

This makes it clear that the user needs to deal with the data in whatever way they please: pipe it into a program (|), write into a file (>), load into a variable ( = $(...) ). It is up to the user.

This ALSO makes sense with how you got the data into the prefix/ key in the first place! you had to use mc pipe which is the reciprocal for mc cat!

In aws you need to know about the stream interface on cp: echo "lorem ipsom" | aws cp - s3://bucket/something/prefix/ but in mc it's all in their own correct interfaces!

WesselAtWork commented 11 months ago

Happy to close this, but I forget how the tracking works again: Do you need to close it to indicate your changes is the result, or does it not matter?

harshavardhana commented 11 months ago

No problem we will close it.

zveinn commented 11 months ago

@WesselAtWork happy to hear everything is working as expected.