Closed WesselAtWork closed 11 months ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
It must be somewhere stat
-ing happens. The exact error is mc: <ERROR> Unable to validate source alias/bucket/folder/object-pre: Object does not exist
I think I found the actual place it's breaking on : https://github.com/minio/mc/blob/ca1d00741b954210c264effb576ef1ed8567722c/cmd/cp-url-syntax.go#L60-L76
The Input Validation step. We also see the Unable to validate source
in the error block.
Interestingly, I think cp
should be fine with prefixes! This is the listing block:
https://github.com/minio/mc/blob/ca1d00741b954210c264effb576ef1ed8567722c/cmd/cp-url.go#L176-L186
It's the only place in the copy code that I could find that is List
-ing. It's passing the parameters + the recursive flag into the function so I would expect it to behave exactly the same as mc ls -r
.
When it breaks into the listing code this looks to also be able to handle it: https://github.com/minio/mc/blob/ca1d00741b954210c264effb576ef1ed8567722c/cmd/client-fs.go#L735-L737
I think I misunderstood what was "wrong" with the process. It doesn't look like mc
is adding a leading delimiter, but that it is instead breaking on "invalid" input.
I'll try to find some time to test locally what happens if it doesn't break on the validation step.
It works! 🥇
..\mc.exe cp --recursive alias/bucket/junk/xza .
mc.exe: WARNING: Copy stat verification disabled
mc.exe: WARNING: CopySyntaxTypeC stat verification disabled
.../bucket/junk/xzazz: 5.28 MiB / 5.28 MiB [=========================================================================================================] 514.43 KiB/s 10s
Checked file lists with diff. No problems :)
Accomplished via:
// Verify if source(s) exists.
console.Infoln("WARNING: Copy stat verification disabled")
//for _, srcURL := range srcURLs {
// var err *probe.Error
// if !isRecursive {
// _, _, err = url2Stat(ctx, srcURL, versionID, false, encKeyDB, timeRef, isZip)
// } else {
// _, _, err = firstURL2Stat(ctx, srcURL, timeRef, isZip)
// }
// if err != nil {
// msg := "Unable to validate source `" + srcURL + "`"
// if versionID != "" {
// msg += " (" + versionID + ")"
// }
// msg += ": " + err.ToGoError().Error()
// console.Fatalln(msg)
// }
//}
console.Infoln("WARNING: CopySyntaxTypeC stat verification disabled")
//for _, srcURL := range srcURLs {
// c, srcContent, err := url2Stat(ctx, srcURL, "", false, keys, timeRef, isZip)
// fatalIf(err.Trace(srcURL), "Unable to stat source `"+srcURL+"`.")
//
// if srcContent.Type.IsDir() {
// // Require --recursive flag if we are copying a directory
// if !isRecursive {
// operation := "copy"
// if isMvCmd {
// operation = "move"
// }
// fatalIf(errInvalidArgument().Trace(srcURL), fmt.Sprintf("To %v a folder requires --recursive flag.", operation))
// }
//
// // Check if we are going to copy a directory into itself
// if isURLContains(srcURL, tgtURL, string(c.GetURL().Separator)) {
// operation := "Copying"
// if isMvCmd {
// operation = "Moving"
// }
// fatalIf(errInvalidArgument().Trace(), fmt.Sprintf("%v a folder into itself is not allowed.", operation))
// }
// }
//}
This was just a quick test to see if the copy code could cope if I disabled the stat checks.
Obviously some work is required but at least it is confirmed: copy can do prefixes
There was a noticeable speed up in the execution of
mc
.Normally it hangs for a few seconds and then it starts running.
I think the typeC cheeking block is slow. It needs to iterate through the entire source list (potentially 1000s of objects) before execution can continue, and this doesn't look like it's set up to be parallel.
The typeC block only checks one item with a HEAD and a LIST so it might just have been confirmation bias from my side
You don't need to disable the first block Copy Stat verification it's not the issue
The problem is in the second block.
Specifically in the url2Stat
function which calls the s3 client Stat
function which contains this line
https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1620
It looks like mc
is force checking delimited prefixes.
The correct thing to do would be to remove this line
Can you help me define what these terms mean in minio?
They way I understand it in minio/s3 systems the object name is the entire path starting at the root of the bucket. There is nothing special about the /
, it is just a character that we like to use because it reminds us of how files systems work.
The reason I bring this up is because there are 2 stat2url function that behave subtly differently and confusing comments that seems to refer to conflicting things.
What is the Stat() function supposed to do?
The way I interpret the code:
The problem is that most of the comments use the terms directories and prefixes interchangeably
https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1564-L1566 https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1596-L1602 https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1604-L1605 https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1619-L1620 https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1633-L1636
But the INTENT of the code is only to determine if it is a Object Key OR Object Directory AND NOT Object Prefix.
To make this code do what the comments imply is actually quite simple.
All you need to do is remove one line: https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-s3.go#L1620
Now this code will work with prefixes, directories, and objects.
There is only one issue: You can't tell a folder/
apart from a pre
These will both return as isDir
objects.
To me this is correct:
If I have this object structure
bucket/something.txt
bucket/something.png
bucket/something
bucket/something/esle.txt
bucket/something/else.txt
I can query it in multiple ways:
mc ls bucket/something
-> will return everythingmc ls bucket/something/
-> will return else.txt
and esle.txt
mc ls bucket/something.
-> will return something.txt
and something.png
Copying now becomes:
mc cp --recursive bucket/something .
-> will get everythingmc cp --recursive bucket/something/ .
-> will get else.txt
and esle.txt
mc cp --recursive bucket/something. .
-> will get something.txt
and something.png
mc cp bucket/something .
-> will only get the something
objectThe prefix bucket/something
acts like a "Directory" holding the other files. It's just not a traditional fs directory.
I feel like this is a valid way of talking about prefixes.
If you want to talk about the /
directory you need to be explicit.
The tests pass so why can't I remove that line?
It seems the code is aware of this subtle difference of how Stat and List behave.
As you can see, when isRecursive
is flagged, we use a different function. Why? The description of the functions are more-or-less the same?
The description of url2stat
seem to imply that it is a valid way to use it?
https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-url.go#L188
It's not so different from firsturl2stat
https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/client-url.go#L204
The answer is that it miss uses of the "prefix" term.
The way the cmd is set up in general (you can find this isRecursive check everywhere) it uses url2stat
to refer to fs constructs (files<->objects and folders<->directories) and it uses firsturl2stat
to infer to minio constructs (prefixes)
Well the original issue was with the Type C Validity check.
Following the crumbs back we get to this strategy method
https://github.com/minio/mc/blob/bf3924b58341eb7a71785653a29bf26ca9fac95e/cmd/cp-url.go#L54-L90
Which uses the isRecursive / firstURL2Stat check.
The problem is then the Type C Validity check in cp-url-syntax
doesn't use the same validity check that thecheckCopySyntax
and the guessCopyURLType
are using leading to it erroneously disregarding valid input.
My PR will target this function.
This leads to a pretty funny workaround
Because the code can deal with prefixes and the only thing stopping mc
is the checkCopySyntaxTypeC
validity check
if your-prefix+delimiter
is a valid prefix, mc will continue
.
├── my-prefix-a.txt
├── my-prefix-b.txt
├── my-prefix-c.txt
└── my-prefix-d.txt
If you try to mc cp --recursive bucket/my-prefix .
it will error with <ERROR> Unable to stat source 'bucket/my-prefix' Object does not exist.
but if you make my-prefix
a "valid" target you can trick mc
into downloading the files
.
├── my-prefix-a.txt
├── my-prefix-b.txt
├── my-prefix-c.txt
├── my-prefix-d
└── my-prefix/
└── anything.txt
Now mc cp --recursive bucket/my-prefix .
will succeed and you will get your files :^)
As long as mc can validate the bucket/my-prefix/
path it will work
Workaround shell script (untested but gets the idea across)
mc_force_prefix() {
echo "" | mc cp - "$1/tmp" &&
mc cp --recursive "$1" "$2" &&
mc rm "$1/tmp" &&
rm "$2/tmp"
}
@WesselAtWork any chance you can try the latest release and see if this is still a problem ? I recently refactored these methods in a major way and since it's still fresh in my head I would be more then willing to have another look.
Using master (b82f6c60b21306577864c08097bd3983bf28fda2)
mc cp --recursive alias/bucket/junk/xza .
.../bucket/junk/xzazz: 5.28 MiB / 5.28 MiB ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 445.26 KiB/s 12s
Phenomenal! It works fantastic. 🥇
Testing it with a "difficult edge case key list"
object name -- data content # comments
bucket/something/prefix -- "lorem ipsom"
bucket/something/prefixjunk -- "lorem ipsom"
bucket/something/prefix/ -- "lorem ipsom" # '/' is arbitrary, this is a valid object name that contains data
bucket/something/prefix/junk1 -- "lorem ipsom"
bucket/something/prefix/junk2 -- "lorem ipsom"
bucket/something/prefix-junk1 -- "lorem ipsom"
bucket/something/prefix-junk2 -- "lorem ipsom"
bucket/something/prefix-1/junk -- "lorem ipsom" # note: a 0B prefix-1/ was not manually created
bucket/something/prefix-2/junk -- "lorem ipsom" # note: a 0B prefix-2/ was not manually created
Listing this results in:
mc ls alias/bucket/something/prefix
[TZ] 12B STANDARD prefix
[TZ] 12B STANDARD prefix-junk1
[TZ] 12B STANDARD prefix-junk2
[TZ] 12B STANDARD prefixjunk
[TZ] 0B prefix-1/
[TZ] 0B prefix-2/
[TZ] 0B prefix/ # this is wrong but is reported correctly below
mc ls --recursive alias/bucket/something/prefix
[TZ] 12B STANDARD prefix
[TZ] 12B STANDARD prefix-junk1
[TZ] 12B STANDARD prefix-junk2
[TZ] 12B STANDARD prefix-1/junk
[TZ] 12B STANDARD prefix-2/junk
[TZ] 12B STANDARD prefix/ # lists size correctly now
[TZ] 12B STANDARD prefix/junk1
[TZ] 12B STANDARD prefix/junk2
[TZ] 12B STANDARD prefixjunk
Doing a copy correctly breaks with:
mc cp --recursive alias/bucket/something/prefix .
# ....
mc: <ERROR> Failed to copy `https://host/bucket/something/prefix`. rename /tmp/prefix.part.minio /tmp/prefix: file exists
This should fail because on a File System you can't have a folder and a file named the same thing... but all the other targets are valid and they are downloaded correctly!
ls -R /tmp
/tmp/:
prefix/
prefix-1/
prefix-2/
prefix-junk1
prefix-junk2
prefixjunk
/tmp/prefix:
junk1
junk2
/tmp/prefix-1:
junk
/tmp/prefix-2:
junk
👍 👍 👍
Note:
It might be a race condition where we win (the download happens before the move and one move failing doesn't result in the rest failing) but that might change so could use a test.
You might want to downgrade the
<ERROR>
to a<WARNING>
if there were other successful file moves, so only<ERROR>
if they all fail. but this is only semantics and as long as the valid files are downloaded it doesn't really matter.
mc cp alias/bucket/something/prefix/ ./file.txt
mc: <ERROR> Unable to prepare URL for copying. To copy or move 'https://host/bucket/something/prefix/' the --recursive flag is required.
This is wrong, the something/prefix/
key is valid and contains data I want: "lorem ipsom"
Notably:
aws s3 cp
deals with this edge case correctly
mc
actually does do this perfectly, it just has a different interface.
mc cat alias/bucket/something/prefix/
lorem ipsom
This is fantastic. I think this is a WAY better way of dealing with the edge case of a object key that ends with the separator
.
If you try to deal with it in the copy
interface you need to deal with uncomfortable questions like:
"what does a --recursive
on this object mean? Do we include it, do we exclude it? Now how do you create a file for it!? do we rename it? do we substitute the separator?"
This makes it clear that the user needs to deal with the data in whatever way they please: pipe it into a program (|
), write into a file (>
), load into a variable ( = $(...)
). It is up to the user.
This ALSO makes sense with how you got the data into the prefix/
key in the first place! you had to use mc pipe
which is the reciprocal for mc cat
!
In aws you need to know about the stream interface on cp:
echo "lorem ipsom" | aws cp - s3://bucket/something/prefix/
but inmc
it's all in their own correct interfaces!
Happy to close this, but I forget how the tracking works again: Do you need to close it to indicate your changes is the result, or does it not matter?
No problem we will close it.
@WesselAtWork happy to hear everything is working as expected.
Expected behaviour
I expected
mc cp --recursive bucket/directory/prefix- .
to recursively copy all the files/directory/prefix-*
into the local directory.Actual behaviour
Instead
mc
assumed that I meant to add a leading/
to the copy query (mc cp --recursive bucket/directory/prefix-/
) and added it for me.I could not find a flag or documentation on do a prefix copy and noticed that most everything that refers to a PATH is effected (
mv
for instance )I was surprised because listing behaved differently:
mc ls bucket/directory/prefix-
prints the/directory/prefix-*
files correctly.Steps to reproduce the behaviour
Attempt to refer to a prefix instead of a directory when using
mc cp
ormc mv
Did I misunderstand prefixes?
mc --version
System information
Using the container image
Workaround
Maybe actually the intended way but I am unsure of how heavy this request is. Image you have 10 000 objects in the bucket, does it list the entire bucket before it
finds
?mc find bucket/directory --name "prefix-*" --exec "mc cp {} ."