kubevirt / containerized-data-importer

Data Import Service for kubernetes, designed with kubevirt in mind.
Apache License 2.0

deprecate import of local (file://) endpoints #338

Closed by jeffvance 6 years ago

jeffvance commented 6 years ago

This is deprecation only. Actual removal of local (file://) endpoint support will come in a forthcoming issue. See PR #341.

@nellyc @aglitke @davidvossel @erinboyd

jeffvance commented 6 years ago

Deprecation notice will read:

Support for local (file://) endpoints will be removed from CDI in the next release. There is no replacement and no work-around. All import endpoints must reference http(s) or s3 endpoints.

Steps:

erinboyd commented 6 years ago

@jeffvance did we ever officially claim in a release that we supported file:///? I thought it was just stubbed for internal testing use and wasn't publicized for external consumption. If that is the case, I don't want you to have to go to extra work to remove it; we can just delete it and move on.

jeffvance commented 6 years ago

@erinboyd: we documented it in the readme and wrote tests that use it. We then documented in the readme that file:// is meant only for debugging and testing, and added that message to the logs. But we got BZs and emails showing confusion and problems using file://. Now we are deprecating file://, and it can officially be deleted once we have a test framework (for unit and e2e) that serves files/images.

jeffvance commented 6 years ago

There are still non-code steps remaining. PR #341 closed this too early.

jeffvance commented 6 years ago

See kubevirt-ansible issue 360 for the local endpoint deprecation notice.

mhenriks commented 6 years ago

After working on the uploader for a bit, I think it would be desirable to keep the file:// endpoint code for internal use. I think the uploader will need it if/when we support multipart and/or resumable uploads. To support those features, the uploader can't start using the dataStream methods directly. The uploader will have to first copy the file to a temporary location (after potentially collecting pieces of the file), then invoke the dataStream on the temporary file. Does that make sense? So, maybe we can disallow file:// in "cdi.kubevirt.io/storage.import.endpoint" annotations (and wherever else import paths are specified) but keep it internally?
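
A minimal editorial sketch of that temp-file approach, assuming the io, net/http, and os packages; handleUpload and processFile are hypothetical names, with processFile standing in for invoking the existing dataStream code on a local path:

func handleUpload(w http.ResponseWriter, r *http.Request) {
    // Collect the (possibly partial or out-of-order) upload data in a
    // temporary file first, rather than streaming it straight through.
    tmp, err := os.CreateTemp("", "upload-*.img")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer os.Remove(tmp.Name())
    defer tmp.Close()

    if _, err := io.Copy(tmp, r.Body); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // Only once the file is complete is the import logic invoked, e.g. by
    // pointing the existing dataStream code at the temporary file.
    if err := processFile(tmp.Name()); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
    }
}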

jeffvance commented 6 years ago

@davidvossel can you please comment on @mhenriks 's comments?

copejon commented 6 years ago

@mhenriks

The uploader will have to first copy the file to a temporary location (after potentially collecting pieces of the file), then invoke the dataStream on the temporary file.

I'm a little confused by this - is there a reason we can't connect the byte stream to the dataStream object as an io.Reader directly w/o having to write/read from disk?
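
For contrast, a direct-streaming handler might look roughly like this. This is a sketch only: the destination path is a placeholder, and feeding r.Body straight to a reader-based consumer assumes a dataStream variant that accepts an io.Reader, which the current code does not expose:

func handleUploadDirect(w http.ResponseWriter, r *http.Request) {
    dst, err := os.Create("/data/disk.img") // assumed PVC-backed destination
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer dst.Close()

    // r.Body is already an io.Reader, so it can feed any reader-based
    // consumer (or be copied to the destination) without a temp file.
    if _, err := io.Copy(dst, r.Body); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
    }
}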

aglitke commented 6 years ago

Hmm. Why wouldn’t the upload receiver pod just implement an http server which supports resume via offset and length http headers? This should just work out of the box with curl uploads.
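
A minimal sketch of that kind of offset-based resume handler; the Upload-Offset header name and the destination path are assumptions for illustration, not an agreed-on API, and the io, net/http, os, and strconv packages are assumed:

func handleResumableUpload(w http.ResponseWriter, r *http.Request) {
    // Resume point requested by the client; a missing header means start at 0.
    offset, err := strconv.ParseInt(r.Header.Get("Upload-Offset"), 10, 64)
    if err != nil {
        offset = 0
    }

    f, err := os.OpenFile("/data/disk.img", os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer f.Close()

    // Continue writing at the requested offset and append the new bytes.
    if _, err := f.Seek(offset, io.SeekStart); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    if _, err := io.Copy(f, r.Body); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
    }
}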


mhenriks commented 6 years ago

@copejon the dataStream code expects to read an entire file from beginning to end.

For the curl-esque upload case, this works fine. We can pass the body of a POST to the dataStream.

This does not work for multipart because we will receive parts of the file out of order and have to reconstruct the file before operating on it.

This does not work for the resumable case because the dataStream will potentially read an incomplete file.

mhenriks commented 6 years ago

@aglitke the dataStream expects to read a complete file beginning-to-end. Synchronizing the dataStream object with multiple http requests (at potentially overlapping offsets) seems complicated. Saving to a temp file is a simpler approach, and will be needed for multipart anyway.

mhenriks commented 6 years ago

Thinking about this a little more: having dataStream read from file:// is not strictly required for upload, but I think it will make things simpler. Also, it's code that already exists and seems to work, though it lacks tests. If we get rid of file://, here is how I think we can get around it for upload:

For a multipart upload, we have to support receiving data out of order. And this can be a lot of data, so it has to go to disk. If dataStream doesn't read from a single file on disk, some other reader will have to be created to read from disk and pass data in order to the dataStream. Maybe this could simply be a MultiReader that reads from the individual uploaded file parts directly, saving us from having to combine the parts into a single file. But we still have to store intermediate data on disk.
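
An illustration of that MultiReader idea, as a sketch only: partPaths is assumed to already be ordered by part number, and closing the part files is omitted for brevity.

func readerOverParts(partPaths []string) (io.Reader, error) {
    readers := make([]io.Reader, 0, len(partPaths))
    for _, p := range partPaths {
        f, err := os.Open(p)
        if err != nil {
            return nil, err
        }
        readers = append(readers, f)
    }
    // The combined reader presents the parts as one in-order stream, so the
    // consumer never sees that the data arrived in pieces.
    return io.MultiReader(readers...), nil
}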

I haven't thought through the resumable case as much, but I guess we could return an error if the client sends data ahead of the next byte we are expecting to receive. This would keep us from having to store intermediate data on disk and would allow us to operate on the dataStream directly.

aglitke commented 6 years ago

If you have to store intermediate data to disk please use the volume itself and do in place conversion. Otherwise we risk repeating the issue we had with qcow conversion. The ephemeral pod filesystem should be considered very limited and our uploads will generally require multiple GBs of space.

mhenriks commented 6 years ago

@aglitke we are veering off the topic of this PR but I'm not sure what you mean by in place conversion. We can't convert qcow2 to raw in place.

Also, qemu-img convert doesn't work on streams (standard input), so when a user uploads a qcow2 image it has to be saved to intermediate storage first.
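
For illustration, that save-then-convert step could look roughly like this; the paths are placeholders, and the only assumption about qemu-img is the standard convert -f qcow2 -O raw invocation:

func convertQcow2ToRaw(srcPath, dstPath string) error {
    // qemu-img needs a seekable source file, which is why the uploaded qcow2
    // has to land on intermediate storage before conversion.
    cmd := exec.Command("qemu-img", "convert", "-f", "qcow2", "-O", "raw", srcPath, dstPath)
    if out, err := cmd.CombinedOutput(); err != nil {
        return fmt.Errorf("qemu-img convert failed: %v: %s", err, out)
    }
    return nil
}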

I agree that we shouldn't use the container filesystem for storage, but it looks like we do. I'll fix that.

// randTmpName builds a random file name under os.TempDir(), i.e. on the
// container's ephemeral filesystem rather than the target volume.
func randTmpName(src string) string {
    ext := filepath.Ext(src)
    base := filepath.Base(src)
    base = base[:len(base)-len(ext)] // exclude extension
    randName := make([]byte, 8)
    rand.Read(randName)
    return filepath.Join(os.TempDir(), base+hex.EncodeToString(randName)+ext)
}

davidvossel commented 6 years ago

For a multipart upload, we have to support receiving data out of order.

@mhenriks do we really have to deal with receiving data out of order? I thought data ordering would only come into play once we considered resumable uploads.

From what I gathered, multipart uploads are handled over a single tcp connection as part of a single http request. Basically the http request says it's a multipart upload, there are some boundary characters set, and the client just starts streaming in data chunks separated by that boundary.

Uploading shouldn't be much different from the http import logic we already have. For imports, the pod with the pvc attached is acting as an http client to download a file. For uploads, the pod with the pvc attached is acting as an http server to receive a file.

mhenriks commented 6 years ago

@davidvossel I think you are thinking of chunked transfer encoding [1]

Multipart uploads involve several, possibly concurrent, http requests, each sending a part of a file. Like what this javascript library [2] that was mentioned in the email thread does. Services like S3 also support multipart upload in one form or another. But as far as I know, there is no standard for it.

[1] https://en.wikipedia.org/wiki/Chunked_transfer_encoding [2] https://github.com/23/resumable.js

davidvossel commented 6 years ago

@mhenriks I was talking about this https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html

It's a single http POST request. The curl command would look something like this.

curl -X POST -H "Content-Type: multipart/mixed" -F someimage=@image.iso https://our-upload-proxy/v1alpha1/multipartupload
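
On the receiving side, a sketch of handling that single multipart POST could look like this; the destination path is a placeholder, the io, net/http, and os packages are assumed, and the parts are read with the standard library's Request.MultipartReader:

func handleMultipartUpload(w http.ResponseWriter, r *http.Request) {
    mr, err := r.MultipartReader()
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    dst, err := os.Create("/data/disk.img")
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer dst.Close()

    // Parts arrive in order on the single connection, so each one can be
    // streamed to the destination as it is read.
    for {
        part, err := mr.NextPart()
        if err == io.EOF {
            break
        }
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        if _, err := io.Copy(dst, part); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
    }
}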

There's also the option of using a PUT request to upload the file in a single tcp stream.

curl -T image.iso https://our-upload-proxy/v1alpha1/upload

Multipart uploads involve several, possibly concurrent, http requests, each sending a part of a file. Like what this javascript library [2] that was mentioned in the email thread does. Services like S3 also support multipart upload in one form or another.

I see, you're talking about what happens when you run the aws cli tool to upload a large file, which uses multiple tcp streams to decrease the upload time.

But as far as I know, there is no standard for it.

Right, I don't think there is. I think something like what you're describing here is the next iteration of the upload logic. Basic single-stream uploads using a single PUT/POST request should likely be our first iteration. Once we get that working, then we should discuss both resumable uploads and this parallel upload logic.

S3 supports both too. You can use the fancy aws cli tool to do the super-efficient parallel multi-stream upload stuff, but it also works with a simple 'curl -T somefile' as well.

jeffvance commented 6 years ago

@davidvossel @mhenriks @nellyc @aglitke has this issue been discussed enough to start the last remaining step above? If so, I can create a placeholder PR to remove the code.

aglitke commented 6 years ago

@jeffvance I think you can create the placeholder PR to get this completed. We can then decide later if/when the time is right to merge that.

aglitke commented 6 years ago

@mhenriks mentioned that he no longer needs this support to implement upload. @jeffvance I think we can close this issue now.

jeffvance commented 6 years ago

ack @aglitke Will do that today.

jeffvance commented 6 years ago

@copejon @aglitke many of the importer unit tests use local (file://) endpoints. Will the new file server be a replacement for these local endpoints?