oras-project / oras

OCI registry client - managing content like artifacts, images, packages
https://oras.land
Apache License 2.0

support `--output -` to pull to stdout #346

Open ndeloof opened 2 years ago

ndeloof commented 2 years ago

My use case is to rely on the oras CLI to restore a data cache stored as a tar.gz. On pull, I'd like to pipe the downloaded artifact directly to `tar xz`.
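Something like this hypothetical invocation (the `--output -` flag is what is being proposed here; registry and repo names are made up):

oras pull registry.example.com/build-cache:v1 --output - | tar xz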

SteveLasker commented 2 years ago

Looks super interesting. Please open a PR for the proposal.

shizhMSFT commented 2 years ago

This is an interesting one. How would you handle multiple files?

FeynmanZhou commented 2 years ago

restore a data cache

Hi @ndeloof ,

Could you please elaborate on your use case?

Actually, ORAS CLI v0.15 will provide `oras manifest fetch` and `oras blob fetch`, which might meet your need. You can check out this doc for details.
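For illustration, both commands write to stdout, so they can be combined like this (hypothetical repository; the digest placeholder must be taken from the fetched manifest):

oras manifest fetch localhost:5000/cache:v1               # prints the manifest JSON
oras blob fetch -o - localhost:5000/cache@sha256:<digest> # streams a single blob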

TerryHowe commented 1 year ago

Seems like the output of this should be a tgz so that multiple files can be supported.

shizhMSFT commented 1 year ago

One UX could be: oras pull localhost:5000/json-artifact:v1 --output - | jq, with oras returning an error if there are multiple blobs associated with the target manifest.

ProbstDJakob commented 1 year ago

Are there any plans to also implement this for input, so that something like the following becomes possible:

command-a | command-b | oras push localhost:5000/json-artifact:v1 -

...

oras pull localhost:5000/json-artifact:v1 --output - | jq
qweeah commented 1 year ago

@ProbstDJakob There is a plan to provide a piped-command user experience in v1.2.0 by standardizing the output; see https://github.com/oras-project/oras/issues/638

Still, I have questions about the commands below:

command-a | command-b | oras push localhost:5000/json-artifact:v1 - 

Since the layer content comes from stdin, not a file: 1) What is the file name of the generated layer? 2) How should we name the layer if the user runs oras pull localhost:5000/json-artifact:v1?

oras pull localhost:5000/json-artifact:v1 --output - | jq

What if localhost:5000/json-artifact:v1 contains multiple layers?

ProbstDJakob commented 1 year ago

oras push could receive an additional option as follows:

--from-stdin[=file-path[:type]]
    oras will read data from stdin and write it to `file-path` within the image. If `file-path` has not
    been supplied, it defaults to `./stdin.blob` with the type `application/octet-stream`. This option can
    be used in conjunction with other files supplied via `<file>[:type] [...]`, but does not need to be.
    The only exception is that no additional `-` file may be supplied.

...

<file>[:type] [...]
    The files to include within the image. The special file `-[:type]` is equivalent to using the option
    `--from-stdin=./stdin.blob[:type]`, where if no type has been supplied the type
    `application/octet-stream` will be used.

Regarding your second question, I am not that familiar with how OCI images work, so I am currently unable to answer it, but I am willing to study the docs and elaborate further if the answer above doesn't solve it implicitly.

For oras pull there might be a similar option:

--to-stdout[:<single|tar>][=file-path,...]
    Instead of writing the content of the image to a directory, the content will be written to stdout.

    When supplying `--to-stdout:single[=file-path]`, the file found at `file-path` within the image will be
    written to stdout without converting it to an archive. If no `file-path` has been supplied and the image
    contains exactly one file, that file will be written out; otherwise the command will fail. If more than
    one `file-path` has been supplied, the command will also fail.

    When supplying `--to-stdout:tar[=file-path,...]`, the files found at `file-path,...` will be written to
    stdout combined into an uncompressed tar archive. If no files have been supplied, all files within the
    image will be included in the archive.

    Aliases:
    `--to-stdout=<file-path>` => `--to-stdout:single=<file-path>`
    `--to-stdout=<file-path,...>` => `--to-stdout:tar=<file-path,...>`
    `--to-stdout` => `--to-stdout:single`

    This option is mutually exclusive with the `--output` option.

Regarding the penultimate line, I am not quite confident that defaulting to single is the right choice, but I think most users would try to pipe a single file rather than a whole archive.
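For illustration, hypothetical invocations under this proposal (neither flag exists today) could look like:

command-a | oras push localhost:5000/json-artifact:v1 --from-stdin=./data.json:application/json
oras pull localhost:5000/json-artifact:v1 --to-stdout:tar | tar -x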

guettli commented 11 months ago

Just for the record, I found this solution to stream the content of an artifact to stdout:

oras blob fetch -o- ghcr.io/foo/test@$(oras manifest fetch ghcr.io/foo/test:0.0.1 | yq '.layers[0].digest')

I pushed the tgz like this:

oras push ghcr.io/foo/test:0.0.1 --artifact-type application/vnd.foo.machine-image.v1 image.tgz

This solves my use case, but it would be great to do that without yq (in a single oras call).

qweeah commented 11 months ago

I pushed the tgz like this:

oras push ghcr.io/foo/test:0.0.1 --artifact-type application/vnd.foo.machine-image.v1 image.tgz

@guettli This is very interesting. May I know what's stored inside the image.tgz and how it is generated?

If you provide a folder instead of a file, oras push can pack it and oras pull can unpack it automatically. If your end-to-end scenario fits into this, you may try

oras push ghcr.io/foo/test:0.0.1 --artifact-type application/vnd.foo.machine-image.v1 image # pack and push all files in folder image
oras pull ghcr.io/foo/test:0.0.1 -o pulled # pull and unpack files into folder pulled/image
guettli commented 11 months ago

@qweeah thank you for asking. The tgz contains a Linux root file system. We booted Ubuntu on a VM, installed some tools, applied some configuration, and then created a tgz, so that we have a constant custom image. The image is about 1.8 GB and contains 100k files.

I am happy to store the tgz as a blob in an artifact. Nice to know that oras could be used for tar/untar, too. But at the moment I don't see a big benefit.

One drawback of the current method: we can't create the artifact via streaming. AFAIK something like this is not supported yet:

tar -czf- .... | oras push ...

@qweeah what benefit would we have if we used oras instead of tar/untar?

qweeah commented 11 months ago

Before uploading any blob to a registry, its digest must be specified.

Unless you can get the digest before archiving is done, streamed uploading is not possible.
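A minimal Go sketch of the constraint: the digest is a function of every byte, so it is only known after the whole stream has been consumed; by then the bytes are gone unless they were buffered somewhere re-readable.

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

func main() {
	h := sha256.New()
	// io.Copy consumes stdin completely; the digest is only
	// available after the final byte has been hashed.
	n, err := io.Copy(h, os.Stdin)
	if err != nil {
		panic(err)
	}
	fmt.Printf("sha256:%x (%d bytes)\n", h.Sum(nil), n)
}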

qweeah commented 11 months ago

@qweeah what benefit would we have if we used oras instead of tar/untar?

Well, rather than using oras manifest fetch + oras blob fetch, you can use a single oras pull command to do the pulling.

ProbstDJakob commented 11 months ago

Before uploading any blob to a registry, its digest must be specified.

Unless you can get the digest before archiving is done, streamed uploading is not possible.

To circumvent this, oras could buffer the input stream in memory up to, for example, 64 MiB. Once this threshold is reached, oras pauses reading new input, first writes the 64 MiB to a temporary file with narrow access rights, and then resumes reading from the input, piping it directly into the file. After reaching EOF, oras could calculate the digest either from the in-memory buffer or, if the content was too large, from the file, then pack it and upload the image.

The buffering in memory would only be for performance (and security) reasons and would mostly be a nice-to-have feature.

qweeah commented 11 months ago

Before uploading any blob to a registry, its digest must be specified. Unless you can get the digest before archiving is done, streamed uploading is not possible.

To circumvent this, oras could buffer the input stream in memory up to, for example, 64 MiB. Once this threshold is reached, oras pauses reading new input, first writes the 64 MiB to a temporary file with narrow access rights, and then resumes reading from the input, piping it directly into the file. After reaching EOF, oras could calculate the digest either from the in-memory buffer or, if the content was too large, from the file, then pack it and upload the image.

The buffering in memory would only be for performance (and security) reasons and would mostly be a nice-to-have feature.

It's not something oras can circumvent; you cannot get the checksum of the blob before tar finishes writing.

ProbstDJakob commented 11 months ago

I know; that is why I proposed the solution with buffering/writing to a temporary file. That way the digest calculation can be done after tar finishes, without having to create/delete a temporary file oneself, and streaming is therefore supported.

qweeah commented 11 months ago

Yes, the digest calculation can be done while packing, and this optimization has already been applied in oras-go.

The question is that after getting the digest, the oras CLI still needs to go through the archive file again to upload it.
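In other words, even with on-the-fly digest calculation, pushing spooled stdin content amounts to two passes over the same bytes. A rough Go sketch (the upload function is a hypothetical stand-in for the registry client; error handling omitted):

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// upload is a hypothetical stand-in for the registry client call.
func upload(r io.Reader, digest string) { fmt.Println("uploading", digest) }

func main() {
	// pass 1: spool stdin to a temp file, hashing on the fly
	f, _ := os.CreateTemp("", "oras-stdin-")
	h := sha256.New()
	io.Copy(io.MultiWriter(f, h), os.Stdin)

	// pass 2: go through the spooled bytes again to upload them
	f.Seek(0, io.SeekStart)
	upload(f, fmt.Sprintf("sha256:%x", h.Sum(nil)))
}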

ProbstDJakob commented 11 months ago

Sorry for the late response. Maybe I do not know enough about how oras works, but wouldn't the proposed solution be equivalent to supplying files as arguments, except that instead of including the files from the arguments, the only file to include is the buffer/temporary file?

Maybe the following sketch will help you understand my suggestion (written as Go, error handling omitted):

package main

import (
	"bytes"
	"io"
	"os"
)

func main() {
	// buffer up to 64 MiB of stdin in memory
	buf := make([]byte, 64<<20)
	n, _ := io.ReadFull(os.Stdin, buf)

	var inputData io.Reader
	if n < len(buf) {
		// stdin fit entirely into the in-memory buffer
		inputData = bytes.NewReader(buf[:n])
	} else {
		// spill over: write the buffer to a temp file (created with narrow
		// permissions), then stream the rest of stdin straight into it
		tmpFile, _ := os.CreateTemp("", "oras-stdin-")
		tmpFile.Write(buf)
		io.Copy(tmpFile, os.Stdin)
		tmpFile.Seek(0, io.SeekStart)
		inputData = tmpFile
	}

	// call oras push registry.example:5000 with inputData (yes, the CLI is
	// not able to accept readers, but I hope you get what I intend to say)
	_ = inputData
}
qweeah commented 11 months ago

@ProbstDJakob Apart from the seek operation, what you described is already implemented here.

P.S. I think this discussion has drifted too far from this issue, so I have created https://github.com/oras-project/oras/issues/1200 and we can continue there.

ProbstDJakob commented 11 months ago

The following script is a real world example where streaming could come in handy.

Background

We fully manage the life cycle of an OpenShift cluster via a GitLab pipeline. When creating a cluster with the openshift-install tool, some files like the Terraform state and kubeconfigs are created. Those files are needed during the whole life cycle of the cluster (not only in the current pipeline), so they need to be stored persistently. In our case we use the existing GitLab registry and oras to create an image.

Current way to pull the artifacts from the registry

#!/usr/bin/env sh
set -eu

# [...] some preparations

tempDir="$(mktemp -d)"
oras pull --output "$tempDir" "$ENCRYPTED_OPENSHIFT_INSTALL_ARTIFACTS_IMAGE"
sops --decrypt --input-type binary --output-type binary "$tempDir/openshift-install-artifacts.tar.gz.enc" \
  | tar -xzC "$CI_PROJECT_DIR"
rm -rf "$tempDir"

Possible way to pull the artifacts from the registry with pipelining

#!/usr/bin/env sh
set -eu

# [...] some preparations

oras pull --output - "$ENCRYPTED_OPENSHIFT_INSTALL_ARTIFACTS_IMAGE" \
  | sops --decrypt --input-type binary --output-type binary /dev/stdin \
  | tar -xzC "$CI_PROJECT_DIR"

This way there is no need to create a temporary directory or to know what the file is called within the image (not a problem for us, since we named it within the same repo).

Counterpart

See https://github.com/oras-project/oras/issues/1200#issuecomment-1849733309