[github] method to download URLs from github

chantra commented 1 year ago

structs such as [Artifact[(https://docs.rs/octorust/latest/octorust/types/struct.Artifact.html) have a field where we can download the artifact.

This download MUST be authenticated or one would get the following error:

{
  "message": "You must have the actions scope to download artifacts.",
  "documentation_url": "https://docs.github.com/rest/reference/actions#download-an-artifact"
}

The Client already has all the details and logic needed for performing an authenticated request. It would be useful to re-use this and expose it to the application using the crate in some ways.

This change adds a naive get_url method that given a URL, will download the content and return the raw bytes.

chantra commented 1 year ago

Hi @augustuswm ,

This is a mere RFC but I am looking for feedback.

The core of the problem is explained in the diff summary. Here I did not bother with templating the code and just changed the code which is normally generated in order to play with it and validate that it works.

I did not handle httpcache as it seems it was assuming JSON payload, and again, for a POC, it was just the easy way.
To not leak creds, the method should check that the host in the url matches the Client's hosthost_override or error.
should probably implement a streamed version using https://docs.rs/reqwest/latest/reqwest/struct.Response.html#method.bytes_stream instead.

Anyway, looking for feedback. Thanks

augustuswm commented 1 year ago

My initial thought is that this would be better implemented as something like:

#[async_trait]
pub trait ArtifactExt {
  async fn download(&self) -> Result<Bytes, Error>;
}

impl ArtifactExt for Artifact {
  // ...
}

This would be a trait that exists in the client, but outside of the generated code. Some of the other clients here provide similar extension traits to expose functionality that is not explicitly defined by a spec file.

chantra commented 1 year ago

thanks for the feedback. I think we will still need a download method from within the client though, but it wont need to be exposed publicly.

chantra commented 1 year ago

As I was going down the rabbit-hole, I realize I had been totally oblivious to https://docs.rs/octorust/latest/octorust/actions/struct.Actions.html#method.download_artifact .... :facepalm:

chantra commented 1 year ago

So, currently this returns nothing...

The spec says (if I am reading it properly), that we expect a 302 and to look into the header's Location.

 "responses": {                                                          
   "302": {                                                              
     "description": "Response",                                          
     "headers": {                                                        
       "Location": {                                                     
         "$ref": "#/components/headers/location"                         
       }                                                                 
     }                                                                   
   }                                                                     
 },

location being defined as:

  "location": {                                                             
    "example": "https://pipelines.actions.githubusercontent.com/OhgS4QRKqmgx7bKC27GKU83jnQjyeqG8oIMTge8eqtheppcmw8/_apis/pipelines/1/runs/176/signedlogcontent?urlExpires=2020-01-24T18%3A10%3A31.5729946Z&urlSigningMethod=HMACV1&urlSignature=agG73JakPYkHrh06seAkvmH7rBR4Ji4c2%2B6a2ejYh3E%3D",
    "schema": {                                                             
      "type": "string"                                                      
    }

seems to be defined in https://github.com/OAI/OpenAPI-Specification/blob/main/versions/3.0.1.md#header-object

Now, looking into fn get<D>() where D: serde::de::DeserializeOwned + 'static + Send,, wehich then forwards to get_media which then forwards to request_entity, which itself forward to request . The latter follows redirect, so we eventually get the full payload in https://github.com/oxidecomputer/third-party-api-clients/blob/5480bb899ba9e36a43ced97303ed81e171f180e1/github/src/lib.rs#L557 but returns an empty payload because it does not have the right type ID, it returns here: https://github.com/oxidecomputer/third-party-api-clients/blob/5480bb899ba9e36a43ced97303ed81e171f180e1/github/src/lib.rs#L579

I suppose to go by the spec, we would need to return the Location URL. But as is, the get method and co have a strong assumptions about the content being json that can be deserialized into a GH type.

What do you reckon would be the best approach?

augustuswm commented 1 year ago

That all tracks, and is an unfortunate side effect of the generate methods returning Result<T> where T is the body type. The spec doesn't define a body type, so it makes sense that the generated method returns Result<()>. There are a few options:

Change all methods to return Result<Response<T>> - T being the deserialized body. I think this is the correct answer, but it is a change across all methods. Given the change from anyhow to a specific error type, it might be worth doing this all at once.

This would look like:

struct Response<T> {
  status: StatusCode,
  headers: HeaderMap,
  body: T,
}

Add new methods for endpoints that return informational headers only. This would be non-breaking, but leaves essentially useless methods in.

Given the upcoming error change, I think option one is what should be done. I can try to set aside some time for next week to assist on this.

chantra commented 1 year ago

Ok. Thanks for the quick reply.

I will very likely assume #57 to land roughly as is and work on top of it then. I will try to allocate some time next week to look into your suggestion.

chantra commented 1 year ago

Ok... so I looked into it and it is complicated :smile:

I read up a bit on https://swagger.io/docs/specification/describing-responses/, and it seems that to properly handling headers through get_response_type and get_fn_inner is non-trivial.

I took a slightly different approach in chantra/oxidecomputer-third-party-api-clients#1 (which applies on top of #57 ). The idea is that if we don't find a "response body" in the spec, we attempt to get a "response header".

Quickly checking, it seems only github sub-project is using this. And it also seem to be only using this for the "Location" case.

I did try this patch on a local fork and I successfully got the URL which I was able to subsequently curl.

One thing to note is that reqwest default to follow redirects. For this to work, redirect should be disabled with something similar to:

let http_builder = reqwest::Client::builder().redirect(reqwest::redirect::Policy::none());

so, this would not work out of the box with the non-custom github clients.

I disabled redirects, and it works fine on my limited use case.

augustuswm commented 1 year ago

My apologies, I probably confused this more than needed, but the core idea is that the client would not actually do anything special with responses.[status].header specifications. Maybe in the future we could add some kind of validation, but for now all methods would return all of the headers that were received from the call.

To describe it concretely, the download_artifact method would return:

let response: ClientResult<Response<()>> = // ...download_artifact(...).await;

It would then be the responsibility of the caller to unwrap the result, and call:

response.headers.get("location")

To this specific PR, we would then implement:

#[async_trait]
pub trait ArtifactExt {
  async fn download_contents(
    &self,
    owner: &str,
    repo: &str,
    artifact_id: i64,
    archive_format: &str,) -> ClientResult<Response<Bytes>>;
}

impl ArtifactExt for Artifact {
  async fn download_contents(
    &self,
    owner: &str,
    repo: &str,
    artifact_id: i64,
    archive_format: &str,) -> ClientResult<Response<Bytes>> {
    let response = self.download_artifact(owner, repo, artifact_id, archive_format).await?;
    let location = response.headers.get("location").ok_or_else(|| ClientError::FailedToFindDownloadUrl)?;
    // ... do operations to download the raw url (this needs something like get_url from this PR)
  }
}

I started a bit on this work, but going to wait until the ClientError changes are in to put together a PR.

chantra commented 1 year ago

I started a bit on this work, but going to wait until the ClientError changes are in to put together a PR.

Ok thanks. I did push an update to #57 which hopefully addresses your remaining feedback.

Just keep in mind the redirect issue. It is not clear to me if it is fine to disable automatic following of redirects globally.

The policy is applied to the reqwest::Client, the easiest would be to use policy redirect::Policy::none in https://docs.rs/octorust/0.3.0/src/octorust/lib.rs.html#302 . An alternative way is to, if redirecting URLs are known, to have a custom policy that returns attempt.stop() when the URL matches: https://docs.rs/reqwest/latest/reqwest/redirect/struct.Policy.html#example

augustuswm commented 1 year ago

I haven't done any kind of testing yet, but if you want to look at the response-type branch it has an initial sketch of this now.

chantra commented 1 year ago

I haven't done any kind of testing yet, but if you want to look at the response-type branch it has an initial sketch of this now.

Thanks @augustuswm

LGTM, can't test right know as I would need to change a bunch of code to compile, but I have a couple of feedback:

1) one thing re the trait.

Multiple objects may need to perform this. When I tested https://github.com/chantra/oxidecomputer-third-party-api-clients/pull/1, particularly this commit: https://github.com/chantra/oxidecomputer-third-party-api-clients/pull/1/commits/a23065d9dedc49aba8280940386061bd58cf2cdb

showed that

Methods such as download_artifact, download_job_logs_for_workflow_run, download_workflow_run_logs, download_tarball_archive, and download_zipball_archive return a redirect URL to get the content from.

So probably it cannot be a "Action" based trait. It can't even be object based as for instance, Repo has 1 method to download tarball, the other one to download zip. So one would need to write each individual method.

2) Also, I don't think the Client should be re-used to actually download the content. Those method return a Location to follow to fetch the content. The URL that is returned is not under the domain api.github.com anymore. Here is a redacted URL it returns: https://pipelines.actions.githubusercontent.com/serviceHosts/XXXXXX/_apis/pipelines/XXX/runs/XXXX/signedartifactscontent?artifactName=XXXX&urlExpires=SOMEDATE&urlSigningMethod=HMACV2&urlSignature=XXXX

As is, the authentication would be sent to this new domain which should not happen.

I think it may just be better to have the app just return the location where to get the content and leave the app itself download the content. As much as the name of the API endpoint is probably not the most accurate, the behaviour documented is as such too (e.g return an expiring URL to download the content) https://docs.github.com/en/rest/actions/artifacts?apiVersion=2022-11-28#download-an-artifact.

I think I am not too far off what may need to be done in https://github.com/chantra/oxidecomputer-third-party-api-clients/pull/1

e.g detect if the method should return a body response, if not, attempt to return a header.

The only thing I am afraid is that a response may contain a response body and response headers and then it becomes unclear what to use and I did not managed to find any authoritative answer for that yet.

Actually https://docs.github.com/en/rest/gists/gists?apiVersion=2022-11-28#create-a-gist does return both content and header, the header is a Location redirect (which makes me think that the API probably does not work as expected if it follows redirects currently as you would lose the return object with all its useful information.

augustuswm commented 1 year ago

Got it, I missed that it was returning signed urls. I agree that it should be left to the caller to perform the actual download, and we should not provide extensions.

I am fairly confident it is incorrect for the client to be following redirects at all. There is nothing in the spec file to specify what the return type of a function that represents redirected endpoint should be. Even worse if an endpoint specified both a redirect code and a body type, then the function would follow the redirect and try to deserialize the new body into a potentially incompatible type. This is all in addition to the point that it is valid to return both a redirect status code along with a response body, which would be lost if redirects were followed. It may not be common, but it isn't invalid.

As a result I think the client should be returning both the response body and the response headers to the caller. One possibility is that the client could try to validate that the headers returned match any of the headers specified for the route, but I'm not sure how useful that is.

chantra commented 1 year ago

As a result I think the client should be returning both the response body and the response headers to the caller. One possibility is that the client could try to validate that the headers returned match any of the headers specified for the route, but I'm not sure how useful that is.

SGTM. I suppose it would be more ergonomic for the high-level user API to still only get the content and not have to care about fetching either the body or some header content. e.g the current API won't change as much as the user is concerned (aside for the download_* methods).

chantra commented 1 year ago

I am having difficulties with our internal toolchain to upgrade to the latest octorust version (due to other dependencies). But I will try to validate this using a simple example program :)

chantra commented 1 year ago

ok, using https://gist.github.com/chantra/e0f07534cf4306a1fdbd68c80bebac52 I can get the artifact signed url.

chantra commented 1 year ago

Hi @augustuswm just checking if there is anything else you want me to test/do for this change?

augustuswm commented 1 year ago

Not from my side. I still need to do some work on making sure the return signatures actually make sense and work across the rest of the library.

chantra commented 1 year ago

Hi @augustuswm , checking in to see how this is going. If there is anything I can do to help, like some changes this change would depend on, or some refactoring that may be needed, let me know I will happily go and tackle it.

augustuswm commented 1 year ago

Thanks, the main things that I still need to get done for this are:

Reconcile the anyhow branch with main. There are a few fixes on that branch that need to get included. (And then bring response-type up to main)
Start a changelog file. Ideally this would have been in place before, but better late than never. I would like to get an explanation of the return type changes documented.
Go through doc comments to ensure they reference the new Response type and the structured error types.

This will likely all get built out on response-type branch.

augustuswm commented 1 year ago

Decided to combine this all on a new branch that can be found here https://github.com/oxidecomputer/third-party-api-clients/tree/rc-build . I still need to go through comments and examples to ensure that they still make sense.

chantra commented 1 year ago

I am assuming this is #75 . I will try to apply this on my project and report back. Thanks for doing this!

chantra commented 1 year ago

Closing this has #75 solves this problem. Thanks.

chantra commented 4 months ago

@augustuswm I just had a case where GH willingly redirects. For example, when trying to fetch a commit from a repo,

$ curl https://api.github.com/repos/iovisor/bpftrace/commits/c3169049b6cf96f6994a5e437cb14bafdb05e4bf
{
  "message": "Moved Permanently",
  "url": "https://api.github.com/repositories/146842101/commits/c3169049b6cf96f6994a5e437cb14bafdb05e4bf",
  "documentation_url": "https://docs.github.com/rest/guides/best-practices-for-using-the-rest-api#follow-redirects"
}

I will work around this by only disabling redirects for artifact downloads, but I wonder what would be a more sustainable way to do it for the library as a whole.

chantra commented 4 months ago

So, this is the kind of policy I ended up putting:


// A regex to match the path of the artifact download endpoint.
fn no_redirect_policy_path_matcher() -> Result<Regex, regex::Error> {
    // https://docs.github.com/en/rest/actions/artifacts?apiVersion=2022-11-28#download-an-artifact
    // At the time of this writting
    // > The :archive_format must be zip
    Regex::new(r"^/repos/[^/]+/[^/]+/actions/artifacts/[0-9]+/zip$")
}

// We disable redirects only for artifact downloads as we get
// the location to download the artifact through a Location header redirect.
// While I wished we could disable redirect wholesale, real life example has shown that
// this is not possible because GH may decide to redirects any endpoints. See D57763542 and
// https://github.com/oxidecomputer/third-party-api-clients/pull/62#issuecomment-2129936411
fn gh_redirect_policy() -> Result<reqwest::redirect::Policy, GhClientError> {
    let no_redirect_path_matcher = no_redirect_policy_path_matcher()?;
    let policy = reqwest::redirect::Policy::custom(move |attempt| {
        tracing::debug!("Attempting to redirect {:?}", attempt);
        // We only care about *not* following redirects when we last URL we are coming from
        // matches our regex.
        let previous = attempt.previous();
        if let Some(previous) = previous.last() {
            if no_redirect_path_matcher.is_match(previous.path()) {
                tracing::debug!("Stopping redirect from {:?}", previous);
                return attempt.stop();
            }
        }
        tracing::debug!("No match for {}", attempt.url().path());
        reqwest::redirect::Policy::default().redirect(attempt)
    });
    Ok(policy)
}

oxidecomputer / third-party-api-clients

[github] method to download URLs from github #62