vmware-archive / fly

old - now lives in https://github.com/concourse/concourse

upload request failed: Put /api/v1/pipes/5dbfe572-45b0-45d2-517a-e1864363608b: unsupported protocol scheme "" #138

Closed · airtonix closed this issue 7 years ago

airtonix commented 7 years ago

Going through the tutorial, I assumed I could swap in my own Concourse server (running on CoreOS via Rancher):

~/Projects/Others/concourse-tutorial/02_task_inputs master*
❯ fly -t tutorial e -c inputs_required.yml -i some-important-input=.        
executing build 16
upload request failed: Put /api/v1/pipes/5dbfe572-45b0-45d2-517a-e1864363608b: unsupported protocol scheme ""
curl: (3) <url> malformed
gzip: invalid magic
tar: Child returned status 1
tar: Error is not recoverable: exiting now
exit status 2
failed
~/Projects/Others/concourse-tutorial/02_task_inputs master*
❯ cat ~/.flyrc
targets:
  lite:
    api: http://concourse.rancher.fusion.one
    team: main
    token:
      type: Bearer
      value: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE0NzkyNzA0MTgsImlzQWRtaW4iOnRydWUsInRlYW1JRCI6MSwidGVhbU5hbWUiOiJtYWluIn0.ZHzm02X2NsyVOuMMJCmPU9PulH-hJsw8HwP3kbayIVk8tC7qxrtgoouwLPNpDlMlckY0FLq0ZekolYd7ixo9H3zuNPwlZD9aZgZC9_B-vUobLLcuXY_nVJblBNTdSvd8r58iVQxc1Lif3nGNdc0w8zfBI62zc52MUNp7PGtfwBPco3A6F8-Gz1IqV4GuPnJhPL7X9M_RPM0MrK8NlqJAh997gL7wSliUZdFIGt9ONwBjsnoeyn4THS39sKeZRwz8fymZHSJbA4f55TNkfT_DZ65F9LeWE0yD0sv3kdar1NJWn3xJ09MX3rKx5C7Gx_CmrodGBus-Ix2IPoHWTmRdAQ
  main:
    api: http://concourse.rancher.fusion.one
    team: main
    token:
      type: Bearer
      value: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE0ODE4NDg5MzUsImlzQWRtaW4iOnRydWUsInRlYW1JRCI6MSwidGVhbU5hbWUiOiJtYWluIn0.KHfgN0C_YVYcTes2ayJczGjOnTMlWh5mthaBmomUb70eR0qA84LR1iQAhLQZbtX34aq-rqJVdSr4kQg1ev9dQtnpVEPaWe_8-YpJaQVNchgU58ZiZFra41Nfy0LCXNjYmiHXvrVB3SUgFyy5-1JML-REasNvmwblwyCgccXCAaw_9yXLnwT9rutJIxwoLa6WBFTj03On9ok7_9linakCz7ZnP8vPFLc37BMyX1ZTmMQcncgQ7nCavNjP_rvabqPm2kQVCSK8pbqqXIKwYjnoWCDArW1PG48DyrAaqvLAO-7bq20lCgzlxXMGH0gKyz4pylmuijHcj08k4TA-QvrPUA
  tutorial:
    api: http://concourse.rancher.fusion.one
    team: tutorial
    token:
      type: Bearer
      value: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE0ODE4NDg5NTIsImlzQWRtaW4iOnRydWUsInRlYW1JRCI6MSwidGVhbU5hbWUiOiJtYWluIn0.LG97MGm_W-gpwX9XCZAKM9QI2_BX6Yb3lx-j0YWiDOnrQXBvqXs3nj6dRfvpzKrXP-gkufs94lKLiJymciJAo56q-IijZHVJETRGV7uee2tFipbu6mG490YWdDeKGJJIepY13PClkiDwu-HdGrIXS6tvzxmn4a6kdZ3q5W4GK8OIOxtfl5ekw-qfnM-BfJgR6ysIEo4V-ctMpfaN_aiImzmaFWV5JLdGipMyaAS-RRQZWPzOmEitMAhXXpDoHcBnBS8KMwz0hqCzwahbvvWayOD4VdOzTcBYeExtH0HkoQPHlQ3XWpHJjkQPfjivWwLhAV_6fr7K0TqLe3BbWTo9OQ
~/Projects/Others/concourse-tutorial/02_task_inputs master*
❯ cd ../01_task_hello_world 

~/Projects/Others/concourse-tutorial/01_task_hello_world master*
❯ fly -t tutorial execute -c task_hello_world.yml                           
executing build 17
initializing
running uname -a
Linux 2b8af31c-ed0b-43b9-411f-ab67345418f7 4.4.21-rancher #1 SMP Sat Oct 15 07:53:05 UTC 2016 x86_64 Linux
succeeded
~/Projects/Others/concourse-tutorial/01_task_hello_world master* 8s
❯ fly -t tutorial pipelines
name  paused  public

vito commented 7 years ago

Do your web nodes configure a peer URL and external URL?

deitch commented 7 years ago

Bingo. Been looking for this. I, too, am running through the tutorial, and got the same issue when trying to pass -i inputs. Running execute tasks without command-line inputs is fine, but it fails on this.

Do your web nodes configure a peer URL and external URL?

I have it all running as Docker containers on a single node (Docker for Mac, FWIW); I just used the default docker-compose.yml from the "Installing" guide on http://concourse.ci.

So, no, I didn't configure EXTERNAL_URL, because I didn't think it mattered with everything running locally. Does it?

deitch commented 7 years ago

Now I am more confused than ever. I did try setting CONCOURSE_EXTERNAL_URL to https://127.0.0.1:8080 (although the docs say not to), just to set it to something. Of course the job fails:

executing build 1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
gzip: invalid magic
tar: Child returned status 1
tar: Error is not recoverable: exiting now
exit status 2

If I try unsetting it, all jobs hang on initializing.

To a large degree, the problem is that it is unclear what the purpose of CONCOURSE_EXTERNAL_URL is and how to work with it in various scenarios.

Based on the config, it looks like workers connect via CONCOURSE_TSA_HOST, so it isn't for the worker. What exactly is its purpose, and how do you handle it in various scenarios? The docs say:

It can't be 127.0.0.1 or localhost as it has to also work in a separate network namespace, for execute to be able to receive your uploaded bits from the web node.

OK, so it isn't for external access, since execute is using it (from the worker node, to "receive uploaded bits"?). But sometimes the external URL is 127.0.0.1, e.g. in a port-mapped docker-compose run.

I am very confused.

deitch commented 7 years ago

OK, ignore the initializing problem. That is something else, unable to connect to the docker registry. I will open a separate issue.

airtonix commented 7 years ago

The URL would need to be the IP as seen from the fly client I suppose?

deitch commented 7 years ago

The URL would need to be the IP as seen from the fly client I suppose?

Based on the name, that would make sense. But what I don't get is:

  1. Why does it need it?
  2. What do you do if you are accessing it via 127.0.0.1, e.g. running it port-mapped in Docker, or running the binary locally?
vito commented 7 years ago

@deitch It needs to be an address that the workers can reach. The way execute gets the inputs to the workers is to have fly upload them to the ATC's "pipe" API, and pass a download URL along to the workers. So the workers need to be able to reach it.

We default it to 127.0.0.1 mainly for development purposes as it's also used for aspects of the web UI where it doesn't matter as all traffic will be from the browser. That'll never work for execute though, so admittedly it's not a great first-time experience, but there's not really a sane default for it imo, beyond guessing your network stack.

You'll want to set it to your machine's LAN IP, usually 172.x.x.x for Docker or 10.x.x.x for e.g. a binary in a private network (e.g. an AWS VPC). Or, if you have a domain set up (maybe pointed to a load balancer), configure that as the external URL. Ours, for example, is https://ci.concourse.ci.
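
For illustration, a minimal sketch of that setting in the docker-compose.yml from the "Installing" guide (the service layout and the LAN IP here are placeholders for the example, not values from this thread):

concourse-web:
  image: concourse/concourse
  command: web
  environment:
    # Must be reachable from inside containers on the workers,
    # so a LAN IP or load-balancer hostname, never 127.0.0.1.
    CONCOURSE_EXTERNAL_URL: http://192.168.1.50:8080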

deitch commented 7 years ago

@vito thanks for the response. Let me see if I get this.

  1. I execute a job with some input context using fly
  2. The fly client zips up the input context and loads it via the API. Now it is available on the ATC node
  3. The ATC calls a worker, which now needs to download that context.
  4. The ATC passes the worker a URL from which it can download the context

That URL in the last step is the one that is defined by the CONCOURSE_EXTERNAL_URL.

Is that right?

vito commented 7 years ago

@deitch Yep. Technically the ATC and worker don't know the specifics of any of this; fly puts together a build plan that just has the read end of the pipe configured as the download URL for the get of the archive resource.

deitch commented 7 years ago

Got it. So CONCOURSE_EXTERNAL_URL really means, "url at which workers can access the ATC".

And if I understand you correctly, it is fly that puts together the plan that is just transmitted by the ATC to the workers.

May I raise a number of issues?

  1. The naming is confusing. It really should be CONCOURSE_ATC_WORKER_URL or something like that. It has little to do with external, but lots to do with internal (i.e. worker access).
  2. It creates problems. What if I have an external load balancer in front of the server, so my fly client talks to one URL, but workers should talk to another?
  3. It doesn't scale. Let's say I have a small team of 30 devs all of whom need to execute jobs. Only a few of them (or none) know the URLs used internally, and those might change. They only should need to know the true external API URL.
  4. Honestly? Eliminate it entirely. It is brittle, and makes adoption confusing. As long as ATC can communicate with workers (by definition, since the ATC communicates the job), workers can communicate with ATC. I would even go so far as to say that the ATC-initiated communication to worker should be the same socket over which the context is downloaded.
  5. Get the logic out of fly. If the build plan is put together by the client CLI, much logic is embedded in there, and makes tight coupling between the CLI version and the server version (and architecture).

FWIW, Kubernetes went through something similar with certain parts of kubectl. It had a lot of logic embedded in it, where it made the detailed API calls, which in turn meant things like interrupted connections breaking operations midstream, no ability to "fire and forget", and messy changes. They moved the logic into the server so it became a simple standard API call (kubectl became a lighter API wrapper), and things were dramatically simpler.

Just $0.02...

vito commented 7 years ago

No, it does not mean that; it's used for much more than just worker communication. It means precisely "URL at which anything can reach the ATC(s)".

  1. It's named exactly what it should be for the multitude of things it's used for: resource metadata (so notifications can link to their own builds), the auth API (so fly can link to an auth endpoint without knowing particulars about the ATC API), the pagination API (Link headers), the pipes API (so fly can just blindly pass along a download URL). They all have the same generic need for the externally-reachable URL. This is also a pretty common thing to configure in other applications.
  2. If you have an external load balancer, that's exactly what you should configure as the external URL.
  3. The true external API URL is all they need to know, and all that should be configured.
  4. Even without this use case it's needed for other things. Again, if it's set to the right value, this problem goes away.
  5. What logic in particular? It's just passing along a download URL, it doesn't care about external or worker URLs.
deitch commented 7 years ago

If you have an external load balancer, that's exactly what you should configure as the external URL

Workers that are inside a private subnet should try to reach another endpoint in the same private subnet via an external load balancer in a public subnet that is accessible only from the Internet?

It's named exactly what it should be for the multitude of things it's used for: resource metadata (so notifications can link to their own builds), the auth API (so fly can link to an auth endpoint without knowing particulars about the ATC API), the pagination API (Link headers), the pipes API (so fly can just blindly pass along a download URL). They all have the same generic need for the externally-reachable URL.

That's the part I don't get. Why should a client need to know anything other than the single URL to access a service? If I use kubectl, all I need is the k8s master URL; if I use Rancher, all I use is the Rancher URL; if I connect to Salesforce, all I need is the Salesforce URL.

This is also a pretty common thing to configure in other applications

I haven't seen it (which may speak more to what I have seen). Usually I see a single API endpoint to which a CLI communicates over an API. Everything else is managed by the service endpoint.

Is there an architectural and services diagram that shows all of these components, their purposes, what they use and how they communicate?

vito commented 7 years ago

Workers that are inside a private subnet should try to reach another endpoint in the same private subnet via an external load balancer in a public subnet that is accessible only from the Internet?

Yes, because otherwise you have to allow your containers to call into your private network, which is a terrible idea if you're running things like pull-requests (i.e. untrusted code) or are running a multi-tenant Concourse instance. It would also require more internal plumbing to not just use the external URL, as suddenly you may need to pass around all internal IPs of the cluster rather than just a single URL pointing to a load balancer. Given that we already need the external URL configured not just for this use case, I really don't see why we would want a second value to configure.

That's the part I don't get. Why should a client need to know anything other than the single URL to access a service? If I use kubectl, all I need is the k8s master URL; if I use Rancher, all I use is the Rancher URL; if I connect to Salesforce, all I need is the Salesforce URL.

I don't get what you mean. The external URL is the exact value the client would use to access the service. It is a single URL.

I haven't seen it (which may speak more to what I have seen). Usually I see a single API endpoint to which a CLI communicates over an API. Everything else is managed by the service endpoint.

Grafana, for example, requires you to configure the external URL for OAuth callbacks. Same goes for Concourse. The same value happens to be used in many scenarios, so we gave it a single config point with a generic name.

Is there an architectural and services diagram that shows all of these components, their purposes, what they use and how they communicate?

There's https://www.gliffy.com/go/publish/10463597 which hasn't been integrated into the docs yet. But this topic may be difficult to work into a general architecture diagram as the components don't strictly know that they're talking to each other in the fly execute case. That may be better covered by its own document.

deitch commented 7 years ago

Yes, because otherwise you have to allow your containers to call into your private network, which is a terrible idea if you're running things like pull-requests

So the whole purpose here is the internal runc containers and their access to the context?

There's https://www.gliffy.com/go/publish/10463597

That is a good start. It looks like Garden runs the containers, and Baggageclaim downloads the context(s) and mounts them as volumes into the containers. So is it Baggageclaim that needs the API for access?

This is a good start, but would be great to see a lot of the flows and how the parts interact.

vito commented 7 years ago

So the whole purpose here is the internal runc containers and their access to the context?

Yep. All network traffic in Concourse happens in containers; Baggageclaim or Garden or ATC never call out to e.g. GitHub themselves.

That is a good start. It looks like Garden runs the containers, and Baggageclaim downloads the context(s) and mounts them as volumes into the containers. So is it Baggageclaim that needs the API for access?

Baggageclaim is about as semantically smart as Garden, that is to say it isn't - ATC is just driving them both. It'll e.g. talk to Baggageclaim to create a volume, and then talk to Garden to mount that volume into a container, and then use that container to download stuff into the volume.

That container is then being told, "run /opt/resource/in with this JSON resource request", which then tells the archive resource to download something, in this case <external url>/api/v1/pipes/sdfkjhadskjhdsf, as fly told it to (after creating the pipe and starting to upload to its write end).
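
Loosely, that JSON resource request could look like the following (the source/version shape follows the standard resource "in" interface; the exact payload fly generates is an assumption here):

{
  "source": {
    "url": "<external url>/api/v1/pipes/<pipe-guid>"
  },
  "version": {}
}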

deitch commented 7 years ago

The architectural design here really is interesting. It is a pity there isn't a full write-up (well, wasn't, until you started describing it here :-) ).

So the process is something like:

  1. fly tells ATC to create a job with a given context
  2. fly zips up the context and uploads to ATC
  3. ATC tells Garden to create a runc container to run the job
  4. ATC tells Baggageclaim to create a volume for the context
  5. ATC tells Garden to tell the previously created container to mount the previously created volume
  6. ATC tells Garden to tell the previously created container to download the previously uploaded context from ATC into the previously created and mounted volume
  7. ATC tells Garden to tell the previously created container to run the job, since everything now is set up

I don't actually get why you need the volume, since the data is all ephemeral and will be thrown out after the task anyway, but it doesn't affect the meat of this discussion.

The part that still seems strange to me is requiring a fixed, external URL for the container to access the zipped context. Since it is basically downloading a single zip file (or perhaps several) over a known and controlled protocol (HTTP), and that is entirely within the ATC<->Garden<->runc container contract, I can see several other ways to do it:

  1. Auto-generate a URL by which ATC exposes its downloads to the containers. ATC knows where it lives, at least with respect to inside its cluster.
  2. Push the data to Garden. ATC knows how to find and communicate with Garden, it already supplies all other information, let it supply context files as well.
  3. Push the data to Baggageclaim. When creating the volume(s), tell Baggageclaim, "create the following volume and populate it with the following data." Or do it in 2 steps, create and then populate. It is communicating over http, not too hard to POST or PUT chunks of data.

My main concern is that the current structure is not maintaining separation between the external (fly, its config, and its communications channel with the API endpoint, which happens to be ATC), and internal (Garden, Baggageclaim, runc containers, volumes, and their communications channels with ATC). That, in turn, makes it brittle, e.g. when your IP actually is 127.0.0.1, and requires extra config, which makes it hard to move around and creates barriers to adoption.

Think about it. If I have concourse installed to run locally in a docker container, I would port-map ATC so I know fly always can reach it at, say, http://127.0.0.1:8555/ or similar, and save that in my config. But then my CONCOURSE_EXTERNAL_URL will change every time I set up and tear down?

I actually think this is one of the best process-structured CI/CD systems I have seen in a long time. I came to it because I needed a system for a proof-of-concept for a new cloud migration. They wanted Jenkins, some people wanted CircleCI or Travis, etc. An architect friend recommended this, and I loved it. Pipelines are great, everything makes sense, and I love how you use runc containers, making it actually work even if the workers are Docker containers. Brilliant! Sure, it still needs a larger ecosystem, but give it time (and some marketing love).

But the whole comms flow that depends on this var, its setting, and its network sensitivity made running locally via compose v1 and compose v2, then Rancher, then k8s, with all of the various networking structures, difficult for them to swallow.

So, yeah, not critical, trying to be helpful. :-)

vito commented 7 years ago

Yep, except for step 7 it'd actually be a second container that runs the task, using a copy-on-write volume of the volume the container in step 6 fetched into.

This is all in service of having everything speak one language: build plans. One way to look at it is that there are two consumers of the build plan API: fly execute and pipelines. There may be other use cases to have other tools generate build plans and submit them as one-offs to the ATC API. That is to say, it's an important architectural decision that fly execute is not a special case, and that the ATC API is expressive enough to support it among future potential clients.

To that end, part of the mental model for build plans is that every artifact that is provided to the build from an external source is via a get. The user's bits are an external source (it's on their laptop). All get steps fetch using resources, so fly uses the archive resource, which just does a simple curl and tar zxf of some URL it's given. A URL is a good generic starting point, but let's work backwards from there - how do we get a URL for the bits the user has on their local disk? We also want fancy things like progress bars so they're not just sitting there waiting.

The ATC provides a generic /api/v1/pipes endpoint for creating one-off endpoints for exactly this reason (basically what you said in point 1 in your suggestions). It's a simple API: create a pipe, get a GUID, PUT to one end, GET to the other, and it streams. So the archive resource is told to curl the read end, and you get a progress bar, and the bits end up in the container.
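
A rough sketch of that flow with curl (the /api/v1/pipes path and the PUT come from the error message at the top of this thread; the host, the creation method, and auth handling are assumptions):

# create a pipe; the response identifies its two ends by GUID
curl -X POST http://concourse.example.com/api/v1/pipes
# fly streams the archive into the write end...
tar cz . | curl -T - http://concourse.example.com/api/v1/pipes/<guid>
# ...while the archive resource in the container reads it back out
curl http://concourse.example.com/api/v1/pipes/<guid> | tar zxf -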

The key point here is that none of this is specialized for fly execute. Your steps 1-7 are the exact same steps we perform for any resource. Modeling it this way keeps the internals generic, the API consistent, and the client is pretty simple. It also works even if your workers live in a private network. Remember, there's no guarantee that your workers can necessarily reach the ATC by its internal IP. It may be a Mac Mini running on your office desk for running iOS testing, registered as a worker with a cluster running in AWS or GCP.

We could follow suggestion 3, but that would require some special-casing and some additional complexity in the build plans API, which I'd like to avoid if we can help it. IMO we can help it since, again, we already need the external URL for various other reasons.

Also no worries. I've been wanting to do a blog or something about "anatomy of fly execute". It's always fun to tell people that the progress indicator they see is actually the other end downloading from them, and technically not an upload progress bar. :)

deitch commented 7 years ago

Also no worries. I've been wanting to do a blog or something about "anatomy of fly execute". It's always fun to tell people that the progress indicator they see is actually the other end downloading from them, and technically not an upload progress bar. :)

Yes! I imagine half of my questions would disappear if that had existed (probably to be replaced by others...). I hope you do collate these, along with the diagram and some flow diagrams, into a solid series of posts.

To the subject, though...

Yep, except for step 7 it'd actually be a second container that runs the task, using a copy-on-write volume of the volume the container in step 6 fetched into.

OK, makes sense.

an important architectural decision that fly execute is not a special case, and that the ATC API is expressive enough to support it among future potential clients

No argument there. I'm a huge proponent of clean APIs, and the clients - even shipped with product - are nothing special.

It's a simple API: create a pipe, get a GUID, PUT to one end, GET to the other, and it streams.

So ATC's pipes API really is just a proxy, a streaming pipe? client -> ATC -> container (to volume) as a single stream? Or is it really two-step: the client creates a .tar.gz and uploads it; the container pulls it?

Remember, there's no guarantee that your workers can necessarily reach the ATC by its internal IP.

Got that. But there is a guarantee that the ATC can reach the workers (or it couldn't start jobs, get results, etc.), which is why I thought options 2 or 3 or any variation with push made sense.

We could follow suggestion 3, but that would require some special-casing and some additional complexity in the build plans API

Because the same API is used by the fly client and the pipeline, so we would need a way to tell it, "I am pushing to you"?

Still unclear about something. I think you are saying:

Is that about right?

Going back to my earlier comment, flow diagrams with steps and API calls would be incredibly helpful. As you said, there is a "mental model" here, but it isn't instantly visible.

In any case, if so, would option 3 become something like:

  1. fly upload to ATC
  2. ATC, when talking to Garden to kick off a pipeline job using the API, pushes the context bits down to the container
  3. Generate a local URL for the container, since the content is already pushed, thus keeping the same (or a similar) API

Call it pre-loading?

vito commented 7 years ago

So ATC's pipes API really is just a proxy, a streaming pipe? client -> ATC -> container (to volume) as a single stream? Or is it really 2-step: client create .tar.gz and uploaded; container pulls?

Yep, the pipes API is a streaming pipe. There's no "upload then download"; the bits just flow straight through. So there's no state to it, either.

Pipelines, however, have to be told, "go execute this", so ATC instantiates that process via Garden's HTTP API, telling a worker, "go start a pipeline".

The workers don't know anything about pipelines or builds - they just run volumes and containers, and are generic components usable outside Concourse. Garden, for example, is used by Cloud Foundry for running web apps. ATC is the brains behind it all.

There's also no such thing as really "starting a pipeline" - a pipeline is just a set of jobs, which describe a plan for builds to run, describing their dependencies (get) and when to run them (trigger: true).

Technically, the ATC is talking to itself - it's the one also scheduling pipelines and creating those build plans. But the point is at the end of the day there's a concrete "build plan" to run, which is the same "build plan" that fly execute constructs and submits to the ATC API.

In any case, if so, would option 3 would become something like: (snipped)

It'd probably entail introducing a new "get from pipe" step wherein the ATC would just create a volume and stream the pipe into it. But at that point we'd probably replace the 'pipe' API with something else that's more integrated with the flow.

deitch commented 7 years ago

the bits just flow straight through

Hence the name, "pipes". :-)

The workers don't know anything about pipelines or builds - they just run volumes and containers ... ... Technically, the ATC is talking to itself ...

So what tells Garden to create a new container, or Baggageclaim to create a volume? How is it given the parameters of what tasks to run, what binaries it needs to run them? How is it told to download the data from the pipe?

vito commented 7 years ago

So what tells Garden to create a new container, or Baggageclaim to create a volume? How is it given the parameters of what tasks to run, what binaries it needs to run them? How is it told to download the data from the pipe?

ATC. It's given them via the /api/v1/builds endpoint that Fly submits a build plan to. Fly's build plan includes a get of type archive with the URL as the read end of the pipe.
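
Loosely sketched, that submission could look like this (the plan structure is illustrative, not the exact ATC schema; only the endpoint, the get step, the archive type, and the pipe URL come from this thread, and the input name is borrowed from the failing command at the top):

POST /api/v1/builds

{
  "plan": {
    "get": "some-important-input",
    "type": "archive",
    "source": {
      "url": "<external url>/api/v1/pipes/<pipe-guid>"
    }
  }
}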

deitch commented 7 years ago

I think I need to try and find time to dig into it. I need to understand the parts (that diagram helps), the flows, and where that URL is used.

chendrix commented 7 years ago

Moved to concourse/concourse#1102