wangkuiyi / ci


A New Design #7

Closed wangkuiyi closed 8 years ago

wangkuiyi commented 8 years ago

Thanks to @reyoung, who noticed that CI needs more features before it can be used as the CI for Paddle. I crafted this new design in the hope of covering these features, namely #3, #4, and #5.

Could you please review it? Thanks!


CI

CI is a continuous integration system designed to run platform-dependent tests in GitHub repositories.

Architecture

The general architecture is as follows:

[Figure: arch (architecture diagram)]

Please be aware that the computers that run CI workers might not be in the same cluster. In fact, they could be scattered around the world -- some in an IDC, some in the office, and some at somebody's home.

Configurations

A configuration is defined as a script in the /.ci sub-directory of a GitHub repo. The name of a configuration is the filename of the script. If the script is supposed to run on Windows, its extension should be .bat; otherwise, it should be .sh. For example, linux-gpu.sh and windows.bat.

A configuration script can assume that the code is already present, because the CI worker that runs the configuration checks out the code first. The script usually builds the project and runs tests.
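As a rough illustration of the naming rule above, a worker could pick the interpreter from the script's extension. This is only a sketch: the function name `commandFor` and the choice of `sh` and `cmd /C` as interpreters are assumptions, not part of the design.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// commandFor returns a (hypothetical) command line that a worker could use
// to run a configuration script stored in the .ci sub-directory.
func commandFor(config string) ([]string, error) {
	script := filepath.Join(".ci", config)
	switch strings.ToLower(filepath.Ext(config)) {
	case ".sh":
		// Assumption: Unix-like workers run scripts with sh.
		return []string{"sh", script}, nil
	case ".bat":
		// Assumption: Windows workers run batch files with cmd /C.
		return []string{"cmd", "/C", script}, nil
	default:
		return nil, fmt.Errorf("unsupported configuration %q", config)
	}
}

func main() {
	for _, c := range []string{"linux-gpu.sh", "windows.bat", "README.md"} {
		cmd, err := commandFor(c)
		fmt.Println(c, cmd, err)
	}
}
```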

Assignments

A CI worker should be started with a command line flag that specifies the comma-separated configurations it should run. For example:

ci -configuration=linux-cpu-only.sh,linux-gpu

Once triggered, the worker checks out the specified commit of the source code from GitHub and runs the configuration scripts. If a worker is assigned multiple configurations, it always runs them in parallel.
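A minimal sketch of this behavior in Go might look as follows. The helper names `parseConfigurations` and `runAll` are hypothetical, and the `run` callback stands in for checking out the commit and executing the script, which is not shown here.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
	"sync"
)

// parseConfigurations splits the comma-separated flag value into names,
// trimming whitespace and dropping empty entries.
func parseConfigurations(value string) []string {
	var names []string
	for _, n := range strings.Split(value, ",") {
		if n = strings.TrimSpace(n); n != "" {
			names = append(names, n)
		}
	}
	return names
}

// runAll runs every configuration in its own goroutine, waits for all of
// them to finish, and returns the error (or nil) of each, keyed by name.
func runAll(names []string, run func(name string) error) map[string]error {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]error, len(names))
	)
	for _, name := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			err := run(name)
			mu.Lock()
			results[name] = err
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return results
}

func main() {
	configuration := flag.String("configuration", "", "comma-separated configurations to run")
	flag.Parse()

	results := runAll(parseConfigurations(*configuration), func(name string) error {
		// Placeholder: a real worker would execute the script here.
		fmt.Println("would run", name)
		return nil
	})
	fmt.Println(len(results), "configuration(s) finished")
}
```

Collecting the results in a map keyed by configuration name would also make it straightforward to report each configuration's status separately.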

Triggering

To configure a GitHub repo to use CI, we need to set a Webhook in this repo pointing to the URL of the CI master. If we don't have a computer with a public, static IP address to run the CI master, we can use ngrok. Once a user runs git push, GitHub invokes the CI master with a JSON payload, which includes the Git commit's SHA.
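The master's side of the Webhook could be sketched as below. It relies on the `after` field of GitHub's push event payload, which carries the SHA of the most recent commit after the push. The names `commitSHA` and `webhookHandler` and the `/hook` path are assumptions made for illustration, and the request in `main` is simulated with `httptest` rather than sent by GitHub.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// pushEvent holds the only field of the push payload that this sketch uses.
type pushEvent struct {
	After string `json:"after"`
}

// commitSHA extracts the commit SHA from a push payload.
func commitSHA(payload []byte) (string, error) {
	var ev pushEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return "", err
	}
	if ev.After == "" {
		return "", errors.New(`payload has no "after" field`)
	}
	return ev.After, nil
}

// webhookHandler passes the SHA of each valid push payload to onPush.
func webhookHandler(onPush func(sha string)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "cannot read payload", http.StatusBadRequest)
			return
		}
		sha, err := commitSHA(body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		onPush(sha)
		w.WriteHeader(http.StatusAccepted)
	}
}

func main() {
	h := webhookHandler(func(sha string) { fmt.Println("trigger workers for", sha) })

	// Simulate a delivery instead of listening on a real port.
	req := httptest.NewRequest(http.MethodPost, "/hook", strings.NewReader(`{"after":"0123abc"}`))
	rec := httptest.NewRecorder()
	h(rec, req)
	fmt.Println("status:", rec.Code)
}
```

In a real master, the handler would be registered with `http.Handle` and served with `http.ListenAndServe`.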

Dispatching

The CI master then triggers all workers via Go RPCs. To let the master know about the workers, we have multiple choices. One is to use ngrok to give each CI worker customized URLs named after the configurations that the worker runs. Because the master can clone the GitHub repo and list the configurations in the /.ci sub-directory, it can enumerate all configurations and invoke those customized URLs. However, ngrok.com charges per customized URL. To minimize the cost while keeping the system scalable, we would not want to assign each worker a customized URL.

We have an alternative that requires only one customized URL, for the CI master. In this way, each CI worker, when started, is given the master's customized URL and its own randomly-assigned URL:

ci -configuration=linux-cpu-only.sh,linux-gpu \
   -master=http://paddle-ci.ngrok.com \
   -me=http://abcd1234.ngrok.com

so that it can register its own URL with the master when it starts. The master needs to maintain a persistent list of all workers -- whenever a GitHub Webhook triggers the master, the master triggers all listed workers.
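One way the master's worker list could be kept is sketched below. The `Registry` type, its JSON file format, and the file name are all hypothetical. The `Register` method is shaped so that it could be exposed through Go's `net/rpc` package, but the RPC wiring itself is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"sync"
)

// Registry is a hypothetical master-side list of worker URLs that is
// written to a JSON file after every change.
type Registry struct {
	mu   sync.Mutex
	path string
	urls map[string]bool
}

// OpenRegistry loads the list from path, or starts empty if no file exists.
func OpenRegistry(path string) (*Registry, error) {
	r := &Registry{path: path, urls: map[string]bool{}}
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return r, nil
	}
	if err != nil {
		return nil, err
	}
	var list []string
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	for _, u := range list {
		r.urls[u] = true
	}
	return r, nil
}

// Register records a worker URL (the value of its -me flag) and persists
// the list. Registering the same URL twice is harmless.
func (r *Registry) Register(url string, ok *bool) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.urls[url] = true
	*ok = true
	data, err := json.Marshal(r.sorted())
	if err != nil {
		return err
	}
	return os.WriteFile(r.path, data, 0o644)
}

// Workers returns a sorted snapshot of the registered URLs.
func (r *Registry) Workers() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.sorted()
}

// sorted must be called with r.mu held.
func (r *Registry) sorted() []string {
	list := make([]string, 0, len(r.urls))
	for u := range r.urls {
		list = append(list, u)
	}
	sort.Strings(list)
	return list
}

func main() {
	reg, err := OpenRegistry(filepath.Join(os.TempDir(), "ci-workers.json"))
	if err != nil {
		panic(err)
	}
	var ok bool
	if err := reg.Register("http://abcd1234.ngrok.com", &ok); err != nil {
		panic(err)
	}
	fmt.Println(reg.Workers())
}
```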

Fault Tolerance

Please be aware that the maintenance of the list of worker URLs is quite fault-tolerant -- even if there are some out-of-date (dead) links in it, the master can still trigger all live workers.

Indeed, if the RPC call from the master to a worker's URL fails, the master can mark the URL as dead and remove it from the list.

Also, whenever a worker gets restarted after crashing, it re-registers itself and thus adds its active URL back to the list maintained by the master.
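The pruning step could be sketched as follows. Here `triggerAll` and `errUnreachable` are hypothetical names, and the `trigger` callback stands in for the real Go RPC call to a worker, which this sketch does not implement.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a stand-in for whatever error a failed RPC would return.
var errUnreachable = errors.New("worker unreachable")

// triggerAll calls trigger for every worker URL and returns only the URLs
// whose call succeeded, so that dead links drop out of the list.
func triggerAll(workers []string, sha string, trigger func(url, sha string) error) []string {
	var alive []string
	for _, url := range workers {
		if err := trigger(url, sha); err != nil {
			fmt.Printf("dropping dead worker %s: %v\n", url, err)
			continue
		}
		alive = append(alive, url)
	}
	return alive
}

func main() {
	workers := []string{"http://abcd1234.ngrok.com", "http://dead0000.ngrok.com"}

	// Fake trigger: pretend the second worker has crashed.
	fake := func(url, sha string) error {
		if url == "http://dead0000.ngrok.com" {
			return errUnreachable
		}
		return nil
	}

	workers = triggerAll(workers, "0123abc", fake)
	fmt.Println("remaining workers:", workers)
}
```

A crashed worker that later restarts would simply register again, so dropping its URL on the first failure loses nothing permanent.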

wangkuiyi commented 8 years ago

Thanks to @emailwei for the kind reminder that the CI computers, if put in the office, need to be separated from the office network for security reasons. I checked with the IT team, and they confirmed that we can simply connect CI computers to the Guest wifi. We might have to put Windows computers in the office. We can rent GPU servers from Amazon AWS, so we don't have to put them in the office.

wangkuiyi commented 8 years ago

I second the conclusion from a discussion with @reyoung on WeChat:

The ongoing work in https://github.com/wangkuiyi/ci/pull/6 implements a different idea -- rather than merging results from all workers, https://github.com/wangkuiyi/ci/pull/6 assumes that each configuration/worker is an independent CI with its own Webhook.

It is true that in the future we may have too many configurations and thus too many icons for each PR. However, we can merge them into one result/icon by introducing the CI master at that time.

So, let us continue reviewing https://github.com/wangkuiyi/ci/pull/6 and get it merged.