pachyderm / pachyderm

Data-Centric Pipelines and Data Versioning
https://www.pachyderm.com/
Apache License 2.0

Service pipeline stops serving static files after new data committed #8375

Open dsgibbons opened 2 years ago

dsgibbons commented 2 years ago

What happened?:

I'm using a pipeline that generates artifacts, which are then served by a Flask app embedded in a service pipeline. Some of these artifacts are PNG images that need to be served from Flask's static folder. To achieve this, I use the following service pipeline:

service.json

{
    "pipeline": {
      "name": "dummy_service"
    },
    "input": {
        "pfs": {
            "glob": "/",
            "repo": "do_stuff"
        }
    },
    "service": {
        "external_port": 30887,
        "internal_port": 8887
    },
    "transform": {
        "cmd": [ "/bin/bash" ],
        "stdin": [
            "ln -s /pfs/do_stuff static",
            "python app.py" 
        ],
        "image": "dummy:0.0.1"
    }
}

where dummy:0.0.1 is created from the following Dockerfile:

Dockerfile

FROM python
RUN pip install flask flask-cors --no-cache-dir 
WORKDIR /app/
COPY app.py .

and app.py is as follows:

app.py

from flask import Flask, url_for, redirect
from flask_cors import CORS

def create_app():
    app = Flask(__name__) 

    CORS(app)

    @app.route("/")
    def home():
        return "home"

    @app.route("/<file>")
    def view_file(file):
        return redirect(url_for("static", filename=f"{file}.png"))

    return app

if __name__ == "__main__":
    app = create_app()
    app.run(host="0.0.0.0", port=8887)

Let's say the do_stuff pipeline processes a datum and produces the output image1.png. Then, I can deploy the above pipeline using:

pachctl create pipeline -f service.json

I can then navigate to 127.0.0.1:30887/image1 and it will successfully display 127.0.0.1:30887/static/image1.png. However, if do_stuff processes a new datum and produces image2.png, then all static locations end up breaking (including image1). Running

pachctl update pipeline -f service.json --reprocess

fixes the problem and allows both files to be served from static.

While broken, the logs show the following (where pachyderm stands in for image1 and elephant for image2):

log

 * Serving Flask app 'app'
 * Debug mode: off
Address already in use
Port 8887 is in use by another program. Either identify and stop that program, or start the server with a different port.
Address already in use
 * Serving Flask app 'app'
 * Debug mode: off
Port 8887 is in use by another program. Either identify and stop that program, or start the server with a different port.
 * Serving Flask app 'app'
 * Debug mode: off
Address already in use
Port 8887 is in use by another program. Either identify and stop that program, or start the server with a different port.
 * Serving Flask app 'app'
 * Debug mode: off
Address already in use
Port 8887 is in use by another program. Either identify and stop that program, or start the server with a different port.
10.42.0.1 - - [16/Nov/2022 22:27:14] "GET / HTTP/1.1" 200 -
10.42.0.1 - - [16/Nov/2022 22:27:19] "GET /pachyderm HTTP/1.1" 302 -
10.42.0.1 - - [16/Nov/2022 22:27:19] "GET /static/pachyderm.png HTTP/1.1" 404 -
10.42.0.1 - - [16/Nov/2022 22:27:36] "GET /elephant HTTP/1.1" 302 -
10.42.0.1 - - [16/Nov/2022 22:27:36] "GET /static/elephant.png HTTP/1.1" 404 -
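
For reference, "Address already in use" is the error a process gets when it tries to bind a TCP port that another process (or another socket) is still listening on. The following is a minimal Python sketch of that condition only; it is not Pachyderm or Flask code, and the helper name try_bind is made up for illustration:

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind and listen on (host, port).

    Returns the listening socket, or None if the port is already taken.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None  # same condition as "Address already in use" in the log
        raise


if __name__ == "__main__":
    first = try_bind(0)  # port 0 asks the OS for any free port
    port = first.getsockname()[1]
    second = try_bind(port)  # a second listener on the same port
    print("second listener started:", second is not None)
    first.close()
```

In the log above, each new "Serving Flask app" attempt plays the role of the second listener.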

Note that I am using a symlink in service.json to treat /pfs/do_stuff as static. Perhaps this is not the best way to serve files from static? Do you have any other suggestions?

What you expected to happen?:

I should be able to process new datums and serve the newly created artifacts from the downstream service's static folder without restarting the service pipeline.

How to reproduce it (as minimally and precisely as possible)?:

Start by downloading some images:

curl https://images.g2crowd.com/uploads/product/image/social_landscape/social_landscape_3cecb217ddd404cf2e71d2612ea5d37f/pachyderm.png --output pachyderm.png
curl https://e7.pngegg.com/pngimages/1001/89/png-clipart-walking-elephant-like-elephant.png --output elephant.png

Then, create a new repo and add one of the png files:

pachctl create repo dummy_data
pachctl put file dummy_data@master -f pachyderm.png

Then, use the following pipeline to do some "processing":

pipeline.json

{
    "pipeline": {
      "name": "do_stuff"
    },
    "input": {
        "pfs": {
            "glob": "/*",
            "repo": "dummy_data"
        }
    },
    "transform": {
        "cmd": ["/bin/bash"],
        "stdin": [
            "cp /pfs/dummy_data/* /pfs/out/"
        ],
        "image": "dummy:0.0.1"            
    }
}

Create the two pipelines:

pachctl create pipeline -f pipeline.json
pachctl create pipeline -f service.json

Navigate to 127.0.0.1:30887 and check that the service is running. "home" should appear at the top of the page. Then, navigate to 127.0.0.1:30887/pachyderm, which should redirect to 127.0.0.1:30887/static/pachyderm.png, and the Pachyderm logo should appear.

Add the next datum:

pachctl put file dummy_data@master -f elephant.png

After the file has been processed, go to either 127.0.0.1:30887/pachyderm or 127.0.0.1:30887/elephant and both links will be broken.

Running

pachctl update pipeline -f service.json --reprocess

fixes both 127.0.0.1:30887/pachyderm and 127.0.0.1:30887/elephant.

Anything else we need to know?:

Environment?:

BOsterbuhr commented 2 years ago

Thanks @dsgibbons for submitting this. I'll talk to the team and we will get back to you shortly.

jrockway commented 2 years ago

An immediate workaround would be to run "exec python app.py" instead of "python app.py" in the transform's stdin, though we tested that internally and didn't get it to work in every case. It worked with my random version of python3 and flask, but not with someone else's.

The problem here is that we send SIGKILL to bash, not python, and bash just leaves python running when it gets SIGKILL. With exec python, bash "turns into" python, and then we kill that. (But if python spawns additional processes to handle HTTP requests, we can't kill those.)
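
The difference between the two forms can be reproduced outside Pachyderm. The sketch below is only an illustration under stated assumptions, not Pachyderm's worker code: it feeds a one-line script to /bin/bash on stdin, sends SIGKILL to bash, and then checks whether anything from that process group is still alive. The SERVER command (a shell that prints "ready" and then becomes sleep) is a stand-in for python app.py, and the function name survives_sigkill is made up for this example. It assumes a POSIX system with /bin/bash, sh, and sleep available.

```python
import os
import signal
import subprocess

# Stand-in for `python app.py`: announce readiness, then keep running.
SERVER = "sh -c 'echo ready; exec sleep 30'"


def survives_sigkill(script):
    """Run `script` via bash's stdin, SIGKILL bash, and report whether
    a process started by the script is still alive afterwards."""
    proc = subprocess.Popen(
        ["/bin/bash"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,  # own process group, so leftovers can be found
    )
    proc.stdin.write(script + "\n")
    proc.stdin.close()
    proc.stdout.readline()  # wait until the stand-in server printed "ready"

    os.kill(proc.pid, signal.SIGKILL)  # kill only the original bash PID
    proc.wait()

    try:
        os.killpg(proc.pid, 0)  # signal 0 just checks that the group has members
    except ProcessLookupError:
        survived = False
    else:
        survived = True
        os.killpg(proc.pid, signal.SIGKILL)  # clean up the orphan
    proc.stdout.close()
    return survived


if __name__ == "__main__":
    print("plain command survives:", survives_sigkill(SERVER))
    print("exec'd command survives:", survives_sigkill("exec " + SERVER))
```

With the plain form, bash forks a child and the child should outlive the SIGKILL; with the exec form, the command replaces bash under the same PID, so the SIGKILL should reach it directly. As noted above, this does not cover any extra processes the server itself spawns.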

Longer term, PR #8385 fixes this problem in general. I'll let you know when that's available to test in a nightly release.

dsgibbons commented 2 years ago

Great, thank you very much for that @jrockway. I'll try the short-term fix and see if that helps. I look forward to testing your PR.

jrockway commented 2 years ago

My fix is in v2.5.0-nightly.20221130, which you can install with something like helm install my-pachyderm pachyderm/pachyderm --version 2.5.0-nightly.20221130. Don't use the nightly on your production cluster or anything, because you won't be able to downgrade from the nightly build to the stable release. (You might also need a new version of pachctl, which you can grab from here: https://github.com/pachyderm/pachyderm/releases/tag/v2.5.0-nightly.20221130)

I've tested it with a scenario similar to yours, so I think it should fix things, but I'm interested in actual real world feedback of course 😂

dsgibbons commented 2 years ago

Looks good. I'll raise this with my team and see if we can test this fix next week. Thank you!

jrockway commented 2 years ago

That is good to hear. Feel free to @jrockway if you have anything to report!

dsgibbons commented 1 year ago

@jrockway We tried pulling the latest helm chart version, but it only seems to go up to 2.4.1 (which we currently have installed). Is your public helm chart updated to include nightly builds? Are the pachd container images for nightly builds public? If so, where can we find them?

jrockway commented 1 year ago

We update the Helm chart and Docker Hub with prereleases. Apparently the chart version contains the commit id, so 2.5.0-nightly.20221205-1f8686882d20250a137667d7a13027e6198da5f5 instead of just 2.5.0-nightly.20221205 is required. "helm search repo pach/pachyderm --devel" will show the latest prerelease. (But the name is too long to copy-paste out of that output, so that's fun.)

I don't know why that's the case. Probably a mistake on our end.

dsgibbons commented 1 year ago

Ok I've finally gotten around to testing this fix. It works in the sense that the service pipeline doesn't go down if a new datum is added. Unfortunately, the new datum is not recognised by the service pipeline - a restart of the service pipeline is required to locate the new datum in the static/ folder. Perhaps symlinking to static/ is not the best way to have the service pipeline update to serve new artefacts? Do you have any suggestions?
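
One way to narrow this down is to check, from inside the running service container, what the static symlink resolves to and what is visible through it before and after a new commit. The helper below is a hypothetical diagnostic sketch using only the Python standard library; the name describe_link is invented here, and it makes no assumption about how Pachyderm refreshes /pfs, which this thread does not establish.

```python
import os


def describe_link(link):
    """Report where `link` points right now and what is visible through it."""
    target = os.readlink(link) if os.path.islink(link) else None
    resolved = os.path.realpath(link)  # follow every symlink in the chain
    exists = os.path.isdir(resolved)
    return {
        "link": link,
        "target": target,      # raw symlink target, or None if not a symlink
        "resolved": resolved,  # fully resolved path
        "exists": exists,      # False means the link is dangling or not a directory
        "entries": sorted(os.listdir(resolved)) if exists else [],
    }


if __name__ == "__main__":
    # "static" is the symlink created by `ln -s /pfs/do_stuff static` in service.json.
    print(describe_link("static"))
```

If the entries list stays the same after a new datum is processed, the directory the symlink points at is not being updated in place; if exists turns False, the link itself has gone stale.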