rpeden / prefect-docker-compose

A repository that makes it easy to get up and running with Prefect 2 using Docker Compose.

Use `S3Bucket` from `prefect-aws` instead of `RemoteFileSystem` to access MinIO #5

Open rpeden opened 1 year ago

rpeden commented 1 year ago

RemoteFileSystem is flexible, but its reliance on aiobotocore means it is pinned to an older version of botocore, which sometimes causes API errors when accessing S3 and S3-compatible storage.

Since the S3 Bucket from the prefect-aws collection has excellent MinIO support, this repo should use S3Bucket to access MinIO to ensure optimal developer experience.

tomasrollo commented 1 year ago

In the Prefect UI, how do you specify the MinIO URL when adding the S3 bucket block? Or is it possible when creating the block some other way (CLI, via API)? Thx!

tomasrollo commented 1 year ago

That's what I thought but there does not seem to be a settings section in the S3 block in the GUI?

rpeden commented 1 year ago

@tomasrollo There are two different S3 blocks - one is just called S3 and the other is called S3 Bucket:

[Screenshot of the S3 and S3 Bucket blocks in the Prefect UI]

S3 Bucket has a section for MinIO credentials. The catch is that you need to install the prefect-aws pip package if you want to use S3 Bucket in your own Prefect server.

The easiest way to do this is via the EXTRA_PIP_PACKAGES environment variable. I just updated docker-compose.yml to add a commented-out env variable that will install prefect-aws. If you uncomment that and run the server container, you should see the S3 Bucket block.

Note that I also updated the server container entrypoint; EXTRA_PIP_PACKAGES gets processed in the Prefect container's /opt/prefect/entrypoint.sh script, which wasn't being called before.

So if you want to retrofit EXTRA_PIP_PACKAGES into your existing docker-compose.yml, you will need to update the entrypoint as well.

I hope that helps, but feel free to post again if you run into any problems!

rpeden commented 1 year ago

@tomasrollo I should add that the reason you'd want to use S3 Bucket is that it accepts a MinIOCredentials block that lets you set the server URL.

I hadn't realized the Settings section was removed from the old S3 block. My apologies; I need to update the README.

I think I'll need to just install prefect-aws by default and update the MinIO instructions accordingly to show the new block. I'll try to do the README update this weekend.
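For anyone who would rather create the block from Python than from the UI, here is a minimal sketch of an S3 Bucket block that uses MinIO credentials to set the server URL. It assumes prefect-aws is installed, that your prefect-aws version exposes a `credentials` field on `S3Bucket`, and that MinIO is reachable at `http://minio:9000` from where the code runs. The bucket name, block name, and the `minioadmin` user/password are placeholders, and `save_minio_bucket_block` is a hypothetical helper name.

```python
# Sketch only: endpoint, bucket name, block name, and credentials are placeholders.
MINIO_ENDPOINT = "http://minio:9000"  # assumed: compose service name and API port


def save_minio_bucket_block(block_name: str = "minio-flows") -> None:
    """Create and save an S3Bucket block that talks to MinIO instead of AWS."""
    # Imported lazily so this file can be loaded where prefect-aws is not installed.
    from prefect_aws import MinIOCredentials, S3Bucket
    from prefect_aws.client_parameters import AwsClientParameters

    credentials = MinIOCredentials(
        minio_root_user="minioadmin",      # placeholder
        minio_root_password="minioadmin",  # placeholder
        # The endpoint URL is what points the S3 client at MinIO rather than AWS.
        aws_client_parameters=AwsClientParameters(endpoint_url=MINIO_ENDPOINT),
    )
    bucket = S3Bucket(bucket_name="prefect-flows", credentials=credentials)
    bucket.save(block_name, overwrite=True)
```

Calling `save_minio_bucket_block()` from the CLI container, with PREFECT_API_URL pointing at the server, should register the block so it shows up in the UI.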

MrChadMWood commented 1 year ago

Also, what's the correct way to specify more than one library in EXTRA_PIP_PACKAGES?

EDIT: I feel silly now. Well, for anyone like me, just separate the package names with spaces: - EXTRA_PIP_PACKAGES=prefect-gitlab prefect-aws

Docs: you can make use of the EXTRA_PIP_PACKAGES environment variable to install dependencies at runtime. If defined, pip install ${EXTRA_PIP_PACKAGES} is executed before the flow run starts.
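To make the space-separated format concrete, here is a rough Python illustration of the command that the docs sentence above implies. It is not the actual entrypoint script; `extra_pip_command` is a hypothetical helper, and the split on whitespace mimics how an unquoted shell variable expands.

```python
import os
import sys


def extra_pip_command(env=None):
    """Return the pip command implied by EXTRA_PIP_PACKAGES, or None if it is unset or empty."""
    env = os.environ if env is None else env
    # Whitespace splitting, like an unquoted ${EXTRA_PIP_PACKAGES} in a shell script.
    packages = env.get("EXTRA_PIP_PACKAGES", "").split()
    if not packages:
        return None
    return [sys.executable, "-m", "pip", "install", *packages]


# Two packages separated by a space become two separate pip arguments.
print(extra_pip_command({"EXTRA_PIP_PACKAGES": "prefect-gitlab prefect-aws"}))
```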

MrChadMWood commented 1 year ago

@rpeden

I can make prefect-gitlab available to the Prefect server and UI by using the EXTRA_PIP_PACKAGES option in the Docker Compose file, which lets me create the GitLab block from the UI.

However, when I try to run a flow stored in GitLab, the Prefect agent fails, I think because it does not have access to the prefect-gitlab dependency. I've also checked, and EXTRA_PIP_PACKAGES does not work when specified in the agent profile of the Docker Compose file.

ERROR: KeyError: "No class found for dispatch key 'gitlab-repository'"

What would be the correct way to proceed? I can't seem to find a way to get the agent access to the package it needs.

Edit: Seems Related: https://discourse.prefect.io/t/using-gitlab-block-as-flow-source-with-kubernetes-agent-on-prefect-2/2216/9

rpeden commented 1 year ago

@MrChadMWood the agent uses the same Docker image as everything else, so EXTRA_PIP_PACKAGES should work there, too. I'll give it a quick try and post an update if it works for me.

MrChadMWood commented 1 year ago

@rpeden Thanks. I found a thread in the community forum where it looks like someone experienced the same issue. Please have a look there if you can. There is some discussion and troubleshooting done already.

Also, I managed to get the full traceback in my case:

Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/engine.py", line 331, in retrieve_flow_then_begin_flow_run
    flow = await load_flow_from_flow_run(flow_run, client=client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/client/utilities.py", line 40, in with_injected_client
    return await fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/deployments.py", line 192, in load_flow_from_flow_run
    storage_block = Block._from_block_document(storage_document)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/blocks/core.py", line 618, in _from_block_document
    else cls.get_block_class_from_schema(block_document.block_schema)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/blocks/core.py", line 672, in get_block_class_from_schema
    return cls.get_block_class_from_key(block_schema_to_key(schema))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/blocks/core.py", line 683, in get_block_class_from_key
    return lookup_type(cls, key)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/dispatch.py", line 185, in lookup_type
    raise KeyError(
KeyError: "No class found for dispatch key 'gitlab-repository' in registry for type 'Block'."

rpeden commented 1 year ago

@MrChadMWood it was caused by the way I was overriding the entrypoint for the agent. I'll update it in the repo but this should make EXTRA_PIP_PACKAGES work:

  ## Prefect Agent
  agent:
    image: prefecthq/prefect:2.10.13-python3.11
    restart: always
    entrypoint: ["/opt/prefect/entrypoint.sh", "prefect", "agent", "start", "-q", "YOUR_WORK_QUEUE_NAME"]
    environment:
      - PREFECT_API_URL=http://server:4200/api
      # Use PREFECT_API_KEY if connecting the agent to Prefect Cloud
      - EXTRA_PIP_PACKAGES=prefect-gitlab
    profiles: ["agent"]

Adding "/opt/prefect/entrypoint.sh" to the entrypoint array makes it work since that's the script that installs the extra packages.
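As background on why a missing package shows up as a KeyError rather than an ImportError: block classes are looked up by a dispatch key, and a class is only known once the package that defines it has been imported. The sketch below is a simplified, hypothetical model of that idea, not Prefect's actual implementation. The `Block`, `lookup`, and `_REGISTRY` names here are illustrative stand-ins.

```python
# Simplified, hypothetical model of a class registry keyed by a dispatch key.
_REGISTRY: dict[str, type] = {}


class Block:
    dispatch_key = "block"

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Subclasses register themselves at the moment their module is imported.
        _REGISTRY[cls.dispatch_key] = cls


def lookup(key: str) -> type:
    """Return the class registered under key, or raise KeyError if none is."""
    try:
        return _REGISTRY[key]
    except KeyError:
        raise KeyError(f"No class found for dispatch key {key!r}") from None


# Before the defining package is imported, the key is unknown.
try:
    lookup("gitlab-repository")
except KeyError as exc:
    print(exc)


# Stand-in for what importing the integration package would do.
class GitLabRepository(Block):
    dispatch_key = "gitlab-repository"


print(lookup("gitlab-repository").__name__)
```

In this model, a container where the package was never installed can never run the class definition, so the lookup keeps failing no matter what the server knows about the block.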

MrChadMWood commented 1 year ago

@rpeden

I made that adjustment, but unfortunately it had no effect in my case.

  ## Prefect Agent
  agent:
    image: prefecthq/prefect:2.10.16-python3.11
    restart: always
    entrypoint: ["/opt/prefect/entrypoint.sh", "prefect", "agent", "start", "-q", "default"]
    environment:
      - PREFECT_API_URL=http://server:4200/api
      - EXTRA_PIP_PACKAGES:prefect-gitlab prefect-aws
    profiles: ["agent"]

Traceback:

Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/engine.py", line 331, in retrieve_flow_then_begin_flow_run
    flow = await load_flow_from_flow_run(flow_run, client=client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/client/utilities.py", line 40, in with_injected_client
    return await fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/deployments.py", line 192, in load_flow_from_flow_run
    storage_block = Block._from_block_document(storage_document)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/blocks/core.py", line 618, in _from_block_document
    else cls.get_block_class_from_schema(block_document.block_schema)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/blocks/core.py", line 672, in get_block_class_from_schema
    return cls.get_block_class_from_key(block_schema_to_key(schema))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/blocks/core.py", line 683, in get_block_class_from_key
    return lookup_type(cls, key)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/dispatch.py", line 185, in lookup_type
    raise KeyError(
KeyError: "No class found for dispatch key 'gitlab-repository' in registry for type 'Block'."

On a separate note:

In the meantime, I was looking into MinIO as a temporary solution instead of GitLab, using the s3-bucket block type. This fails when attempting to make a deployment (despite prefect-aws being listed in the EXTRA_PIP_PACKAGES field). The error is the same: KeyError: "No class found for dispatch key 's3-bucket' in registry for type 'Block'."

However, I can get past this error by logging into the CLI and running: pip install prefect-aws && prefect register block -m prefect_aws

Then I try building a deployment to MinIO and this is the new error: botocore.exceptions.NoCredentialsError: Unable to locate credentials

I can't seem to get any code storage working besides local storage.

MrChadMWood commented 1 year ago

I'll share my entire setup, hopefully it helps.

OS: CentOS7 in VMBox

Docker Compose:

version: "3.9"
services:

  ### Prefect Database
  database:
    image: postgres:15.2-alpine
    restart: always
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=prefect
    expose:
      - 5432
    volumes: 
      - db:/var/lib/postgresql/data
    profiles: ["server"]

  ### MinIO for flow storage
  minio:
    image: minio/minio:latest
    entrypoint: ["minio", "server", "--address", "0.0.0.0:9000", "--console-address", "0.0.0.0:9001", "/data"]
    volumes:
      - "minio:/data"
    ports:
      - 9000:9000
      - 9001:9001
    profiles: ["minio"]

  ### Prefect Server API and UI
  server:
    image: prefecthq/prefect:2.10.16-python3.11
    restart: always
    volumes:
      - prefect:/root/.prefect
    entrypoint: ["/opt/prefect/entrypoint.sh", "prefect", "server", "start"]
    environment:
      - PREFECT_UI_URL=http://HOSTNAME:4200/api
      - PREFECT_API_URL=http://HOSTNAME:4200/api

      - PREFECT_SERVER_API_HOST=0.0.0.0
      - PREFECT_API_DATABASE_CONNECTION_URL=postgresql+asyncpg://postgres:postgres@database:5432/prefect
      - EXTRA_PIP_PACKAGES:prefect-gitlab prefect-aws
    ports:
      - 4200:4200
    depends_on:
      - database
    profiles: ["server"]

  ## Prefect Agent
  agent:
    image: prefecthq/prefect:2.10.16-python3.11
    restart: always
    entrypoint: ["/opt/prefect/entrypoint.sh", "prefect", "agent", "start", "-q", "default"]
    environment:
      - PREFECT_API_URL=http://server:4200/api
      - EXTRA_PIP_PACKAGES:prefect-gitlab prefect-aws
    profiles: ["agent"]

  ### Prefect CLI
  cli:
    image: prefecthq/prefect:2.10.16-python3.11
    entrypoint: "bash"
    working_dir: "/root/flows"
    volumes:
      - "./flows:/root/flows"
    environment:
      - PREFECT_API_URL=http://server:4200/api
    profiles: ["cli"]

volumes:
  prefect:
  db:
  minio:
networks:
  default:
    name: prefect-network

(swapping HOSTNAME with our server domain)

rpeden commented 1 year ago

@MrChadMWood, it looks like your EXTRA_PIP_PACKAGES entries use a colon instead of =. Does it work if you change

- EXTRA_PIP_PACKAGES:prefect-gitlab prefect-aws

to

- EXTRA_PIP_PACKAGES=prefect-gitlab prefect-aws

That's what I'm using, and it is working for me. I recommend changing it for both the server and agent containers to see if it makes a difference.
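If it helps to spot this in a larger file, here is a small sketch that flags list-style environment entries written with a colon and no equals sign. My understanding is that Compose treats a list entry without `=` as a bare variable name, so the value never gets set. `suspicious_env_entries` is a hypothetical helper, and the entries are copied from the compose file above.

```python
def suspicious_env_entries(entries):
    """Return list-style environment entries that have no '=' but do contain ':'."""
    return [entry for entry in entries if "=" not in entry and ":" in entry]


environment = [
    "PREFECT_API_URL=http://server:4200/api",
    "EXTRA_PIP_PACKAGES:prefect-gitlab prefect-aws",
]

for entry in suspicious_env_entries(environment):
    name, _, value = entry.partition(":")
    print(f"{entry!r} looks like it should be {name}={value}")
```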

MrChadMWood commented 1 year ago

@rpeden

That worked! Thanks for the help.

I'm having some trouble with access to GitLab now. Not sure if it's some kind of networking issue just yet. One thing I noticed though, and maybe you can recognize this, is that it isn't displaying the server name in the logs where it says "Downloading flow code...":

Downloading flow code from storage at ''

Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/engine.py", line 331, in retrieve_flow_then_begin_flow_run
    flow = await load_flow_from_flow_run(flow_run, client=client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/client/utilities.py", line 40, in with_injected_client
    return await fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/deployments.py", line 198, in load_flow_from_flow_run
    await storage_block.get_directory(from_path=deployment.path, local_path=".")
  File "/usr/local/lib/python3.11/site-packages/prefect_gitlab/repositories.py", line 180, in get_directory
    raise OSError(f"Failed to pull from remote:\n {err_stream.read()}")
OSError: Failed to pull from remote:
 Cloning into '/tmp/tmpn7tjlw71prefect'...
fatal: unable to access 'https://gitlab.private.org/chad.wood/reporting-automations/': Failed to connect to gitlab.private.org port 443: Connection timed out

Would you happen to have any idea what's going on?

Edit: I don't think it's a connection issue. I was able to ping the server from the CLI.

I'll keep investigating and post back when I find something.
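One caveat about the ping test: ping uses ICMP, while the error above is a TCP timeout on port 443, so a successful ping does not rule out a blocked or filtered port. A rough way to check the port itself using only the Python standard library is sketched below. `can_connect` is a hypothetical helper name.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

Running `can_connect("gitlab.private.org", 443)` inside the agent container would show whether HTTPS traffic can reach the GitLab host from there.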

MrChadMWood commented 1 year ago

Regarding that last message, the issue clears up when I run on a different machine. I'm not sure what was going on, but I think it was either networking or config related. Thanks a lot for your help.

Edit: For the sake of accuracy, I did not just run on a different machine. I also created the block as described in this thread. Not sure if that's really what made a difference though. Previously, I was creating the GitLab Credentials block and the GitLab Repository block separately via the UI, and referencing the Credentials block from within the Repository block.

rpeden commented 1 year ago

It looks like GitLab was close to working. I'm not sure why it was timing out, but it might be that the git clone call that the GitLab Repository block runs isn't compatible with self-hosted GitLab instances (which it looks like yours is).

Are you able to clone the repo inside a Prefect CLI container if you run something like git clone https://oauth2:<access token>@gitlab.mydomain.com/user/repository.git? The oauth2:<token>@ part before the domain name is what might be breaking things. I know it works for repositories hosted on gitlab.com.
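To avoid typos when assembling that URL by hand, here is a small sketch that inserts the oauth2:<token>@ prefix into an HTTPS repository URL. `with_oauth2_token` is a hypothetical helper, it assumes the input URL does not already contain credentials, and it only mirrors the URL format described above rather than what the GitLab Repository block necessarily does internally.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_oauth2_token(repo_url: str, token: str) -> str:
    """Return repo_url with 'oauth2:<token>@' inserted before the host."""
    parts = urlsplit(repo_url)
    # Percent-encode the token so characters such as '@' or '/' cannot break the URL.
    netloc = f"oauth2:{quote(token, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(with_oauth2_token("https://gitlab.mydomain.com/user/repository.git", "TOKEN"))
# prints https://oauth2:TOKEN@gitlab.mydomain.com/user/repository.git
```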

Apologies for any slow replies here; I don't work at Prefect anymore, so I don't have as much time to work on this repository as I used to. But I still want to make sure it works as well as possible for everyone 😄

MrChadMWood commented 1 year ago

Just checked, cloning works! And I'm glad to have any help at all :), so don't stress about the response times.
Edit: Cloning did not work. It was hanging and I did not notice. Sorry for the misleading update.

I think the issue had to do with the VM I was using. Running on another machine now, and it works. The only thing I did differently was create the block in the CLI like so:

python

>>> from prefect_gitlab.credentials import GitLabCredentials
>>> from prefect_gitlab.repositories import GitLabRepository
>>>
>>> gitlab_repo = GitLabRepository(
...     repository="https://gitlab.com/annageller/prefect.git",
...     reference="main",
...     credentials=GitLabCredentials(token="GITLAB_ACCESS_TOKEN"),
... )
>>> gitlab_repo.save("default", overwrite=True)


In prior failed attempts, the credentials and repository were separate blocks, both of which were created from the UI. Again, not sure if this would make any difference though.