pangeo-forge / pangeo-forge-runner

Run pangeo-forge recipes on Apache Beam
https://pangeo-forge-runner.readthedocs.io
Apache License 2.0

Log fetching #20

Open cisaacstern opened 2 years ago

cisaacstern commented 2 years ago

Reposting @sharkinsspatial's comment from https://github.com/pangeo-forge/registrar/issues/53#issuecomment-1232274752 here, since I believe this will be the best place to discuss this topic going forward:

@cisaacstern I had put together an initial POC with option 2 from above, but based on our discussions the other day I'm going to take a slightly different track that is more aligned with the pangeo-forge-runner approach. Next steps include:

  1. A PR for pangeo-forge-runner that extends https://github.com/yuvipanda/pangeo-forge-runner/blob/main/pangeo_forge_runner/bakery/dataflow.py to allow requests against a bakery with a job_id and a logs filter pattern, returning a log object (see the sketch after this list).
  2. Extend pangeo-forge-orchestrator with an endpoint which wraps this call.
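To make the proposed interface concrete, here is a hypothetical usage sketch. Neither `get_logs` nor the `job_id`/`log_filter` parameters exist in pangeo-forge-runner today; all names and the example job id are assumptions for illustration only.

```python
# Hypothetical sketch of the interface the proposed PR might add to
# pangeo-forge-runner; `get_logs` and `log_filter` do not exist today.
from pangeo_forge_runner.bakery.dataflow import DataflowBakery

bakery = DataflowBakery()

# Fetch log entries for a submitted job, optionally narrowed by a filter pattern
entries = bakery.get_logs(
    job_id="2022-09-01_00_00_00-1234567890123456789",  # example Dataflow job id format
    log_filter="severity>=ERROR",
)
for entry in entries:
    print(entry["timestamp"], entry.get("textPayload", ""))
```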

A few concerns I have with this approach:

  1. The potentially large response body, depending on the filter pattern applied.
  2. The potential for logs to leak secrets and the lack of security around the pangeo-forge-orchestrator endpoints.

@yuvipanda Does this approach seem structurally ok from the pangeo-forge-runner perspective or should I build out some alternative classes rather than including this directly in the bakery dataflow subclass?

cisaacstern commented 2 years ago

IMHO this path makes a lot of sense, and the questions above are good ones to frame implementation.

To state one perhaps obvious idea, design-wise it makes sense to me to add a placeholder Bakery.get_logs method to the base bakery, as is done for:

https://github.com/yuvipanda/pangeo-forge-runner/blob/d4ba5045538ac53a7457049acde0d2350f4af37b/pangeo_forge_runner/bakery/base.py#L25

and then have each bakery type override this method with its own implementation. For DataflowBakery, there's already an assumption (though not a requirement) that the gcloud CLI will exist (and be authenticated) in the environment, so perhaps fetching logs via subprocess calls to gcloud would be a good default pathway for DataflowBakery.get_logs?
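A minimal sketch of that design, assuming the gcloud CLI is installed and authenticated, and that the job's logs are queryable via Cloud Logging with a `resource.type="dataflow_step"` filter. The class bodies below are simplified stand-ins for the actual pangeo-forge-runner classes, not proposed final code:

```python
# Simplified sketch: a placeholder get_logs on the base bakery, overridden by
# DataflowBakery with a subprocess call to `gcloud logging read`.
import json
import subprocess
from typing import Optional


class Bakery:
    """Stand-in for pangeo_forge_runner.bakery.base.Bakery (simplified)."""

    def get_logs(self, job_id: str, log_filter: Optional[str] = None) -> list:
        raise NotImplementedError("Each bakery type should override get_logs")


class DataflowBakery(Bakery):
    """Stand-in for pangeo_forge_runner.bakery.dataflow.DataflowBakery (simplified)."""

    def get_logs(self, job_id: str, log_filter: Optional[str] = None) -> list:
        # Build a Cloud Logging filter scoped to this Dataflow job
        filter_parts = [
            'resource.type="dataflow_step"',
            f'resource.labels.job_id="{job_id}"',
        ]
        if log_filter:
            filter_parts.append(log_filter)
        cmd = [
            "gcloud",
            "logging",
            "read",
            " AND ".join(filter_parts),
            "--format=json",
            "--limit=1000",  # guard against very large responses
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(result.stdout)
```

Keeping the gcloud invocation behind the bakery's own `get_logs` would also let other bakery types (e.g. Flink-based ones) supply their own backend-specific log retrieval later.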

cisaacstern commented 2 years ago

> lack of security around the pangeo-forge-orchestrator endpoints

We do have the ability to lock certain endpoints with authentication headers (and already do so for create/delete/update operations). Certainly we could do so for these endpoints as well (or a subset of them), and let https://pangeo-forge.org authenticate via a Next.js server-side call. That being said, this approach doesn't really change the fact that the logs will eventually be publicly queryable (whether the user is querying directly, or with https://pangeo-forge.org as an intermediary).
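For illustration, a generic sketch of gating a logs endpoint behind an auth header, assuming a FastAPI-style app like pangeo-forge-orchestrator; the route path, header name, and token source are assumptions, not the orchestrator's actual API, and this gating does not address the publicly-queryable concern above:

```python
# Generic sketch of an auth-header-protected logs endpoint (illustrative only).
import os
from typing import Optional

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
API_TOKEN = os.environ.get("ORCHESTRATOR_API_TOKEN", "")


def require_token(x_api_key: str = Header(...)) -> None:
    # Reject requests that don't present the expected secret in the X-Api-Key header
    if not API_TOKEN or x_api_key != API_TOKEN:
        raise HTTPException(status_code=403, detail="Not authorized")


@app.get("/recipe_runs/{job_id}/logs", dependencies=[Depends(require_token)])
def get_job_logs(job_id: str, log_filter: Optional[str] = None):
    # A real implementation would call something like
    # DataflowBakery.get_logs(job_id, log_filter) (see the sketches above);
    # a placeholder response keeps this example self-contained.
    return {"job_id": job_id, "filter": log_filter, "logs": []}
```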

cisaacstern commented 1 year ago

Just made a few notes on this subject in https://github.com/pangeo-forge/pangeo-forge-orchestrator/pull/150#issuecomment-1269253104. Specifically, the fetch_logs executable included in that PR (adapted from some code @yuvipanda shared) would be a great basis for this feature.