pangeo-data / ml-workflow-examples

Simple examples of data pipelines from xarray to ML training
Apache License 2.0

Long-term goal: rogue workshop #15

Open nbren12 opened 4 years ago

nbren12 commented 4 years ago

In today's meeting, @jhamman had the great idea of organizing an independent workshop on pangeo + ML towards the end of the year.

I think this is a great opportunity to focus our thoughts into a coherent story, and recommend some potent infrastructure/know-how combinations for ML research in the geosciences.

Other workshops have mostly focused on clean ML datasets, but this workshop could focus on producing them. We said that something like "Constructing ML Pipelines" would be a natural title.

My $0.02 is that the best practice will depend on the organizational/team context. For example, I sometimes dream at night about using a filesystem like GLADE, but my team doesn't have access to that kind of machine. While it would be great to emphasize a common toolkit, I think we should point out divergence points and make strong suggestions.

As a start, it would be great to gather some brief impressions about the ML pipelines this group is building. I'm including my own answers as a guide below:

raspstephan commented 4 years ago

This sounds like a great idea. Particularly focusing on the data pipeline up to the ML model. One small comment: I think also catering to less high-performance audiences would probably be helpful to a lot of people. My workflow, and I imagine that of many other people, is currently all on a local server, far away from the pipeline @nbren12 is building. I personally would find it helpful to have sessions on deciding whether it makes sense to port my workflow to the cloud and how to get started (for cloud-noobs like me). It might also be interesting to talk about the workflow from model/observation netCDF to keras/pytorch dataloader.
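For concreteness, the netCDF-to-PyTorch end of that could look something like the minimal sketch below; the file path and variable names are made up, not my actual setup:

```python
# Minimal sketch of going from a netCDF file to a PyTorch DataLoader.
# The file path and the variable names ("z500", "t2m") are hypothetical.
import torch
import xarray as xr
from torch.utils.data import DataLoader, Dataset


class XarrayTimeSliceDataset(Dataset):
    """Serve one time slice per sample from a netCDF file opened with xarray."""

    def __init__(self, path, input_var, target_var):
        self.ds = xr.open_dataset(path)  # lazy open; slices are read on access
        self.input_var = input_var
        self.target_var = target_var

    def __len__(self):
        return self.ds.sizes["time"]

    def __getitem__(self, i):
        sample = self.ds.isel(time=i)
        x = torch.as_tensor(sample[self.input_var].values, dtype=torch.float32)
        y = torch.as_tensor(sample[self.target_var].values, dtype=torch.float32)
        return x, y


dataset = XarrayTimeSliceDataset("era5_subset.nc", "z500", "t2m")
# num_workers=0 keeps all netCDF reads in the main process, which avoids
# HDF5/netCDF multiprocessing headaches at the cost of slower loading.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)
```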

Here is a rough outline of my current workflow:

  1. Download data from public servers (ERA, TIGGE, CMIP) in netCDF onto a local server
  2. Regrid the data (steps 1 and 2 are driven by snakemake, with xesmf for the regridding; roughly sketched after this list) --> total volume is hundreds of GB to a few TB, depending on resolution
  3. (Optional) Convert the data to TFRecord with pre-shuffling to avoid CPU RAM limitations (reading raw netCDFs from disk is too slow!)
  4. Load netCDFs (using xarray) or TFRecords with a Keras data loader (whose complexity got totally out of hand...)
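As a rough illustration of steps 2 and 3 (not my actual code), here is a sketch; the paths, variable names, and 1-degree target grid are placeholders:

```python
# Rough sketch of steps 2-3: regrid with xesmf, then write pre-shuffled
# samples to a TFRecord file. Paths, variable names, and the 1-degree
# target grid are placeholders.
import numpy as np
import tensorflow as tf
import xarray as xr
import xesmf as xe

# Step 2: open the downloaded files lazily and regrid to a common global grid
ds_in = xr.open_mfdataset("raw/era5_*.nc", combine="by_coords")
target_grid = xe.util.grid_global(1.0, 1.0)              # 1-degree lat/lon grid
regridder = xe.Regridder(ds_in, target_grid, "bilinear")
ds = regridder(ds_in[["t2m", "z500"]])

# Step 3: shuffle the time indices up front so training does not need a huge
# in-memory shuffle buffer, then serialize one time slice per record.
def serialize_sample(sample):
    feature = {
        name: tf.train.Feature(
            bytes_list=tf.train.BytesList(
                value=[tf.io.serialize_tensor(sample[name].values).numpy()]
            )
        )
        for name in sample.data_vars
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

indices = np.random.permutation(ds.sizes["time"])
with tf.io.TFRecordWriter("train_shuffled.tfrecord") as writer:
    for i in indices:
        writer.write(serialize_sample(ds.isel(time=int(i)).load()))
```
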
nbren12 commented 4 years ago

Thanks for sharing! Indeed, I think this model is probably optimal for a single researcher and can easily be replicated with a single instance on the cloud. This is what I did at UW, and it's why I said this:

My $0.02 is that the best practice will depend on the organizational/team context

However, I would disagree that automated infrastructure is only for "high-performance audiences". The single-server model scales poorly for groups larger than one person, and a more automated approach to infrastructure results in more reproducible research IMO. Here is a slide I recently made on the subject:

[Slide: spectrum of infrastructure from a local laptop/VM to Docker to Kubernetes/CI, with the reproducibility trade-offs of each]

This becomes even more important for complicated ML pipelines, so I think it's important to communicate these trade-offs.

raspstephan commented 4 years ago

Well, that's great. The workshop could then be a perfect opportunity to teach plebs like myself, who are scared of the cloud, how to scale up!

nbren12 commented 4 years ago

plebs like myself

pfssh...not sure how many plebs can use tfrecords...

jbednar commented 4 years ago

Here is a slide I recently made on the subject:

That looks good, though I'd split the first stage, "Local laptop/VM", into two varieties: (a) an unreproducible environment and (b) a locally reproducible environment. The two cases differ in whether results were run on whatever conda or pip packages happened to be installed, which is itself the result of some complex and unknown history of installations over time (a), or whether the environment has been captured in a pinned, reproducible, and hopefully minimal way (b). I think most results come from an environment of type (a), and I think the biggest increase in reproducibility comes from going from (a) to (b), because the conda or pip dependencies are generally the most specific to data science, the most quickly changing, and the most likely to affect the results, compared to all the other libraries on the system. After going from (a) to (b), moving to Docker or Kubernetes/CI achieves further reproducibility, but it's not as big a jump as simply pinning to make an environment reproducible locally...
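To make (b) concrete, a project of that type might capture its environment in a pinned environment.yml along the following lines; the package list and versions here are just placeholders, not a recommendation:

```yaml
# Hypothetical pinned environment file for a type (b) project.
# The packages and versions below are placeholders.
name: ml-pipeline
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - xarray=0.16.0
  - dask=2.20.0
  - netcdf4=1.5.4
  - xesmf=0.3.0
  - tensorflow=2.3.0
  - pip=20.2
  - pip:
      - some-project-specific-package==0.1.0
```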

nbren12 commented 4 years ago

@jbednar I agree. (a) to (b) is a quantum leap. Unfortunately, it's hard to verify whether someone else's software project is of type (a) or (b) without CI. I'm not sure if CI is something vital to ML pipeline development, though. Maybe 50-50 reproducibility is close enough...

jbednar commented 4 years ago

Using CI to force an escape from Schrödinger's reproducibility! :-) In practice we too use CI to ensure that we're in case (b) and not case (a) (see examples.pyviz.org), but it's at least possible to do the same by just passing the project to another colleague...
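For illustration, a minimal CI configuration along these lines just rebuilds the pinned environment from scratch and runs the tests on every push; this is a hypothetical .travis.yml, not the actual examples.pyviz.org setup:

```yaml
# Hypothetical .travis.yml: recreate the pinned environment from
# environment.yml on every push and run the tests, which keeps the
# project verifiably in case (b).
language: minimal
install:
  - wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
  - bash miniconda.sh -b -p "$HOME/miniconda"
  - source "$HOME/miniconda/etc/profile.d/conda.sh"
  - conda env create -f environment.yml
  - conda activate ml-pipeline
script:
  - pytest tests/
```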

nbren12 commented 4 years ago

at least possible to do the same by just passing it to another colleague

Sometimes it's easier to be friends with Travis, haha!

jsadler2 commented 4 years ago

I like the idea of this workshop and I think it would be useful - especially since so much of our time is spent on the data prep steps compared to the actual ML modeling.

Here are my answers to your questions:

nbren12 commented 4 years ago

Thanks for sharing, Jeff.

This is unrelated, but I've been thinking a lot lately about the concept of MLOps. IMO, the devops world is pretty far ahead of the scientific community when it comes to reproducibility, since a lack of reproducibility has much higher consequences in the commercial world. I wonder if any of it translates to the academic context?

Edit: corrected name

djgagne commented 4 years ago

I want to follow up on a short discussion we had at today's Pangeo ML group meeting. There is still interest in the workshop, but no one has volunteered to take the lead on organizing the event, likely because of the time commitment involved. We also discussed the scope of the workshop, which could be very wide-ranging but would ideally focus on some essentials. Three questions for the group:

  1. Who has an interest in organizing this workshop? Alternatively, do you know anyone who might be interested if presented with the opportunity? They would mainly need to handle logistics, send out reminders, and ask people in this group for guidance/talks/etc. as needed.

  2. What are the most common questions people have been getting about ML infrastructure building with Python/pangeo?

  3. What are the biggest headaches/time wasters people keep running into in their pipelines?

I hope the answers to these questions help us focus priorities for the workshop. December is not going to be a realistic date at this point, but spring may be a promising time, especially if the workshop is 1/2 days and virtual.

zhonghua-zheng commented 4 years ago

Hi @djgagne, although I don't have much experience organizing workshops, I am interested in helping out! Please feel free to let me know if there is anything I can contribute.