leap-stc / cmip6-leap-feedstock

Random vs 'deterministic' data_node selection #160

Open jbusecke opened 6 months ago

jbusecke commented 6 months ago

In the new async client I chose not to select the data_node (when there are several options) from a list of preferred nodes, but to just take the first complete one.

My thinking behind this was that it might be good to randomize the sources, in case there is something wrong with a particular one of the preferred nodes in combination with a certain dataset. I still think that is a good choice overall, but what I noticed while running deployments for #72 is that (to no surprise) it redownloads all files (and in this case there are a LOT).
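
To illustrate why a different data node triggers a redownload, here is a minimal sketch. It assumes the file cache keys entries by the full source URL, which is an assumption about the caching layer and not a confirmed detail. The hostnames, the file path, and the `cache_key` helper are all hypothetical.

```python
import hashlib


def cache_key(url: str) -> str:
    """Hypothetical cache key: a hash of the full URL, hostname included."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# The same file served by two different (made-up) data nodes.
path = "/thredds/fileServer/CMIP6/example/so_Omon_example.nc"
url_a = "https://node-a.example.org" + path
url_b = "https://node-b.example.org" + path

# Different hostnames give different keys, so a file cached from node A
# would not be found when a later run resolves the same file to node B.
print(cache_key(url_a) == cache_key(url_b))  # False
```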

So this somewhat negates the advantage of a file cache. I think that https://github.com/pangeo-forge/pangeo-forge-recipes/issues/713 will ultimately help with this while keeping the benefit of not always using the same data node, but for now I am thinking of re-implementing the node sorting?
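
A minimal sketch of what re-implemented node sorting could look like: rank the candidate nodes against a preference list and fall back to a stable choice among the rest. The function name `pick_data_node`, the `PREFERRED_NODES` list, and the hostnames are hypothetical placeholders, not the client's actual API. The input is assumed to be already filtered to nodes that hold a complete set of files.

```python
from typing import Sequence

# Hypothetical preference order; the hostnames are placeholders.
PREFERRED_NODES = ["node-a.example.org", "node-b.example.org"]


def pick_data_node(
    available: Sequence[str],
    preferred: Sequence[str] = PREFERRED_NODES,
) -> str:
    """Return the most preferred available node, else the alphabetically first one."""
    if not available:
        raise ValueError("no data nodes available")
    rank = {node: i for i, node in enumerate(preferred)}
    # Compare by (preference rank, name) so the result is identical on every
    # run, regardless of the order in which the nodes were returned.
    return min(available, key=lambda n: (rank.get(n, len(rank)), n))


print(pick_data_node(["node-c.example.org", "node-b.example.org"]))  # node-b.example.org
print(pick_data_node(["node-z.example.org", "node-c.example.org"]))  # node-c.example.org
```

Because the choice depends only on the set of available nodes, repeated runs resolve to the same URLs as long as the chosen node stays available, at the cost of always hitting the same node first.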

Let's see how https://console.cloud.google.com/dataflow/jobs/us-central1/2024-05-11_06_56_55-9660281334429566451;step=Creating%20CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514%7COpenURLWithFSSpec%7COpenWithXarray%7CPreprocessor%7CStoreToZarr%7CInjectAttrs%7CConsolidateDimensionCoordinates%7CConsolidateMetadata%7CCopy%7CLogging%20to%20bigquery%20%28non-QC%29%7CTestDataset%7CLogging%20to%20bigquery%20%28QC%29;graphView=0?project=leap-pangeo&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))&authuser=1 goes.