Open derekocallaghan opened 2 years ago
Thanks for the useful question Derek!
Once https://github.com/pangeo-forge/pangeo-forge-recipes/issues/376 is complete, it will be possible to specify arbitrary numbers of concat and merge dims.
In the meantime, the suggested workaround is to create separate recipes for the second merge dim.
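As a rough illustration of that workaround, here is a stdlib-only sketch (it does not use pangeo-forge-recipes itself). It assumes the two merge dimensions are satellite and pass, with key names `A`/`B`/`C` and `ASC`/`DES` taken loosely from the issue description. The helper `split_by_second_merge_dim` is hypothetical. It groups the key combinations so that each group could back one recipe with a single merge dimension.

```python
from itertools import product
from typing import Dict, List, Tuple

# Assumed key names, based on the issue text (Metop A/B/C, ASC/DES passes).
SATELLITES = ["A", "B", "C"]
PASSES = ["ASC", "DES"]


def split_by_second_merge_dim(
    first_keys: List[str], second_keys: List[str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group every (first, second) key pair by its second key.

    Each group corresponds to one separate recipe that only needs a
    single merge dimension (the first one).
    """
    plans: Dict[str, List[Tuple[str, str]]] = {second: [] for second in second_keys}
    for first, second in product(first_keys, second_keys):
        plans[second].append((first, second))
    return plans


plans = split_by_second_merge_dim(SATELLITES, PASSES)
for pass_key, members in plans.items():
    # One recipe per pass, each merging over the satellites only.
    print(pass_key, members)
```

With these assumed keys, this yields two groups (one per pass) of three satellite entries each, i.e. two recipes instead of one recipe with two merge dimensions.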
Thanks Ryan, I'll do that and also watch out for any updates to https://github.com/pangeo-forge/pangeo-forge-recipes/issues/376
@derekocallaghan were you ever able to confirm that this does work in 0.10.0? If so, perhaps it will make a good tutorial.
Hey @cisaacstern, I tried 0.10.0 with multiple merge dimensions. It breaks the pipeline.
@dataclass
class DetermineSchema(beam.PTransform):
    """Combine many Datasets into a single schema along multiple dimensions.
    This is a reduction that produces a singleton PCollection.

    :param combine_dims: The dimensions to combine
    """

    combine_dims: List[Dimension]

    def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
        schemas = pcoll | beam.Map(_add_keys(dataset_to_schema))
        cdims = self.combine_dims.copy()
        while len(cdims) > 0:
            last_dim = cdims.pop()
            if len(cdims) == 0:
                # at this point, we should have a 1D index as our key
                schemas = schemas | beam.CombineGlobally(CombineXarraySchemas(last_dim))
            else:
                schemas = (
                    schemas
                    | _NestDim(last_dim)
                    | beam.CombinePerKey(CombineXarraySchemas(last_dim))
                )
        return schemas
If we run the above snippet with 3 combine dimensions (1 Concat and 2 Merge in my case), it fails with the error below.
RuntimeError: A transform with label "DetermineSchema/_NestDim" already exists in the pipeline. To apply a transform with a specified label write pvalue | "label" >> transform
@DarshanSP19, thanks for this very helpful report! I think the following might fix this, as the reported error appears to be a case in which Beam doesn't know how to generate a unique name for the unlabeled _NestDim stage:
             if len(cdims) == 0:
                 # at this point, we should have a 1D index as our key
                 schemas = schemas | beam.CombineGlobally(CombineXarraySchemas(last_dim))
             else:
                 schemas = (
                     schemas
-                    | _NestDim(last_dim)
-                    | beam.CombinePerKey(CombineXarraySchemas(last_dim))
+                    | f"Nest {last_dim.name}" >> _NestDim(last_dim)
+                    | f"Combine {last_dim.name}" >> beam.CombinePerKey(CombineXarraySchemas(last_dim))
                 )
Could I entice you to try that fix, and if it works, submit it as a PR?
Edit: Added a label to the combine stage as well, as my guess is it will have the same issue once the _NestDim labeling is resolved. Note also that another problem may surface once we're past this error, but this looks like the right place to start.
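To make the reasoning concrete, here is a small stdlib-only simulation. It is not Beam itself. `LabelRegistry` is a hypothetical stand-in for the pipeline's unique-label check, and the dimension names `time`, `satellite`, and `pass` are assumptions for illustration. The loop mirrors the one in `DetermineSchema.expand`. With three dimensions the `else` branch runs twice, so the default `_NestDim` label is applied twice unless each stage gets its own label.

```python
from typing import List, Optional, Set


class LabelRegistry:
    """Toy stand-in for a pipeline's check that stage labels are unique."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()

    def apply(self, default_label: str, label: Optional[str] = None) -> str:
        full = label or default_label
        if full in self.seen:
            raise RuntimeError(f'A transform with label "{full}" already exists')
        self.seen.add(full)
        return full


def stage_labels(dim_names: List[str], labeled: bool) -> List[str]:
    """Mirror the DetermineSchema loop and return the labels it would apply."""
    registry = LabelRegistry()
    applied: List[str] = []
    dims = list(dim_names)
    while dims:
        last = dims.pop()
        if not dims:
            # Final dimension: a single global combine, applied only once.
            applied.append(registry.apply("CombineGlobally"))
        else:
            nest = f"Nest {last}" if labeled else None
            combine = f"Combine {last}" if labeled else None
            applied.append(registry.apply("_NestDim", nest))
            applied.append(registry.apply("CombinePerKey", combine))
    return applied


dims = ["time", "satellite", "pass"]  # assumed names: 1 concat + 2 merge dims
try:
    stage_labels(dims, labeled=False)
except RuntimeError as err:
    print("unlabeled:", err)
print("labeled:", stage_labels(dims, labeled=True))
```

In this simulation, two dimensions pass even without labels because the `else` branch runs only once, which would be consistent with the error surfacing only when a second merge dimension is added.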
Sorry @cisaacstern, I didn't notice your question until today. It's been a while since I looked at the original recipe that required multiple MergeDims. IIRC, I had a workaround that was probably a better approach. I wanted to port this recipe to Beam, so if it's still relevant I'll try your suggestion above, provided a PR hasn't been created in the meantime.
Hey @cisaacstern, it worked for me. Happy to do a PR.
Hi,
I'm looking at creating a recipe for CMEMS ASCAT wind data (following the ftp approach in https://github.com/pangeo-forge/staged-recipes/pull/163). ASCAT products are available for all combinations of Metop A/B (REP) and Metop A/B/C (NRT) satellites and ASCending, DEScending passes. I've started with the NRT products, where I was hoping to use one MergeDim each for satellite and pass options, something like:
However, when creating the recipe, I get the following error in the sandbox:
I wanted to check whether the restriction to a single MergeDim is fixed for a particular reason, or can be relaxed to accommodate this scenario?
Thanks, Derek