Open abarciauskas-bgse opened 1 year ago
I was doing it wrong: I forgot that I needed to forward the `storage_options` along to `remote_options` when creating a reference fs.
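For anyone hitting the same thing, here is a minimal sketch of what forwarding the options can look like. It assumes the references are opened through fsspec's `reference` filesystem, which accepts `fo`, `remote_protocol` and `remote_options`. The helper name `reference_fs_kwargs`, the file name `reference.json` and the `anon` option are illustrative assumptions, not taken from the recipe.

```python
def reference_fs_kwargs(reference_path, storage_options, remote_protocol="s3"):
    """Build keyword arguments for fsspec.filesystem("reference", ...).

    Hypothetical helper: the point is only that the credentials used to read
    the source files must also be passed as ``remote_options``.
    """
    return {
        # Location of the kerchunk reference file (or directory).
        "fo": reference_path,
        # Protocol of the files that the references point to.
        "remote_protocol": remote_protocol,
        # Copy so later changes to storage_options do not leak in.
        "remote_options": dict(storage_options),
    }


# Assumed example values; real credentials would go here instead.
storage_options = {"anon": False}
kwargs = reference_fs_kwargs("reference.json", storage_options)

# With fsspec installed, the filesystem would then be created with:
#   import fsspec
#   fs = fsspec.filesystem("reference", **kwargs)
```

Without `remote_options`, the reference filesystem would try to read the remote byte ranges with default (unauthenticated) settings, which is presumably what was failing here.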
I was able to run this recipe locally with the scripts/changes/tests outlined in this gist: https://gist.github.com/abarciauskas-bgse/cff4c77ea841601773d087bc6b45122b
A few questions / next steps:
> I was able to run this recipe locally with the scripts/changes/tests outlined in this gist: https://gist.github.com/abarciauskas-bgse/cff4c77ea841601773d087bc6b45122b

Nice 🥳 Can you run it locally in full, without `--prune`? Have you tried that? That's probably the next step and should precede the ones below. But I imagine we can make a new PR with the changes from the gist.
> A few questions / next steps:
> * [ ] I see that a number of transforms are defined in https://github.com/pangeo-forge/staged-recipes/blob/master/recipes/mursst/recipe.py. Should those be added to pangeo-forge-recipes transforms?

I think all of these custom ones have already been merged, or are about to be, by Raphael. Let me review what needs to change and comment/suggest on your new PR when you have it up.
> * [ ] I think we need to convert to using a file pattern rather than relying on GranuleQuery, although I'm not sure it's absolutely necessary - @ranchodeluxe have you experienced failures due to using GranuleQuery rather than a file pattern?

I think forcing the lookup to happen on the workers instead of the host is ideal, yes.
> * [ ] Run on https://github.com/NASA-IMPACT/veda-pforge-job-runner, perhaps replacing or updating [Failing: MUR SST NASA-IMPACT/veda-pforge-job-runner#15](https://github.com/NASA-IMPACT/veda-pforge-job-runner/issues/15) and the linked code

Yeah, let me run this later today with `prune=True` and `prune=False` to see what happens.
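On the file pattern question, a rough sketch of what a date-driven pattern could look like, so that no GranuleQuery lookup is needed on the host. The bucket prefix and file naming below are assumptions about the MUR SST archive and should be checked against the real listing; `make_url` and `daily_dates` are hypothetical names.

```python
from datetime import date, timedelta

# Assumed archive location and file naming; verify before use.
BASE_URL = "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1"
SUFFIX = "090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"


def make_url(time: date) -> str:
    """Map one day to the URL of its (assumed) daily granule."""
    return f"{BASE_URL}/{time:%Y%m%d}{SUFFIX}"


def daily_dates(start: date, end: date) -> list:
    """Return every date from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days)]


dates = daily_dates(date(2002, 6, 1), date(2002, 6, 3))
urls = [make_url(d) for d in dates]

# In pangeo-forge-recipes the same pieces would feed a file pattern, roughly:
#   from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
#   pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=1))
```

The argument of `make_url` is named `time` so that it matches the name of the concat dimension in the commented-out `FilePattern` call.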
This issue should probably be on pangeo-forge/staged-recipes, but I've been working on a new branch this week for what I thought would be a relatively straightforward change to the existing MUR SST recipe branch and found that the `lat` and `lon` dimensions of the source dataset aren't working with Kerchunk.
Current recipe:
The key line here is `inline_threshold=10000`. I found that the `lat` and `lon` dataarrays are about 5 kB. So if I pick an `inline_threshold` larger than that, kerchunk copies the dataarrays into the references.
Without inline:
With inline:
I'm not entirely sure what the best practice is for handling this scenario, or how to properly detect that `identical_dims` are not in the desired format, but I'll ask around.
Separately, the resulting metadata isn't `consolidated`, which I also don't entirely understand and which may be related. I'm working through this and hope to resolve it next week.
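To make the `inline_threshold` behaviour described above concrete, here is a small stand-alone sketch of the decision it implies: chunks smaller than the threshold (in bytes) are copied into the references, and everything else stays a byte-range pointer into the source file. This is an illustration of the rule as I understand it, not kerchunk's actual code; the function name and the chunk sizes are hypothetical, with `lat` and `lon` set to roughly the 5 kB mentioned above.

```python
def split_by_inline_threshold(chunk_sizes, inline_threshold):
    """Partition chunk keys into (inlined, referenced) by size in bytes.

    Assumes a zero or negative threshold disables inlining entirely.
    """
    if inline_threshold <= 0:
        return [], sorted(chunk_sizes)
    inlined = sorted(k for k, n in chunk_sizes.items() if n < inline_threshold)
    referenced = sorted(k for k, n in chunk_sizes.items() if n >= inline_threshold)
    return inlined, referenced


# Hypothetical chunk sizes in bytes.
chunk_sizes = {
    "lat/0": 5_000,
    "lon/0": 5_000,
    "analysed_sst/0.0.0": 2_000_000,
}

# A small threshold leaves lat/lon as pointers into the source file.
small = split_by_inline_threshold(chunk_sizes, 500)

# The recipe's threshold of 10000 is above ~5 kB, so lat/lon get inlined.
recipe = split_by_inline_threshold(chunk_sizes, 10_000)
```

With these numbers, only the 10000-byte threshold pulls `lat` and `lon` into the reference file, while the large data chunks stay as references either way.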