pangeo-data / pangeo-datastore

Pangeo Cloud Datastore
https://catalog.pangeo.io

Add some ERA5 data? #49

Closed: sebastienlanglois closed this issue 3 years ago

sebastienlanglois commented 5 years ago

Hi everyone,

I'm a hydrological engineer who's been using xarray/dask for quite a while now and I was very pleased when I stumbled upon the pangeo project a couple of weeks ago!

I have downloaded about 25 variables from the ERA5 dataset from 1979 to 2019 and would be happy to upload them to the pangeo-data bucket in the zarr format. I was wondering what was the policy regarding external contribution of datasets as I'm not (yet) a contributor of this project.

Thank you!

rabernat commented 5 years ago

Hi @sebastienlanglois - thanks for your message. It is absolutely a possibility! We welcome anyone who wants to share analysis-ready data.

Regarding ERA5, many people are interested in this dataset, and, working with ECMWF (cc @StephanSiemen), we are trying to come up with a coordinated strategy. I know @jhamman is also interested in this.

Could you provide a list of the 25 variables? And explain some more details about how you have processed the data?

Let's have some discussion about whether this is the best way forward for ERA5 in the cloud.

jhamman commented 5 years ago

ping @aluhamaa, @dobrych, @etoodu from PlanetOS/Intertrust. I know they have been working on getting a zarr archive of ERA5 setup.

cc @jflasher

sebastienlanglois commented 5 years ago

This is a list of all hourly ERA5 single-levels reanalysis variables I have so far for the entire globe: [image listing the variables in the original comment]

This amounts to 17 variables which are about 25-26GB/month in size for a total of 12-13TB for the entire period (1979-2019).

As stated in my previous message, I also have the following variables but forgot to mention that it's only for the Northeastern region of North America. I suspect that there is no interest in uploading regional subsets of the ERA5 dataset so this might not be useful for the pangeo-data bucket.

The data is currently stored on a private server in the original netcdf format provided by the cds api. (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=form)

Each netcdf file contains one month of data. Here is the metadata for one netcdf file: [screenshot of the file's metadata in the original comment]

I've started to upload the 17 variables for the entire globe to a Wasabi Cloud Storage server to experiment with the zarr format using the default zarr compressor. Once the upload is completed and validated, perhaps I could transfer it to the pangeo-data bucket directly from the Wasabi server?

Also, because ERA5 is a PB-scale dataset and this is only single-levels reanalysis data, perhaps the data could be stored in the following way (or something similar): ERA5/single-levels/reanalysis.

This would allow for some flexibility if there is an interest later on to add pressure levels or ensemble members.

I'm also able to download more variables/product types if there is a need for them! Thanks again for this project and please tell me if I can contribute with what I have so far!

dobrych commented 5 years ago

@jhamman Zarr is still on our list, Joe. Thanks for tagging us here. We have a few priority issues to tackle first. We will share some estimates here soon.

rabernat commented 5 years ago

Thanks for sharing your datasets @sebastienlanglois. I am inclined to accept your global data into our archive as a temporary measure, as it will be very useful for the upcoming CMIP6 Hackathon.

For the long term, we need to work with ECMWF to do this properly. In particular, we need a path for updating the archive as new data becomes available. We don't want that to be a manual process.

I've started to upload the 17 variables for the entire globe to a Wasabi Cloud Storage server to experiment with the zarr format using the default zarr compressor.

This sounds very interesting! I'd love to experiment with Wasabi. I learned about them at the Internet2 meeting. Their storage is by far the cheapest out there.

Could you share the uri to some of the zarr datasets on wasabi? I'm curious what the performance is like if I read them from Google Cloud.

Once we are happy with the format, we should be able to use rclone to do the transfer from one cloud to another.

sebastienlanglois commented 5 years ago

Great! I'm glad that the ERA5 dataset will be made available step by step for the community. I also agree that, for the long term, since the ERA5 reanalysis is updated monthly, it makes more sense for the data to be provided directly by ECMWF.

About Wasabi, it is indeed really cheap compared to competitors. Also, a huge selling point for me is that they don't charge egress fees which can usually add up pretty quickly when working with large datasets. The downside is they lack tools and have limited documentation and support compared to the mainstream cloud service providers. However, they provide an S3-compliant interface which means that most s3 utilities/libraries are compatible with the right configuration.

This is the URI where I have one year's worth of data in the zarr format: s3://era5-single-reanalysis. Notice however that you need to configure the endpoint URL to https://s3.wasabisys.com so that it redirects to Wasabi's servers instead of AWS's.

I prefer to use the following code instead to access the data in Python:

import xarray as xr
import s3fs
import zarr

# Wasabi cloud storage configurations
client_kwargs={'endpoint_url': 'https://s3.wasabisys.com',
               'region_name':'us-east-1'}
config_kwargs = {'max_pool_connections': 30}

s3 = s3fs.S3FileSystem(client_kwargs=client_kwargs, 
                       config_kwargs=config_kwargs, 
                       anon=True)  # public read
store = s3fs.S3Map(root='era5-single-reanalysis',
                   s3=s3,
                   check=False)

with xr.open_zarr(store) as ds:  
    print(ds)
    # do stuff...

rabernat commented 5 years ago

Fantastic! I have some comments.

Your dataset looks like this:

<xarray.Dataset>
Dimensions:    (latitude: 721, longitude: 1440, time: 8760)
Coordinates:
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * longitude  (longitude) float32 0.0 0.25 0.5 0.75 ... 359.25 359.5 359.75
  * time       (time) datetime64[ns] 1979-01-01 ... 1979-12-31T23:00:00
Data variables:
    asn        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    d2m        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    e          (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    mn2t       (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    mx2t       (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    ptype      (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    ro         (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    sd         (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    sro        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    ssr        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    t2m        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    tcc        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    tcrw       (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    tp         (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    tsn        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    u10        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>
    v10        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(744, 100, 100)>

Your chunk shape is (744, 100, 100), i.e. chunked in both space and time.

Most of our other datasets are contiguous in space and chunked in time. This often, but not always, aligns with the most common use cases. With your current chunking scheme, making a map of a field for a single time snapshot, perhaps the most common operation, requires reading about 3 GB of data.
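
For concreteness, here is a quick back-of-the-envelope check of that figure; it is only a sketch using the chunk shape and float32 dtype shown in the repr above, and it lands in the same ballpark as the 3 GB quoted here:

import math

# chunk shape and grid size from the dataset repr above; float32 = 4 bytes
chunk_time, chunk_lat, chunk_lon = 744, 100, 100
n_lat, n_lon = 721, 1440

# a single time snapshot touches every spatial chunk, and each chunk must be
# read in full even though only 1 of its 744 time steps is needed
chunks_per_snapshot = math.ceil(n_lat / chunk_lat) * math.ceil(n_lon / chunk_lon)  # 8 * 15 = 120
bytes_per_chunk = chunk_time * chunk_lat * chunk_lon * 4                            # ~29.8 MB

print(chunks_per_snapshot * bytes_per_chunk / 1e9)  # ~3.6 GB, uncompressed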

Is there a reason you chose this chunk structure?

rabernat commented 5 years ago

Btw, I timed the transfer rate; using 10 dask cores, I got a throughput of about 1GB/s to google cloud us-central region. This is not bad!
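
For anyone who wants to reproduce a rough number like this, here is a sketch of one way to time it; the variable and time range are arbitrary, `store` is the S3Map from the earlier snippet, and this won't reproduce the 10-core dask setup exactly:

import time
import xarray as xr

ds = xr.open_zarr(store, consolidated=True)
sample = ds.asn.isel(time=slice(0, 744))  # roughly one month of one variable

t0 = time.perf_counter()
sample.load()
elapsed = time.perf_counter() - t0
print(f"{sample.nbytes / elapsed / 1e9:.2f} GB/s")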

sebastienlanglois commented 5 years ago

The chunk shape was a compromise between space and time. In my domain, most of our use cases involve a small/medium region with the longest time series. Then again, the initial chunk (744, 100, 100) is somewhat arbitrary and I'm totally ok with changing it to accommodate the largest pool of users. Do you have any suggestions? I can rechunk the data and upload one year of it to a new bucket to continue our testing.

1GB/s is indeed pretty good!

rabernat commented 5 years ago

If I were creating the dataset, I would probably do

ds = ds.chunk({'time': 'auto', 'latitude': -1, 'longitude': -1})

However, we may want to consider actually storing both versions, since they are useful for different use cases.

rabernat commented 5 years ago

Just for reference.


# don't use dask, to simplify timing 
ds = xr.open_zarr(store, consolidated=True, chunks=None)

%time ds.asn[:, 0, 0].load()
# CPU times: user 181 ms, sys: 126 ms, total: 308 ms
# Wall time: 1.75 s

%time ds.asn[0, :, :].load()
# CPU times: user 1.68 s, sys: 742 ms, total: 2.42 s
# Wall time: 12 s

sebastienlanglois commented 5 years ago

@rabernat I've created a new zarr store chunked according to your recommendation. The final chunk size is (31, 721, 1440). The new bucket is called era5-single-reanalysis-space-opt and is accessible from Wasabi as the previous one was.

As expected, the performance is much better for a single time snapshot:

%time ds.d2m[0, :, :].load()
# CPU times: user 336 ms, sys: 335 ms, total: 671 ms
# Wall time: 1.76 s

On the other hand,

%time ds.d2m[:, 0, 0].load()

seems to take forever.

In the end, the differences in performance between the 2 stores are significant so it might indeed be a good idea to store both versions (optimised for space vs time). We can maybe look into making the original (744, 100, 100) chunk a little bigger. However, as ECMWF updates ERA5 monthly, I don't suggest we go over 744 (one month's worth of data) in the time dimension.

rabernat commented 5 years ago

Sorry for letting this hang. I think we should move ahead with ingesting the ERA dataset into google cloud. @sebastienlanglois - can you give me your email?

sebastienlanglois commented 5 years ago

@rabernat, you can send me an email at the following address: sebastien.langlois@polymtl.ca

rabernat commented 5 years ago

I created a bucket for you and granted you storage admin on it:

gsutil mb -c standard -l us-central1 -b on gs://pangeo-era5/
gsutil iam ch user:sebastien.langlois@polymtl.ca:roles/storage.objectAdmin gs://pangeo-era5/
gsutil iam ch allUsers:objectViewer gs://pangeo-era5/

Go nuts! 😄

rabernat commented 5 years ago

@sebastienlanglois - did you manage to upload your data?

sebastienlanglois commented 5 years ago

Hi @rabernat, I plan on starting the upload beginning next week (October 21st) as I am currently on a trip and have limited access to my work environment.

However, I do have a couple of questions/points:

  1. Because the complete ERA5 is such a huge dataset, I think we should anticipate that we might want to add other product types or the pressure levels down the road. If that is the case, then we could partition the bucket with the following logic:

gs://pangeo-era5/LEVELS/PRODUCT TYPE

  2. We briefly discussed storing 2 different zarr versions of ERA5 which would be optimally chunked for queries along the space or time dimension. As I stated in another post, I was unable to run the following code
%time ds.d2m[:, 0, 0].load()

with chunks as :

ds = ds.chunk({'time': 'auto', 'latitude': -1, 'longitude': -1})

because that would mean loading all the chunks.

To accommodate all use cases, we could use both of these chunk schemes:

ds = ds.chunk({'time': 'auto', 'latitude': -1, 'longitude': -1}) # optimal for queries along space dimension
ds = ds.chunk({'time': 744, 'latitude': 100, 'longitude': 100}) # optimal for queries along time dimension

That could translate into this final logic for the bucket:

gs://pangeo-era5/LEVELS/PRODUCT TYPE/DIM-OPT

Thoughts?

sebastienlanglois commented 5 years ago

Just got back and I'm working on the ERA5 dataset right now. I am completing the last conversions to the zarr format (chunked optimally for queries along the space dimension at first). I currently have one zarr directory for each year. I was originally planning on uploading directly to the cloud one year at a time using the append_dim='time' argument of the xr.Dataset.to_zarr() method. However, I'm afraid that the connection might drop at some point before the transfer is completed. So instead I will be merging all the datasets into one zarr directory locally and uploading it via gsutil.
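
For reference, here is a minimal sketch of this kind of monthly-netCDF-to-zarr conversion. The local paths are illustrative and the chunk sizes follow the spatial-analysis chunking discussed above; this is not the exact pipeline used here:

import xarray as xr

# one netCDF file per month for a given year (path is illustrative)
ds = xr.open_mfdataset("era5/1979/*.nc", combine="by_coords", parallel=True)

# chunk for spatial analysis: contiguous in space, small chunks in time
ds = ds.chunk({"time": 31, "latitude": -1, "longitude": -1})

# drop any chunk encoding inherited from the netCDF files so it doesn't
# conflict with the dask chunks chosen above
for name in ds.variables:
    ds[name].encoding.pop("chunks", None)

ds.to_zarr("era5-1979-spatial-analysis.zarr", mode="w", consolidated=True)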

As I have about 50 Mbps of dedicated upload speed, it might take up to 2-4 weeks to get both datasets (optimal for space vs. time queries) online. Sorry for the delay but we're almost there!

Also, thinking long term, what is the best strategy to append data as it is being made available? I understand that xr.Dataset.to_zarr(..., append_dim='time' ) can do the job but comes at a great risk of not completing the transfer if the connection drops. For testing purposes, I have converted a dataset to the zarr format by intentionally using the to_zarr() method two times in a row to simulate a case where the connection drops and the whole dataset is converted again. Even if the data is the same, it is not overwritten and my final zarr store becomes twice as big because it has duplicates. How can we avoid that risk when appending to large datasets already in the cloud?

rabernat commented 5 years ago

Hi @sebastienlanglois. For this project, I think we should just focus on a single snapshot of the data. As you stated, the updating problem is harder. IMO, it should really be handled by ECMWF itself. That's why I hope @jwagemann, @StephanSiemen, etc. will chime in and let us know their thoughts.

jwagemann commented 5 years ago

Hi @rabernat and @sebastienlanglois, from next week on I will be able to start as well on the already-discussed project to set up a Pangeo cluster with ERA5 data on AWS. Since there are already other activities from @sebastienlanglois, and, as I remember, Intertrust Technologies (who are in charge of hosting a subset of ERA5 on AWS) are currently working on converting the data to zarr as well, I suggest we perhaps have a call to see how we can best bring all these activities together. It does not make sense to reinvent the wheel. I also suggest not focusing on the entire ERA5 archive at the beginning, but on a subset of the most popular parameters.

rabernat commented 5 years ago

Hi @jwagemann! Thanks for your reply!

I think a planning call would be a great idea. I am extremely busy right now with some proposal deadlines approaching. Perhaps we could aim for the first week of December? Would that work for you?

Are we committed 100% to AWS? Much of the rest of our data, including CMIP6 is in Google Cloud. Moving the data across clouds can be expensive. We have some ideas to deal with this, so perhaps it's not a big deal.

jwagemann commented 5 years ago

First week of December is perfect for me, @rabernat ! I am fairly flexible and you can suggest a day / time that suits you. We will use AWS for now as we have the resources in form of cloud credits. However, ECMWF is setting up its own cloud at the moment and there is certainly interest to investigate options to port between systems, once it is mature enough.

rabernat commented 5 years ago

I created a poll to find a suitable time. This will be hard because we are spanning many time zones. If the options I included don't work, we can try again: https://www.when2meet.com/?8390372-cBEOl

rabernat commented 5 years ago

@chiaral - maybe you are interested in joining this as well?

chiaral commented 5 years ago

Hello all! I am extremely interested in ERA5 data on model levels! For reference, ERA5 on pressure levels are available here

I am interested (unfortunately!) in 3D variables (q,t,h,u,v) because I need to calculate quantities such as CAPE and SRH, which need the whole column. But I understand if those variables might not be the first to be made available, because they are so large. I will join the call. I am very excited to see ECMWF exploring alternatives for users to retrieve ERA5 data (especially at model levels). I know of people who took many months to download them, and some who have just given up and turned to the NCAR/UCAR/RDA pressure-level data.

rabernat commented 5 years ago

Pinging @aluhamaa, @dobrych, @etoodu from PlanetOS/Intertrust and @jflasher from AWS in case they want to join the call.

jwagemann commented 4 years ago

Is it safe to say that the call will be Wednesday, 4 Dec at 9 pm CET?

aluhamaa commented 4 years ago

@jwagemann I would like to join the call, 4 Dec at 9 pm CET is fine.

rabernat commented 4 years ago

Ryan Abernathey is inviting you to a scheduled Zoom meeting.

Topic: Pangeo ERA5 Cloud Discussion
Time: Dec 4, 2019 03:00 PM Eastern Time (US and Canada)

Join Zoom Meeting https://columbiauniversity.zoom.us/j/926420868

Meeting ID: 926 420 868


rabernat commented 4 years ago

I have created a preliminary agenda here: https://docs.google.com/document/d/1eIOLp43xCNUMuMx9wdcmi276xyswVYwkia_TT9lIBNE/edit?usp=sharing

See you all in a few minutes.

rabernat commented 4 years ago

Thanks to all who participated in the call today.

I created a new repo where we can collaborate on the ERA5 --> cloud storage pipeline: https://github.com/pangeo-data/pangeo-era5.

I ran some interesting experiments with lazy loading via the CDS API, which I describe a bit here: https://github.com/pangeo-data/pangeo-era5/issues/1

rabernat commented 4 years ago

Happy new year everyone. I thought I would just ping this issue to see if there is anything we can do to help move the ERA5 cloud project forward.

sebastienlanglois commented 4 years ago

Happy new year to everyone also!

17 variables of ERA5 single levels have been added to the bucket below. The chunks are optimized for spatial analysis. We can look into creating another chunked dataset for time-series analysis, as we've discussed previously, if people are interested.

Also, I have noticed that some variables are empty for the first 7 hours of the entire dataset in the zarr format (while they were fine in the original netcdfs). While it is negligible compared to the whole dataset, it is still somewhat inconvenient. I think some methods in xarray to convert netcdfs to zarr do not always work as expected, especially with large datasets. I had some difficulties with the append mode in xr.Dataset.to_zarr() which resulted in some missing values in the final zarr despite the original dataset having all the values. My most successful attempt used xr.open_mfdataset before converting to the zarr format. **Edit:** After verification, the missing 7 hours are normal for some variables in the ERA5 dataset.

Anyways, have fun exploring ERA5!

import gcsfs  # gcsfs must be installed for the gcs:// protocol used below
import fsspec
import xarray as xr

with xr.open_zarr(fsspec.get_mapper('gcs://pangeo-era5/reanalysis/spatial-analysis'),
                  consolidated=True,
                  chunks='auto') as ds:
    print(ds)

rabernat commented 4 years ago

Fantastic! Thanks @sebastienlanglois! Could you consider making a PR to add these datasets to the intake catalog which lives in this repo?

sebastienlanglois commented 4 years ago

Yes @rabernat I can make a PR to add era5 to the intake catalog.

Btw, I looked into it a little more and it appears that some variables indeed begin at 1979-01-01 00:00 UTC while some others begin at 1979-01-01 07:00 UTC. For instance, total precipitation is one such variable. This can be observed on the CDS webpage when making a request for that date (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=form).

rabernat commented 4 years ago

it appears that some variables indeed begin at 1979-01-01 00:00 UTC while some others begin at 1979-01-01 07:00 UTC.

Xarray can easily handle these alignment issues downstream.
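
For example, here is a hedged sketch of what that could look like downstream; the variable name and cut-off time follow the discussion above, and `store` is any mapper pointing at the zarr store:

import xarray as xr

ds = xr.open_zarr(store, consolidated=True)

# Option 1: trim everything to the period where all variables are defined
common = ds.sel(time=slice("1979-01-01T07:00", None))

# Option 2: drop the leading all-NaN hours for one variable only
# (note: this scans the data, so it is expensive on the full archive)
tp = ds.tp.dropna(dim="time", how="all")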

rabernat commented 4 years ago

@sebastienlanglois - this catalog gives a good template for how an ERA5 intake catalog might look. Let me know if I can give you any guidance! I'm excited to get this dataset into the catalog so users can access it!

dgergel commented 4 years ago

@sebastienlanglois - was just looking at the ERA-5 data you've added to the bucket. Do you plan to add hourly data since 2018, or daily ERA-5 data?

sebastienlanglois commented 4 years ago

@dgergel - yes, it is my intention to keep the bucket updated; however, I would like to automate the pipeline rather than appending the data manually.

I've just referenced another issue that partly addresses this subject (https://github.com/pangeo-data/pangeo-datastore/issues/72)

rabernat commented 4 years ago

@sebastienlanglois has just merged a catalog entry in #89 pointing to the dataset he created. Also browseable here: https://catalog.pangeo.io/atmosphere/era5_hourly_reanalysis_single_levels_sa

Thanks so much @sebastienlanglois for all your work to make this happen! There is still much to do of course, but it's a great start.

If anyone wants to write a blog post about this and / or create a binder to show how to access the data, you are more than welcome to use the Pangeo channels to share this.

sebastienlanglois commented 4 years ago

I will work on creating a binder for ERA5's dataset.

aluhamaa commented 4 years ago

Hi, in case you happen to be interested, there is now a small subset of ERA5 near-surface variables on AWS under the path s3://era5-pds/zarr/. There is a separate zarr object for each month and each variable. A simple sample notebook: https://github.com/planet-os/notebooks/blob/master/api-examples/ERA5_zarr_example.ipynb
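
As an illustration, here is a hedged sketch of opening one month of one variable from that archive; the key layout (zarr/YYYY/MM/data/<variable>.zarr) and the variable name follow the linked example notebook and may change:

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)  # the bucket allows anonymous reads
store = s3fs.S3Map(
    root="era5-pds/zarr/2008/01/data/air_temperature_at_2_metres.zarr",
    s3=fs,
    check=False,
)
ds = xr.open_zarr(store)
print(ds)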

Best regards, Andres Luhamaa


raspstephan commented 4 years ago

First of all thanks to @sebastienlanglois and @rabernat for putting the data on there. However, as described above

On the other hand,

%time ds.d2m[:, 0, 0].load()

seems to take forever.

doing temporal analysis with the small temporal chunks is nearly impossible. For my use case, I would be very interested in rechunking as @sebastienlanglois describes. I would be happy to lead this as well.

BTW, how large is the dataset that is currently stored?

rabernat commented 4 years ago

For my use case, I would be very interested in rechunking as @sebastienlanglois describes. I would be happy to lead this as well.

Yes, I think we want to make a temporary rechunked copy of the dataset for temporal analysis. That's what rechunker is for.

You will need write access to a bucket. On pangeo cluster, you can use the scratch bucket.
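
Here is a minimal sketch of what such a rechunking job could look like with rechunker; the scratch-bucket paths, target chunks, and memory limit are illustrative, not a vetted recipe:

import fsspec
import xarray as xr
from rechunker import rechunk

source = xr.open_zarr(
    fsspec.get_mapper("gs://pangeo-era5/reanalysis/spatial-analysis"),
    consolidated=True,
)

# rechunk every data variable for time-series access; copy coordinates as-is
target_chunks = {v: {"time": 744, "latitude": 100, "longitude": 100} for v in source.data_vars}
target_chunks.update({c: None for c in source.coords})

plan = rechunk(
    source,
    target_chunks=target_chunks,
    max_mem="2GB",
    target_store=fsspec.get_mapper("gs://<your-scratch-bucket>/era5-time-opt.zarr"),
    temp_store=fsspec.get_mapper("gs://<your-scratch-bucket>/era5-rechunk-tmp.zarr"),
)
plan.execute()  # runs as a dask computation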

BTW, how large is the dataset that is currently stored?

According to the data catalog (https://catalog.pangeo.io/browse/master/atmosphere/era5_hourly_reanalysis_single_levels_sa/), each variable is about 1.7 TB uncompressed, and there are 16 variables.

sebastienlanglois commented 4 years ago

I'm also still interested in a rechunked version of this dataset for temporal analysis.

Would it be useful to update it before rechunking the dataset? On my end, I've been hesitant to keep it updated, not because of time constraints but mostly because I'm not sure of the proper way to do it. My understanding is that I would need to copy the dataset into a temporary (or scratch) bucket, append the new data to it, and then transfer everything back to the original bucket. Appending data directly into the ERA5 bucket seems too dangerous should an error happen. Keeping the dataset's size in mind, I'm unsure whether that approach would be too costly.

Curious to hear some thoughts on this.

Btw, kudos for rechunker, it's a real game changer!

raspstephan commented 4 years ago

@sebastienlanglois Let me know if I can help in any way to speed this up. This suddenly became very high on my priority list.

I would also love to have 2019 on there.

rabernat commented 4 years ago

Would it be useful to update it before rechunking the dataset? On my end, I've been hesitant to keep it updated not because of time constraint but mostly because I'm not sure of the proper way to do it.

The "proper way" to do it doesn't exist yet. That's what Pangeo Forge will solve!

Until Pangeo Forge is ready, we just use an ad-hoc process. @sebastienlanglois - I have no problem with you appending to the existing zarr dataset. It should work using either low-level zarr or high-level xarray + to_zarr(append_dim='time'). I would perhaps consider practicing this on a smaller dataset to make sure you understand the details first.
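
For reference, here is a hedged sketch of that append pattern, with a simple guard against writing the same period twice; the file name, chunking, and duplicate check are illustrative, not a vetted recipe:

import fsspec
import xarray as xr

mapper = fsspec.get_mapper("gs://pangeo-era5/reanalysis/spatial-analysis")
existing = xr.open_zarr(mapper, consolidated=True)

# one new month of data, chunked to match the chunks of the existing store
new_month = xr.open_dataset("era5_new_month.nc").chunk(
    {"time": 31, "latitude": -1, "longitude": -1}
)

# only append timestamps that are not already in the store, so a retried
# upload does not create duplicates
new_month = new_month.sel(time=new_month.time > existing.time[-1])
if new_month.sizes["time"] > 0:
    new_month.to_zarr(mapper, append_dim="time", consolidated=True)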

rabernat commented 4 years ago

Btw, @sebastienlanglois + @raspstephan, we could really use your input on pangeo-forge here: https://github.com/pangeo-forge/staged-recipes/issues

Basically, we want to make it as easy as possible for people to maintain these cloud-optimized datasets going forward. To do that, we are trying to gather as many use cases as possible.

hansukyang commented 4 years ago

Hi @sebastienlanglois, @raspstephan and @rabernat, not sure if you'd be interested, but I have a poor man's workflow (without HPC) for re-chunking ERA5 to be optimized for time-series extraction. The end result (time-series data) is available through a REST API (https://oikolab.com) in JSON format, with a couple of dozen parameters from 1980 onward, updated daily. About 30 TB worth of ERA5 parameters were processed on a consumer-tier NAS with 16 GB of memory, taking about a week or so to run. It's lower on my priority list, but I will also be processing ERA5-Land data shortly, a substantially larger dataset.

rabernat commented 4 years ago

Hi @hansukyang--thanks for this generous offer!

We have not tended to use JSON within Pangeo, although I can imagine there are some advantages, particularly for javascript-based analysis / visualization. Is JSON the primary storage format, or are you using Zarr under the hood? Have you looked into Zarr.js? It's possible that one might be able to bypass JSON completely, even for interactive js applications. @jhamman has been working on this.