pangeo-data / foss4g-2022

Pangeo tutorial at FOSS4G 2022
https://pangeo-data.github.io/foss4g-2022

Python package and notebook update #5

Closed pl-marasco closed 2 years ago

pl-marasco commented 2 years ago

Even if both the notebook and the package are not fully ready, I think it could be useful for all of you to have all the components needed to start testing the notebook.

review-notebook-app[bot] commented 2 years ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



annefou commented 2 years ago

Great! You download GeoTIFF, right?

pl-marasco commented 2 years ago

No, the data are in NetCDF format; GeoTIFF can be downloaded through the legacy portal.

annefou commented 2 years ago

So they don't have Cloud-Optimized GeoTIFF?

I would not suggest that an end-user download a 2 GB NetCDF file to use a tiny portion of it for a study. I would tell them to use the legacy portal and order the portion they need.

pl-marasco commented 2 years ago

There is no easy answer to your question; as already mentioned, you can order GeoTIFF (not COG) through the portal, but the manifest exposes only NetCDF files.
Ordering can take up to a couple of hours (to be optimistic), which is why most users prefer to download the entire dataset and then resample it. In any case, the notebook is entirely based on a precooked subset.
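The "tiny portion of a 2 GB file" pattern above is where lazy subsetting with xarray helps; the sketch below uses a synthetic stand-in for a CGLS NDVI array (grid, extent, and variable name are illustrative, not the real product layout), but the `.sel` call is the same one you would use on a lazily opened file:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a CGLS NDVI file; with a real 2 GB file you
# would use xr.open_dataset("c_gls_NDVI_....nc", chunks={}) so that data
# outside the selection is never read into memory.
lat = np.linspace(80.0, -60.0, 141)   # 1-degree grid, descending latitudes
lon = np.linspace(-180.0, 179.0, 360)
ndvi = xr.DataArray(
    np.random.rand(lat.size, lon.size).astype("float32"),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
    name="NDVI",
)

# Select only the study area; note the descending order of the lat slice,
# matching the coordinate direction.
subset = ndvi.sel(lat=slice(50.0, 40.0), lon=slice(0.0, 10.0))
print(subset.sizes)
```

This does not remove the initial download, of course; it only makes working with the file afterwards cheap.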

review-notebook-app[bot] commented 2 years ago

View / edit / reply to this conversation on ReviewNB

acocac commented on 2022-07-22T08:25:46Z ----------------------------------------------------------------

I suggest using hvplot with sliders to choose lat and lon

pl-marasco commented 2 years ago

@acocac I'm not fully convinced. The point is to make a comparison over the same exact point, and sliders can take longer to set over two different datasets. In any case, I would love to change my mind once I have seen an example. Could you provide me a working one?

acocac commented 2 years ago

@pl-marasco you can see some examples in the hvplot documentation (see here). You could play with a variable other than lat/lon. I think it would be nice if all notebooks maximised interactive plotting where possible.

pl-marasco commented 2 years ago

Ahhh ok ... so it isn't a specific note but more of a general one. As there is already an example of hvplot with a slider on the time dimension I thought that was enough, but in any case I will add some more examples. As already mentioned, feel free to create a merge request on my fork if you have any ideas.

rsignell-usgs commented 2 years ago

Is there a way to access the data from cloud object storage instead of download?

pl-marasco commented 2 years ago

@rsignell-usgs not that I know of.

The only alternative I could imagine would be to fully rely on OpenEO. I've tested the option of making a STAC request, but unfortunately it seems there is an issue, which has been confirmed by the distributor. Moreover, even if the idea of using OpenEO isn't bad per se, not all the products are available; Long Term Statistics are not.

An option could be to mix the approaches: store the LTS directly on the EGI deployment and download the S3 NDVI time series through the VITO OpenEO backend https://openeo.vito.be .

Later on, to give everybody the possibility to run the notebook, a copy of the LTS data should be made available through Zenodo as @annefou suggested.

rsignell-usgs commented 2 years ago

@pl-marasco, the reason I asked is that I've personally experienced a few workshops that struggled when step 1 was "download data", especially on a JupyterHub or Binder hub where the local filesystems are NFS-mounted and slow. What is the infrastructure that people will be using to run the notebooks? (I apologize if this is already documented/discussed.)

I was thinking that perhaps the tutorial data could be downloaded and then put on the cloud, so that attendees could see what a cloud-based workflow looks like?

annefou commented 2 years ago

Thanks @rsignell-usgs, you are right. Downloading datasets during a training can be challenging with poor wifi. Datasets will be made available on the infrastructure.

For the infrastructure, the plan is to use the Jupyter deployment (JupyterHub + Dask cluster) we are setting up on the European Open Science Cloud.

pl-marasco commented 2 years ago

@rsignell-usgs I'm really happy that you shared your perspective and experience on this. As there seem to be lots of concerns about this, I decided to change the notebook and partially rely on the OpenEO infrastructure, as already mentioned; this will give us the possibility to select and download smaller areas, avoiding big files.

The LTS, as Anne mentioned, will be made available on the infrastructure.

Once the stress test we are conducting is over, and if you are interested, you are more than welcome to test the entire notebook directly on the infrastructure.

rsignell-usgs commented 2 years ago

@annefou I tried googling but failed. Is the European Open Science Cloud running on a commercial cloud provider, or is it running OpenStack/Ceph at an HPC center or something?

guillaumeeb commented 2 years ago

@rsignell-usgs currently, things are deployed on EGI, which makes the infrastructure question much clearer. So this is closer to the second option: a federation of resources from European data centers, often academic HPC facilities with some OpenStack resources on them.

EOSC is a bit blurry to me; it is kind of an EGI 2.0, with more resources and probably a commercial cloud component, but I'm really not sure. Maybe @annefou knows more about it.

review-notebook-app[bot] commented 2 years ago

View / edit / reply to this conversation on ReviewNB

guillaumeeb commented on 2022-07-27T07:08:13Z ----------------------------------------------------------------

One of the interests of the Pangeo stack is being able to browse big datasets without processing everything unless needed (or processing everything in parallel), but here we use OpenEO to filter the request and work on a small subset of the data. It's still interesting to see the point of OpenEO vs Pangeo.


pl-marasco commented on 2022-07-27T08:22:40Z ----------------------------------------------------------------

This is not a specific answer to your comment, but just to sum up the situation.

My original thought was to avoid OpenEO and leverage the STAC + Pangeo stack. Unfortunately, I ran into lots of problems:

  • As there is no way to make a REST request for data on the CGLS, the only option I could imagine is a simple facilitator that leverages the manifest made available by VITO.
  • Files from the CGLS are roughly 2 GB each; almost all the participants in this discussion suggested avoiding the nightmare of downloading them live over low-speed WiFi.
  • There is no cloud object storage for these datasets; none of AWS, GCP, Planetary Computer, or others has ingested them (most probably you can only find them on a DIAS). I can give you more details if you need them. It would be useful if a pool of users requested the ingestion from one of the big players.
  • Copernicus Global Land Service has no STAC catalog; moreover, the STAC compatibility of the OpenEO version isn't properly set up, so the requests fail. I sent an email asking for clarification and an update, and the person responsible answered that they will address this in the next tender.
  • The only way to get data from the CGLS over a specific AOI is to make an order on the legacy portal. Right now requests are processed quickly, but I can't imagine it with 30 simultaneous ones. Moreover, there is no option to receive NetCDF format, only GeoTIFF; this means you will get a single file per band you ordered (from my perspective a nightmare that I would never want to teach as a correct approach).
  • I love all the concerns that you folks have about the CGLS distribution. It's been 5 years that I've been expressing my opinion on it, and at least now I feel a little less alone.
_guillaumeeb commented on 2022-07-28T13:35:30Z_ ----------------------------------------------------------------
almost all the participants in this discussion suggested avoiding the nightmare of downloading them live through low-speed WiFi.

The idea would be to download it from the CESNET infrastructure, isn't it? But I agree it is still bad to download it live with dozens of users.

There is no cloud object storage for these datasets

Let's make one on CESNET with a subset!

Moreover there is no option to have in NetCDF format, only in GeoTiff; this means that you will get a single file per each band you ordered (from my perspective is a nightmare that I would never like to teach as a correct approach)

And you could have several bands in one GeoTIFF. But anyway, this means that if we download a (big) subset, we'll have to rework the data to make it analysis-ready?
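That rework step can be sketched with xarray. The band names below are illustrative, and synthetic arrays stand in for the single-band GeoTIFFs the legacy portal delivers; in practice each array would come from something like `rioxarray.open_rasterio("<band>.tif").squeeze()`.

```python
import numpy as np
import xarray as xr

# Shared grid for all bands (stand-in for the GeoTIFF georeferencing).
coords = {"lat": np.linspace(46.0, 45.0, 10), "lon": np.linspace(5.0, 6.0, 10)}

# One DataArray per ordered band; with the portal's output, each of these
# would be read from its own single-band GeoTIFF file.
bands = {
    name: xr.DataArray(np.random.rand(10, 10).astype("float32"),
                       coords=coords, dims=("lat", "lon"))
    for name in ("NDVI", "NDVI_unc", "NOBS")  # illustrative band names
}

# Merge the per-band arrays into one analysis-ready Dataset.
ds = xr.Dataset(bands)
print(list(ds.data_vars))
```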

pl-marasco commented on 2022-07-28T14:39:43Z ----------------------------------------------------------------

The idea would be to download it from the CESNET infrastructure, isn't it? But I agree it is still bad to download it live with dozens of users.

No, data are not available from CESNET so we have to rely on VITO

 Let's make one on CESNET with a subset!

That was my first suggestion, and then we opted to make the notebook more usable and let users download the data.

And you could have several bands in one GeoTiff.

Nope, if you order data from VITO you can't, as they will split the bands into different files.

But anyway, this means if we download a (big) subset, we'll have to rework the data to make it analysis ready?

Yep; in any case this will be true for the Long Term Statistics, which are converted to Zarr format.

_guillaumeeb commented on 2022-07-28T16:48:31Z_ ----------------------------------------------------------------
No, data are not available from CESNET so we have to rely on VITO

Yeah, I meant that users should be on the CESNET infrastructure when they download data, so hopefully they have good bandwidth.

That was my first suggestion, and then we opted to make the notebook more usable and let users download the data.

It is great to show these steps. But as we are talking about scaling, we might need a pre-download.

Nope, if you order data from VITO you can't as they will split bands in different files.

Right, I didn't mean to say you can get it that way from VITO, but that it would be possible in general; that was unclear.