Closed pl-marasco closed 2 years ago
Great! You download GeoTIFF, right?
No, the data are in NetCDF format; GeoTIFF can be downloaded through the legacy portal.
So they don't have Cloud-Optimized GeoTIFF?
I would not suggest that an end user download a 2 GB NetCDF file to use a tiny portion for a study. I would tell them to use the legacy portal and order the portion they need.
There is no easy answer to your question; as already mentioned, you can order GeoTIFF (not COG) through the portal, but the manifest exposes only NetCDF files.
Ordering can take up to a couple of hours (to be optimistic), which is why most users prefer to download the entire dataset and then resample it.
In any case, the notebook is entirely based on a precooked subset.
acocac commented on 2022-07-22T08:25:46Z ----------------------------------------------------------------
I suggest using hvplot with sliders to choose lat and lon.
@acocac I'm not fully convinced. The point is to make a comparison over the exact same point, and sliders can take longer to set over two different datasets. In any case, I would love to change my mind once I see an example. Could you provide me a working one?
@pl-marasco you can see some examples in the hvplot documentation (see here). You could play with another variable besides lat/lon. I think it would be nice if all notebooks maximised interactive plotting where possible.
Ahhh ok ... so it isn't a specific note but more a general one. As there is already an example of hvplot with a slider on the time dimension I thought that was enough, but in any case I will add some more examples. As already mentioned, feel free to create a merge request on my fork if you have any ideas.
Is there a way to access the data from cloud object storage instead of download?
@rsignell-usgs not that I know of
The only alternative I could imagine would be to rely fully on OpenEO. I've tested the option of making a STAC request, but unfortunately it seems there is an issue, which has been confirmed by the distributor. Moreover, even if the idea of using OpenEO isn't bad per se, not all the products are available; the Long Term Statistics are not.
An option could be to mix the approaches: store the LTS directly on the EGI deployment and download the S3 NDVI time series through the VITO OpenEO endpoint, https://openeo.vito.be .
Later on, to give everybody the possibility to run the notebook, a copy of the LTS data should be made available through Zenodo, as @annefou suggested.
@pl-marasco, the reason I asked is that I've personally experienced a few workshops that struggled when step 1 was "download data", especially on a JupyterHub or Binder hub where the local filesystems are NFS-mounted and slow. What is the infrastructure that people will be using to run the notebooks? (I apologize if this is already documented/discussed.)
I was thinking that perhaps the tutorial data could be downloaded and then put on the cloud, so that attendees could see what a cloud-based workflow looks like?
Thanks @rsignell-usgs, you are right. Downloading datasets during a training session can be challenging with poor wifi. Datasets will be made available on the infrastructure.
For the infrastructure, the plan is to use the Jupyter deployment (JupyterHub + Dask cluster) we are setting up on the European Open Science Cloud.
@rsignell-usgs I'm really happy that you shared your perspective and experience on this. As there seem to be lots of concerns about this, I decided to change the notebook and partially rely on the OpenEO infrastructure, as already mentioned; this will give us the possibility to select and download smaller areas, avoiding big files.
The LTS, as Anne mentioned, will be made available on the infrastructure.
Once the stress test we are conducting is over, and if you are interested, you are more than welcome to test the entire notebook directly on the infrastructure.
@annefou I tried googling but failed. Is the European Open Science Cloud running on a commercial cloud provider, or is it running OpenStack/Ceph at an HPC center or something?
@rsignell-usgs currently, things are deployed on EGI, which makes things much clearer in terms of infrastructure. So this is closer to the second option: a federation of resources from European data centers, often academic HPC facilities with some OpenStack resources on them.
EOSC is a bit blurry to me; it is kind of EGI 2.0, with more resources, and probably partly commercial cloud, but I'm really not sure. Maybe @annefou knows more about it.
guillaumeeb commented on 2022-07-27T07:08:13Z ----------------------------------------------------------------
One of the interests of the Pangeo stack is being able to browse big datasets without processing everything if not needed (or processing everything in parallel), but here it is OpenEO we use to filter the request and work on a small subset of the data. It's still interesting to see the point of OpenEO vs Pangeo.
pl-marasco commented on 2022-07-27T08:22:40Z ----------------------------------------------------------------
This is not a specific answer to your comment, but just to take stock of the situation.
My original thought was to avoid OpenEO and leverage the STAC+Pangeo stack. Unfortunately, I ran into lots of problems, such as:
almost all the participants in this discussion suggested avoiding the nightmare of downloading them live through low-speed WiFi.
The idea would be to download it from the CESNET infrastructure, isn't it? But I agree it is still bad to download it live with dozens of users.
There is no cloud object storage for these datasets
Let's make one on CESNET with a subset!
Moreover, there is no option to have it in NetCDF format, only GeoTIFF; this means you get a single file for each band you ordered (from my perspective a nightmare that I would never like to teach as a correct approach).
And you could have several bands in one GeoTIFF. But anyway, this means that if we download a (big) subset, we'll have to rework the data to make it analysis-ready?
pl-marasco commented on 2022-07-28T14:39:43Z ----------------------------------------------------------------
The idea would be to download it from the CESNET infrastructure, isn't it? But I agree it is still bad to download it live with dozens of users.
No, data are not available from CESNET so we have to rely on VITO
Let's make one on CESNET with a subset!
That was my first suggestion, and then we opted to make the notebook more usable and let users download the data.
And you could have several bands in one GeoTiff.
Nope, if you order data from VITO you can't, as they will split the bands into different files.
But anyway, this means if we download a (big) subset, we'll have to rework the data to make it analysis ready?
Yep; in any case, this will be the case for the Long Term Statistics, which are converted to Zarr format.
_guillaumeeb commented on 2022-07-28T16:48:31Z_ ----------------------------------------------------------------
No, data are not available from CESNET so we have to rely on VITO
Yeah, I meant that users should be on the CESNET infrastructure when they download data, so hopefully they have good bandwidth.
That was my first suggestion, and then we opted to make the notebook more usable and let users download the data.
This is great to show these steps. But as we are talking about scaling, we could need a pre-download.
Nope, if you order data from VITO you can't as they will split bands in different files.
Right, I didn't mean to say you can have it that way on VITO, but that it would be possible in principle; that was unclear.
guillaumeeb commented on 2022-07-27T07:08:14Z ----------------------------------------------------------------
Line #1. datacube.download("C_GLS_NDVI_20220101_20220701_CENTRALITALY_S3_2.nc", format="NetCDF")
This is probably out of scope here, but it would be really interesting to be able to use Xarray directly on this datacube object without having to download it all as a NetCDF.
I don't know if that's possible, and I have some doubts. What you can do is achieve most of the things we do directly using OpenEO. But then the question is all too easy to ask: why not use OpenEO entirely? This is why I tried to push back on using it.
_guillaumeeb commented on 2022-07-28T13:38:14Z_ ----------------------------------------------------------------
I don't think it's possible currently, but don't you think it would be nice to be able to do that? It's probably feasible with some coding effort; you must already have all the metadata needed to build a DataArray before downloading anything.
And as for the question, yes, I am asking it! Maybe one point is when you want to do the processing at scale: how will OpenEO handle processing over a wide area? Does it distribute things on the CGL side?
_pl-marasco commented on 2022-07-28T14:14:38Z_ ----------------------------------------------------------------
Yes indeed, and this is something that should be discussed with the OpenEO counterpart.
I'm not an expert on OpenEO, so I'm not the right person to ask; anyhow, it seems that if you need to, you can batch some computation directly on the server and just get the results. That's something I would like to understand better, especially with a view to scalability over different platforms.
guillaumeeb commented on 2022-07-27T07:08:15Z ----------------------------------------------------------------
This is a bit tough, even if probably necessary. It would probably be better to have some examples to illustrate dimensions and coordinates.
pl-marasco commented on 2022-07-27T08:09:38Z ----------------------------------------------------------------
Agreed; it's only there to remind me that I have to better describe all this terminology with some examples.
guillaumeeb commented on 2022-07-27T07:08:15Z ----------------------------------------------------------------
What's the 9.537e-7?
pl-marasco commented on 2022-07-27T08:09:58Z ----------------------------------------------------------------
It's the conversion factor from bytes to megabytes (1/1024² ≈ 9.537e-7).
guillaumeeb commented on 2022-07-28T13:40:04Z ----------------------------------------------------------------
I would suggest to make it more explicit:
print(f'{np.round(cgls_ds.NDVI.nbytes / 1024**2, 2)} MB')
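For reference, the factor under discussion can be derived directly; a minimal sketch, where the byte count is a made-up stand-in for `cgls_ds.NDVI.nbytes`:

```python
# 9.537e-7 is (approximately) 1 / 1024**2, the bytes -> MB (MiB) factor
factor = 1 / 1024**2
print(f"{factor:.4e}")  # 9.5367e-07

# Hypothetical byte count standing in for cgls_ds.NDVI.nbytes
nbytes = 350_000_000
print(f"{round(nbytes / 1024**2, 2)} MB")
```

Writing the divisor as `1024**2` makes the intent obvious, whereas the bare multiplier `9.537e-7` has to be reverse-engineered by the reader.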
guillaumeeb commented on 2022-07-27T07:08:16Z ----------------------------------------------------------------
through
guillaumeeb commented on 2022-07-27T07:08:17Z ----------------------------------------------------------------
Line #1. NDVI = cgls_ds.NDVI * (1/250) - 0.08
I think I get it, but maybe it should be explained? So when querying OpenEO, you don't get the real physical values?
Right; in the previous version of the notebook I explained all this through examples. The original datasets come with the correct attributes that define the scale and the offset, and you can convert between the two through the mask_and_scale parameter. Unfortunately, the OpenEO version doesn't allow this.
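The decoding being discussed (raw digital numbers to physical NDVI) can be sketched with plain NumPy; the DN values below are illustrative, while the scale (1/250) and offset (-0.08) come from the notebook line quoted above:

```python
import numpy as np

# Raw digital numbers (DN) as delivered without CF decoding; values illustrative
dn = np.array([0, 100, 250], dtype=np.uint8)

# Apply the product's scale/offset: physical = dn * scale + offset
scale, offset = 1 / 250, -0.08
ndvi = dn * scale + offset
print(ndvi)  # [-0.08  0.32  0.92]
```

When the `scale_factor`/`add_offset` attributes are present in the file, `xarray.open_dataset(..., mask_and_scale=True)` (the default) applies this conversion automatically; the manual version is only needed when, as here, the attributes are missing.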
_acocac commented on 2022-07-28T20:11:17Z_ ----------------------------------------------------------------
Something is wrong in OpenEO then, missing all the relevant attributes (metadata) :/ Is it only related to the version? Is it in their project roadmap?
_pl-marasco commented on 2022-07-29T06:40:55Z_ ----------------------------------------------------------------
I can't answer your question, as I'm not involved in the decision process.
Anyhow, I was a little surprised too, but most probably in a couple of days I'm going to revert some parts of the notebook, as I realized it is too important to demonstrate some components that can't be highlighted through OpenEO.
_acocac commented on 2022-07-29T10:32:23Z_ ----------------------------------------------------------------
No worries, it'll be good to report these to OpenEO. It seems the OpenEO developers share their roadmap through the GitHub repository milestones, see here: https://github.com/Open-EO/openeo-api/milestones.
Information loss is well addressed in rioxarray: https://corteva.github.io/rioxarray/html/getting_started/manage_information_loss.html?highlight=gdal. Not sure how OpenEO handles their collections and why they're missing such relevant attributes, unfortunately.
There is no real loss of attributes; the data are correctly distributed according to the PUM (Product User Manual).
In any case, IMHO it is more related to how Terrascope/VITO has implemented the OpenEO server. Most probably this data repository is still considered a beta, so we should just be glad that they did what they did.
guillaumeeb commented on 2022-07-27T07:08:18Z ----------------------------------------------------------------
Why is that?
pl-marasco commented on 2022-07-27T08:13:00Z ----------------------------------------------------------------
This comes from the past; instead of using a different band for the flags, some are coded directly into the NDVI values [251, 252, 253, 254, 255]: 255 stands for invalid pixels, 254 for sea, 253 for snow or cloud, etc.
The best option would have been to use the quality band in conjunction with these values, but then I would have had to spend time explaining why they are coded in bits, and most users would get immediately lost.
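The flag handling described above can be sketched with NumPy; the DN values are illustrative, and the flag codes (251-255) and the scale/offset come from this discussion:

```python
import numpy as np

# Raw DN values; 251-255 are flag codes (255 invalid, 254 sea, 253 snow/cloud, ...)
dn = np.array([0, 125, 250, 253, 255], dtype=np.uint8)

# Replace the flag codes with NaN *before* applying the physical scaling,
# so flags never masquerade as (out-of-range) NDVI values
valid = np.where(dn >= 251, np.nan, dn.astype(float))
ndvi = valid * (1 / 250) - 0.08
print(ndvi)  # flags become NaN, valid DNs map into [-0.08, 0.92]
```

With xarray, the equivalent would be something like `da.where(da < 251)` applied before the scaling step.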
guillaumeeb commented on 2022-07-28T13:41:30Z ----------------------------------------------------------------
Then you should probably explain this a bit, just in one or two sentences (but maybe that was foreseen).
pl-marasco commented on 2022-07-28T14:03:29Z ----------------------------------------------------------------
Yep, the notebook is an ongoing project and most of the comments are not there yet. Anyhow, thanks for pointing it out.
First time I've used ReviewNB; not convinced for now :), let's see how the discussions go.
Overall, there is probably too much in this single notebook. I think you should split it, maybe before introducing rioxarray and Zarr.
@rsignell-usgs @guillaumeeb already explained most of it. Thanks!
EOSC (European Open Science Cloud) is a federated cloud for research data to support EU science (https://eosc-portal.eu/about/eosc). There is no single provider, which can make EOSC difficult to navigate and understand. In our case, resources are provided by CESNET (https://www.cesnet.cz/cesnet/?lang=en) in the framework of the EGI-ACE European project (https://www.egi.eu/project/egi-ace/). CESNET provides cloud resources (CPU + later GPUs) with OpenStack. To facilitate deployments and promote EOSC, EGI-ACE develops a tool called IM (https://imdocs.readthedocs.io/en/stable/intro.html) that eases the access and usability of IaaS clouds by automating VMI selection, deployment, configuration, software installation, monitoring and update of virtual appliances. It supports the APIs of a large number of virtualization platforms, making user applications cloud-agnostic.
Second time I've used ReviewNB and I don't like it at all ... but I took up the challenge.
Ok, that's right ... but don't forget that I have a scientific target, and the objective is to find the balance between teaching and creating a product.
acocac commented on 2022-07-28T20:23:05Z ----------------------------------------------------------------
Not sure if you're aware but geopandas allows reading compressed files too. I'd suggest reducing the lines to:
GAUL = gpd.read_file('zip+https://mars.jrc.ec.europa.eu/asap/files/gaul1_asap.zip')
Thanks!!! I had no idea, as I'm not used to working with GeoPandas :-)
Is there a way to access the data from cloud object storage instead of download?
Hi @rsignell-usgs, yes, we finally have access to Swift and uploaded the files, and I just created a kerchunk catalogue. (Thanks @keewis for your help.) Please see the pull request https://github.com/pangeo-data/foss4g-2022/pull/10 ; looking forward to some feedback.
@pl-marasco, I concatenated your 36 files, though not in a clean way; I'll need to check with you tomorrow.
Even if both (the notebook and the package) are not fully ready, I think it could be useful for all of you to have all the components needed to start testing the notebook.