pangeo-data / foss4g-2022

Pangeo tutorial at FOSS4G 2022
https://pangeo-data.github.io/foss4g-2022

Python package and notebook update #5

Closed pl-marasco closed 2 years ago

pl-marasco commented 2 years ago

Even if both the notebook and the package are not fully ready, I think it could be useful for all of you to have all the components needed to start testing the notebook.

review-notebook-app[bot] commented 2 years ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



annefou commented 2 years ago

Great! You download GeoTIFF, right?

pl-marasco commented 2 years ago

No, the data are in NetCDF format; GeoTIFF can be downloaded through the legacy portal.

annefou commented 2 years ago

So they don't have Cloud-Optimized GeoTIFF?

I would not suggest that an end-user download a 2 GB NetCDF file to use a tiny portion of it for a study. I would tell them to use the legacy portal and order the portion they need.

pl-marasco commented 2 years ago

There is no easy answer to your question; as already mentioned, you can order GeoTIFF (not COG) through the portal, but the manifest exposes only NetCDF files.
Ordering can take up to a couple of hours (to be optimistic), which is why most users prefer to download the entire dataset and then resample it. In any case, the notebook is entirely based on a precooked subset.
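The "tiny portion of a 2 GB file" pattern above is where lazy subsetting with xarray helps; the sketch below uses a synthetic stand-in for a CGLS NDVI array (grid, extent, and variable name are illustrative, not the real product layout), but the `.sel` call is the same one you would use on a lazily opened file:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a CGLS NDVI file; with a real 2 GB file you
# would use xr.open_dataset("c_gls_NDVI_....nc", chunks={}) so that data
# outside the selection is never read into memory.
lat = np.linspace(80.0, -60.0, 141)   # 1-degree grid, descending latitudes
lon = np.linspace(-180.0, 179.0, 360)
ndvi = xr.DataArray(
    np.random.rand(lat.size, lon.size).astype("float32"),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
    name="NDVI",
)

# Select only the study area; note the descending order of the lat slice,
# matching the coordinate direction.
subset = ndvi.sel(lat=slice(50.0, 40.0), lon=slice(0.0, 10.0))
print(subset.sizes)
```

This does not remove the initial download, of course; it only makes working with the file afterwards cheap.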

review-notebook-app[bot] commented 2 years ago

View / edit / reply to this conversation on ReviewNB

acocac commented on 2022-07-22T08:25:46Z ----------------------------------------------------------------

I suggest using hvplot with sliders to choose lat and lon

pl-marasco commented 2 years ago

@acocac I'm not fully convinced. The point is to make a comparison over the same exact point, and sliders can take longer to set over two different datasets. In any case, I would love to change my mind once I have seen an example. Could you provide me a working one?

acocac commented 2 years ago

@pl-marasco you can see some examples in the hvplot documentation (see here). You could play with a variable other than lat/lon. I think it would be nice if all notebooks maximised interactive plotting where possible.

pl-marasco commented 2 years ago

Ahhh ok ... so it isn't a specific note but more of a general one. As there is already an example of hvplot with a slider on the time dimension I thought that was enough, but in any case I will add some more examples. As already mentioned, feel free to create a merge request on my fork if you have any ideas.

rsignell-usgs commented 2 years ago

Is there a way to access the data from cloud object storage instead of download?

pl-marasco commented 2 years ago

@rsignell-usgs not that I know of.

The only alternative I could imagine would be to fully rely on OpenEO. I've tested the option of making a STAC request, but unfortunately it seems there is an issue, which has been confirmed by the distributor. Moreover, even if the idea of using OpenEO isn't bad per se, not all the products are available; Long Term Statistics are not.

An option could be to mix the approaches: store the LTS directly on the EGI deployment and download the S3 NDVI time series through the VITO OpenEO backend https://openeo.vito.be .

Later on, to give everybody the possibility to run the notebook, a copy of the LTS data should be made available through Zenodo as @annefou suggested.

rsignell-usgs commented 2 years ago

@pl-marasco, the reason I asked is that I've personally experienced a few workshops that struggled when step 1 was "download data", especially on a JupyterHub or Binder hub where the local filesystems are NFS-mounted and slow. What is the infrastructure that people will be using to run the notebooks? (I apologize if this is already documented/discussed.)

I was thinking that perhaps the tutorial data could be downloaded and then put on the cloud, so that attendees could see what a cloud-based workflow looks like?

annefou commented 2 years ago

Thanks @rsignell-usgs, you are right. Downloading datasets during a training can be challenging with poor wifi. Datasets will be made available on the infrastructure.

For the infrastructure, the plan is to use the Jupyter deployment (JupyterHub + Dask cluster) we are setting up on the European Open Science Cloud.

pl-marasco commented 2 years ago

@rsignell-usgs I'm really happy that you shared your perspective and experience on this. As there seem to be lots of concerns about this, I decided to change the notebook and partially rely on the OpenEO infrastructure, as already mentioned; this will give us the possibility to select and download smaller areas, avoiding big files.

The LTS, as Anne mentioned, will be made available on the infrastructure.

Once the stress test we are conducting is over, and if you are interested, you are more than welcome to test the entire notebook directly on the infrastructure.

rsignell-usgs commented 2 years ago

@annefou I tried googling but failed. Is the European Open Science Cloud running on a commercial cloud provider, or is it running OpenStack/Ceph at an HPC center or something?

guillaumeeb commented 2 years ago

@rsignell-usgs currently, things are deployed on EGI, which makes the infrastructure question much clearer. So this is closer to the second option: a federation of resources from European data centers, often academic HPC facilities with some OpenStack resources on them.

EOSC is a bit blurry to me; it is kind of an EGI 2.0, with more resources and probably a commercial cloud component, but I'm really not sure. Maybe @annefou knows more about it.

review-notebook-app[bot] commented 2 years ago

View / edit / reply to this conversation on ReviewNB

guillaumeeb commented on 2022-07-27T07:08:13Z ----------------------------------------------------------------

One of the interests of the Pangeo stack is being able to browse big datasets without processing everything unless needed (or processing everything in parallel), but here we use OpenEO to filter the request and work on a small subset of the data. It's still interesting to see the point of OpenEO vs Pangeo.


pl-marasco commented on 2022-07-27T08:22:40Z ----------------------------------------------------------------

This is not a specific answer to your comment, but just to sum up the situation.

My original thought was to avoid OpenEO and leverage the STAC + Pangeo stack. Unfortunately, I ran into lots of problems:

  • As there is no way to make a REST request for data on the CGLS, the only option I could imagine is a simple facilitator that leverages the manifest made available by VITO.
  • Files from the CGLS are roughly 2 GB each; almost all the participants in this discussion suggested avoiding the nightmare of downloading them live over low-speed WiFi.
  • There is no cloud object storage for these datasets; none of AWS, GCP, Planetary Computer, or others has ingested them (most probably you can only find them on a DIAS). I can give you more details if you need them. It would be useful if a pool of users requested the ingestion from one of the big players.
  • Copernicus Global Land Service has no STAC catalog; moreover, the STAC compatibility of the OpenEO version isn't properly set up, so the requests fail. I sent an email asking for clarification and an update, and the person responsible answered that they will address this in the next tender.
  • The only way to get data from the CGLS over a specific AOI is to make an order on the legacy portal. Right now requests are processed quickly, but I can't imagine it with 30 simultaneous ones. Moreover, there is no option to receive NetCDF format, only GeoTIFF; this means you will get a single file per band you ordered (from my perspective a nightmare that I would never want to teach as a correct approach).
  • I love all the concerns that you folks have about the CGLS distribution. It's been 5 years that I've been expressing my opinion on it, and at least now I feel a little less alone.
_guillaumeeb commented on 2022-07-28T13:35:30Z_ ----------------------------------------------------------------
almost all the participants in this discussion suggested avoiding the nightmare of downloading them live through low-speed WiFi.

The idea would be to download it from the CESNET infrastructure, isn't it? But I agree it is still bad to download it live with dozens of users.

There is no cloud object storage for these datasets

Let's make one on CESNET with a subset!

Moreover there is no option to have in NetCDF format, only in GeoTiff; this means that you will get a single file per each band you ordered (from my perspective is a nightmare that I would never like to teach as a correct approach)

And you could have several bands in one GeoTIFF. But anyway, this means that if we download a (big) subset, we'll have to rework the data to make it analysis-ready?
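That rework step can be sketched with xarray. The band names below are illustrative, and synthetic arrays stand in for the single-band GeoTIFFs the legacy portal delivers; in practice each array would come from something like `rioxarray.open_rasterio("<band>.tif").squeeze()`.

```python
import numpy as np
import xarray as xr

# Shared grid for all bands (stand-in for the GeoTIFF georeferencing).
coords = {"lat": np.linspace(46.0, 45.0, 10), "lon": np.linspace(5.0, 6.0, 10)}

# One DataArray per ordered band; with the portal's output, each of these
# would be read from its own single-band GeoTIFF file.
bands = {
    name: xr.DataArray(np.random.rand(10, 10).astype("float32"),
                       coords=coords, dims=("lat", "lon"))
    for name in ("NDVI", "NDVI_unc", "NOBS")  # illustrative band names
}

# Merge the per-band arrays into one analysis-ready Dataset.
ds = xr.Dataset(bands)
print(list(ds.data_vars))
```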

pl-marasco commented on 2022-07-28T14:39:43Z ----------------------------------------------------------------

The idea would be to download it from the CESNET infrastructure, isn't it? But I agree it is still bad to download it live with dozens of users.

No, data are not available from CESNET so we have to rely on VITO

 Let's make one on CESNET with a subset!

That was my first suggestion, and then we opted to make the notebook more usable and let users download the data.

And you could have several bands in one GeoTiff.

Nope, if you order data from VITO you can't, as they will split the bands into different files.

But anyway, this means if we download a (big) subset, we'll have to rework the data to make it analysis ready?

Yep; in any case this will be true for the Long Term Statistics, which are converted to Zarr format.

_guillaumeeb commented on 2022-07-28T16:48:31Z_ ----------------------------------------------------------------
No, data are not available from CESNET so we have to rely on VITO

Yeah, I meant that users should be on the CESNET infrastructure when they download data, so hopefully they have good bandwidth.

That was my first suggestion, and then we opted to make the notebook more usable and let users download the data.

It is great to show these steps. But as we are talking about scaling, we might need a pre-download.

Nope, if you order data from VITO you can't as they will split bands in different files.

Right, I didn't mean to say you can get it that way from VITO, but that it would be possible in general; that was unclear.