multiply-org / prior-engine

GNU General Public License v3.0

Make use of Data Access Component #27

Open TonioF opened 6 years ago

TonioF commented 6 years ago

Here is my idea of how the Data Access Component could be integrated: The Prior Engine will have its own Data Access Component, which is decoupled from the rest of the system. It deals only with (a) the prior .vrt files, (b) the .tiff or other files that form the input to the .vrt files, and (c) any required aux data. For (a) and (b), a data store will be used which will also be available to the orchestrator. The Data Access Component will be configured to find the aux data required by the prior engine (or by its prior creators, to be exact). This data may not be available locally from the beginning, so it might need to be downloaded. All this data will be stored in the .multiply folder in the user's home directory.

The workflow is this: The Prior Engine will be asked for a prior file for a variable that covers some spatial and temporal extent. The Data Access Component will check whether a prior (.vrt) file exists that meets these requirements. If so, it will be returned. If not (or if the user wishes to use their own dedicated aux data files), the prior engine will be triggered to compute such a prior file. After this has been done, the file (and, if necessary, the files it references) will be permanently stored by the Data Access Component for future use.

This could be done by writing a wrapper script around the prior engine or even by integrating it into the prior engine module itself. An open question is how to design this so that the prior creators get the aux data they need from the Data Access Component.
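To make the proposed workflow concrete, here is a minimal Python sketch of the "return an existing prior, otherwise compute and store it" logic. All names (`PriorEntry`, `PriorStore`, `get_prior`, the `compute` callback) are hypothetical and are not part of the prior engine or the Data Access Component. The in-memory catalogue only stands in for whatever data store would actually be used.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# (min_lon, min_lat, max_lon, max_lat); this layout is an assumption
BBox = Tuple[float, float, float, float]


@dataclass
class PriorEntry:
    """Hypothetical catalogue record for one stored prior .vrt file."""
    variable: str
    bbox: BBox
    start: date
    end: date
    path: Path

    def covers(self, variable: str, bbox: BBox, start: date, end: date) -> bool:
        """True if this prior fully contains the requested extent and period."""
        return (
            self.variable == variable
            and self.bbox[0] <= bbox[0] and self.bbox[1] <= bbox[1]
            and self.bbox[2] >= bbox[2] and self.bbox[3] >= bbox[3]
            and self.start <= start and self.end >= end
        )


class PriorStore:
    """Hypothetical stand-in for the Prior Engine's Data Access Component."""

    def __init__(self, root: Path):
        self.root = root  # e.g. Path.home() / ".multiply"
        self.entries: List[PriorEntry] = []

    def find(self, variable: str, bbox: BBox, start: date, end: date) -> Optional[Path]:
        for entry in self.entries:
            if entry.covers(variable, bbox, start, end):
                return entry.path
        return None

    def put(self, entry: PriorEntry) -> None:
        self.entries.append(entry)


def get_prior(store: PriorStore,
              compute: Callable[[str, BBox, date, date, Path], Path],
              variable: str, bbox: BBox, start: date, end: date,
              force_compute: bool = False) -> Path:
    """Return a stored prior if one matches, otherwise compute and register it."""
    if not force_compute:  # force_compute models "user brings own aux data"
        existing = store.find(variable, bbox, start, end)
        if existing is not None:
            return existing
    path = compute(variable, bbox, start, end, store.root)
    store.put(PriorEntry(variable, bbox, start, end, path))
    return path
```

In this sketch the wrapper (`get_prior`) sits outside the prior engine, which corresponds to the "wrapper script" option. The `compute` callback is where the prior engine itself would be called.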

tramsauer commented 5 years ago

To follow up on this: for soil (moisture) priors I would need access to:

SMAP data:

currently accessed via

wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies --keep-session-cookies --no-check-certificate --auth-no-challenge=on -r --reject "index.html*" -np -e robots=off -i download.txt

with the password etc set up as described here.

and the download.txt file only containing the links to the single files, e.g.

https://n5eil01u.ecs.nsidc.org/SMAP/SPL4SMAU.004/2015.03.31/SMAP_L4_SM_aup_20150331T030000_Vv4030_001.h5

which can easily be adjusted with the corresponding date information.

This is potentially the same login procedure as in the MODIS data access component (Earthdata Login). Is the MODIS data access component currently working? (I have not tried it myself; however, it references modules of multiply-core that are not available.)

ESA CCI Climatology

I would 'ship' these smallish data sets with the engine.
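As a rough illustration of adjusting the SMAP links by date, the following Python sketch builds the URLs for `download.txt` from timestamps. The URL pattern is taken from the single example link above. The 3-hourly step and the fixed `Vv4030_001` suffix are assumptions that may not hold for every date, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

# Base path copied from the example link above
BASE_URL = "https://n5eil01u.ecs.nsidc.org/SMAP/SPL4SMAU.004"


def smap_url(timestamp: datetime, version: str = "Vv4030_001") -> str:
    """Build one SPL4SMAU link; the version suffix is an assumption."""
    folder = timestamp.strftime("%Y.%m.%d")
    stamp = timestamp.strftime("%Y%m%dT%H%M%S")
    return f"{BASE_URL}/{folder}/SMAP_L4_SM_aup_{stamp}_{version}.h5"


def smap_urls(start: datetime, end: datetime,
              step: timedelta = timedelta(hours=3)) -> List[str]:
    """List links from start to end (inclusive); the 3 h step is assumed."""
    urls = []
    current = start
    while current <= end:
        urls.append(smap_url(current))
        current += step
    return urls


def write_download_list(urls: List[str], path: str = "download.txt") -> None:
    """Write one link per line, as expected by `wget -i download.txt`."""
    with open(path, "w") as handle:
        handle.write("\n".join(urls) + "\n")
```

Calling `write_download_list(smap_urls(datetime(2015, 3, 31), datetime(2015, 3, 31, 21)))` would then produce the input file for the `wget` call shown above.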


So the questions:

What do you think, @TonioF? Should I try to adjust the lpdaac version to my needs and make a PR?

TonioF commented 5 years ago

Hi, the MODIS data access is working (of course, only if the NASA servers are not down; in the past, we had some issues with that). However, the access is tailored to particular data types, currently only MCD43A1.006 and MCD15A2H.006. The latter was added because it was required for another prior. I can extend the component to access a new data type.

When should the download and pre-processing of the SMAP tiles occur (relatively time consuming)?

What does "relatively" mean? Can we include it in a workflow run, or is this not feasible? My feeling is that we can include it, but you know this better. The Data Access and the Prior Creation are decoupled. We would tell the Data Access which data it needs to download, then we would tell the Prior Engine where the data is stored. We could integrate the Data Access into the Prior Engine, but this would add complexity to the Prior Engine, which you probably don't want.
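The decoupling described here could look roughly like the following Python sketch: one step makes the data available and reports where it lives, and a second step passes only those paths on to the Prior Engine. Both functions and the config layout are hypothetical, and the actual download is left out.

```python
from pathlib import Path
from typing import Dict, List


def fetch_aux_data(data_types: List[str], target_dir: Path) -> Dict[str, Path]:
    """Hypothetical Data Access step: return a local directory per data type."""
    locations = {}
    for data_type in data_types:
        local_dir = target_dir / data_type
        local_dir.mkdir(parents=True, exist_ok=True)
        # A real implementation would download any missing files here.
        locations[data_type] = local_dir
    return locations


def build_prior_config(variable: str, aux_locations: Dict[str, Path]) -> dict:
    """Hypothetical Prior Engine input: only paths, no download logic."""
    return {
        "variable": variable,
        "aux_data": {name: str(path) for name, path in aux_locations.items()},
    }
```

With this split, the Prior Engine never needs Earthdata credentials or download code. It only reads from the directories it is told about.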

Are there other, better options? Accessing such a repo remotely via VRT data access?

That is an option and we can consider it. However, the VRT Data Access is best suited for temporally static data such as DEMs. For data with different time stamps we would need to apply a few changes, so I think it is probably not the way to go.

Should I try to adjust the lpdaac version to my needs and make a PR?

You can do so ... ideally, you don't have to make a lot of changes. I suggest having a Skype session about this so we can see what the best way to move forward is.