Closed adriaat closed 6 years ago
These are new features being developed by the CSIRO contributors to ODC, @petewa and @rtaib. We are currently working on an AWS-based ODC execution engine to support large-scale parallel processing, which directly exploits the AWS S3 storage driver. Documentation is a little sparse at the moment, but the S3 storage driver is a drop-in replacement for the default file-based driver. There's an optional flag on the relevant API calls (like datacube.load, the ingester, etc.) which tells the system to use the S3 driver. The storage driver handling is documented here, but I suspect you already found that. We are working on a technical paper about the driver which should be available in January.
I will let @petewa know about this thread if he hasn't already spotted it. Let us know your specific questions here and we'll respond.
Hi @adriaat,
As @woodcockr mentions, documentation is sparse and is something we need to improve.
The S3 driver extension serves 2 core purposes:
In more detail:
Things in store for the future:
S3 Ingestion:
I am assuming you are already familiar with the ingest process using the default NetCDF driver.
I'll cover the differences when ingesting to S3:
[default]
aws_access_key_id = <Access key ID>
aws_secret_access_key = <Secret access key>
datacube -v system init -s3
docs/config_samples/ingester/ls5_nbar_albers_s3.yaml
Settings which you will need to change/add:
container: bucket_name (must exist prior to ingest)
storage:
driver: s3 (select s3 driver)
datacube --driver s3 -v ingest -c ~/yamls/ls5_nbar_albers_s3.yaml --executor multiproc 8
Usage:
Usage is the same, except that you can now use the use_threads option in datacube.load(). This will parallelise the data access from S3.
e.g.
import datacube
dc = datacube.Datacube()
nbar = dc.load(product='ls5_nbar_albers', x=(149, 150), y=(-35, -36), use_threads=True)
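For intuition about what threaded S3 access buys you, here is a conceptual sketch (not the ODC driver code; fetch_chunk and the key names are made up) of how concurrent GETs of independent S3 objects overlap network latency:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(key):
    # Stand-in for an S3 GET of one chunk; in a real driver each chunk
    # is an independent object, so requests can be issued concurrently.
    return key, b"data"

# Hypothetical object keys, one per chunk / time slice.
keys = [f"chunk_{t}" for t in range(8)]

# Threads help here even under the GIL, because the work is I/O-bound:
# while one GET waits on the network, the others proceed.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(fetch_chunk, keys))
```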
Hi,
Many thanks for your help. I am working with @adriaat on OpenDataCube and I am experiencing the same issue of configuring datacube with S3.
When I run the command: datacube -v system init --create-s3-tables
I got the following message:
2017-11-17 17:52:21,845 3828 datacube INFO Running datacube command: /home/ubuntu/Datacube/datacube_env/bin/datacube -v system init -s3
2017-11-17 17:52:21,855 3828 DriverManager INFO Import Failed for Driver plugin "S3TestDriver", skipping.
2017-11-17 17:52:21,856 3828 DriverManager INFO Import Failed for Driver plugin "S3Driver", skipping.
Initialising database...
2017-11-17 17:52:21,862 3828 datacube.index.postgres.tables._core INFO Ensuring user roles.
2017-11-17 17:52:21,865 3828 datacube.index.postgres.tables._core INFO Adding role grants.
Updated.
Checking indexes/views.
2017-11-17 17:52:21,871 3828 datacube.index.postgres._api INFO Checking dynamic views/indexes. (rebuild views=True, indexes=False)
Done.
So the new tables in Postgres for S3 cannot be created. The AWS credentials are properly set in the .aws/credentials file. I am working on the develop branch and the Datacube version is 1.5.1+218.g5f8507c. The same error happens in the CSIRO branch.
Thanks, Davide
Hi Davide,
The import failed message is not worded very well; you can safely ignore it. Non-default drivers are skipped during initialization because they are not required. In this particular case, they are skipped because the s3 tables are not present before initialization. After you execute 'datacube -v system init -s3', the tables should be properly created and initialized.
You can confirm by checking the following tables are created:
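The table names in question are given further down this thread (s3_dataset, s3_dataset_chunk, s3_dataset_mapping). A generic way to confirm they exist is to query the database catalogue; here is a self-contained sketch using SQLite purely for illustration (ODC itself runs on PostgreSQL, where you would query information_schema.tables instead):

```python
import sqlite3

# The three s3 tables named later in this thread.
expected = {"s3_dataset", "s3_dataset_chunk", "s3_dataset_mapping"}

conn = sqlite3.connect(":memory:")
for name in expected:
    # Stand-in for the tables that 'datacube -v system init -s3' creates.
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY)")

present = {row[0] for row in
           conn.execute("SELECT name FROM sqlite_master WHERE type='table'")}
missing = expected - present  # an empty set means all tables exist
```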
Thanks, Peter
Hi @petewa ,
I double checked, but unfortunately there are no new tables in the database. The only tables in the agdc schema are:
Thanks, Davide
Hi @davidedelerma, could you please try the 'datacube -v system init -s3' command with a fresh database?
The init script may not be set up to create the tables in an already initialised agdc schema.
Hi @petewa , Thanks. I dropped the database and initialized it again using the "s3" option. Now I have the three extra tables. Then I added a product (Landsat 8) and indexed 9 Landsat 8 L2 images referring to the same footprint. After that I modified the configuration file for ingestion using:
container: 'rheaodc'
storage:
driver: s3
but when I run the command:
datacube --driver s3 -v ingest -c ~/ls8_collections_sr_general_s3.yaml
I get the error:
2017-11-21 13:12:41,883 1935 datacube INFO Running datacube command: /home/ubuntu/Datacube/datacube_env/bin/datacube --driver s3 -v ingest -c /home/ubuntu/ls8_collections_sr_general_s3.yaml
2017-11-21 13:12:41,897 1935 DriverManager INFO Import Failed for Driver plugin "S3TestDriver", skipping.
2017-11-21 13:12:41,898 1935 DriverManager INFO Import Failed for Driver plugin "S3Driver", skipping.
Traceback (most recent call last):
File "/home/ubuntu/Datacube/datacube_env/bin/datacube", line 11, in <module>
load_entry_point('datacube', 'console_scripts', 'datacube')()
File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/Datacube/agdc-v2/datacube/ui/click.py", line 199, in new_func
return f(parsed_config, *args, **kwargs)
File "/home/ubuntu/Datacube/agdc-v2/datacube/ui/click.py", line 226, in with_driver_manager
validate_connection=expect_initialised) as driver_manager:
File "/home/ubuntu/Datacube/agdc-v2/datacube/drivers/manager.py", line 78, in __init__
self.set_current_driver(default_driver_name or self._DEFAULT_DRIVER)
File "/home/ubuntu/Datacube/agdc-v2/datacube/drivers/manager.py", line 198, in set_current_driver
driver_name, ', '.join(self.__drivers.keys())))
ValueError: Default driver "s3" is not available in NetCDF CF
Thanks, Davide
Here is the full configuration file for ingestion:
source_type: ls8_collections_sr_scene
output_type: ls8_lasrc_general_s3
description: Landsat 8 USGS Collection 1 Higher Level SR scene processed using LaSRC. Resampled to 30m EPSG:4326 projection with a sub-degree tile size.
location: '/datacube/ingested_data'
file_path_template: 'LS8_OLI_LaSRC/General/LS8_OLI_LaSRC_4326_{tile_index[0]}_{tile_index[1]}_{start_time}.nc'
global_attributes:
title: CEOS Data Cube Landsat Surface Reflectance
summary: Landsat 8 Operational Land Imager ARD prepared by NASA on behalf of CEOS.
source: LaSRC surface reflectance product prepared using USGS Collection 1 data.
institution: CEOS
instrument: OLI_TIRS
cdm_data_type: Grid
keywords: AU/GA,NASA/GSFC/SED/ESD/LANDSAT,REFLECTANCE,ETM+,TM,OLI,EARTH SCIENCE
keywords_vocabulary: GCMD
platform: LANDSAT_8
processing_level: L2
product_version: '2.0.0'
product_suite: USGS Landsat Collection 1
project: CEOS
coverage_content_type: physicalMeasurement
references: http://dx.doi.org/10.3334/ORNLDAAC/1146
license: https://creativecommons.org/licenses/by/4.0/
naming_authority: gov.usgs
acknowledgment: Landsat data is provided by the United States Geological Survey (USGS).
container: 'rheaodc'
storage:
driver: s3
crs: EPSG:4326
tile_size:
longitude: 0.943231048326
latitude: 0.943231048326
resolution:
longitude: 0.000269494585236
latitude: -0.000269494585236
chunking:
longitude: 200
latitude: 200
time: 1
dimension_order: ['time', 'latitude', 'longitude']
measurements:
- name: coastal_aerosol
dtype: int16
nodata: -9999
resampling_method: nearest
src_varname: 'sr_band1'
zlib: True
attrs:
long_name: "Surface Reflectance 0.43-0.45 microns (Coastal Aerosol)"
alias: "band_1"
- name: blue
dtype: int16
nodata: -9999
resampling_method: nearest
src_varname: 'sr_band2'
zlib: True
attrs:
long_name: "Surface Reflectance 0.45-0.51 microns (Blue)"
alias: "band_2"
- name: green
dtype: int16
nodata: -9999
resampling_method: nearest
src_varname: 'sr_band3'
zlib: True
attrs:
long_name: "Surface Reflectance 0.53-0.59 microns (Green)"
alias: "band_3"
- name: red
dtype: int16
nodata: -9999
resampling_method: nearest
src_varname: 'sr_band4'
zlib: True
attrs:
long_name: "Surface Reflectance 0.64-0.67 microns (Red)"
alias: "band_4"
- name: nir
dtype: int16
nodata: -9999
resampling_method: nearest
src_varname: 'sr_band5'
zlib: True
attrs:
long_name: "Surface Reflectance 0.85-0.88 microns (Near Infrared)"
alias: "band_5"
- name: swir1
dtype: int16
nodata: -9999
resampling_method: nearest
src_varname: 'sr_band6'
zlib: True
attrs:
long_name: "Surface Reflectance 1.57-1.65 microns (Short-wave Infrared)"
alias: "band_6"
- name: swir2
dtype: int16
nodata: -9999
resampling_method: nearest
src_varname: 'sr_band7'
zlib: True
attrs:
long_name: "Surface Reflectance 2.11-2.29 microns (Short-wave Infrared)"
alias: "band_7"
- name: 'pixel_qa'
dtype: int32
nodata: 1
resampling_method: nearest
src_varname: 'pixel_qa'
zlib: True
attrs:
long_name: "Pixel Quality Attributes Bit Index"
alias: [pixel_qa]
- name: 'aerosol_qa'
dtype: uint8
nodata: 0
resampling_method: nearest
src_varname: 'sr_aerosol'
zlib: True
attrs:
long_name: "Aerosol Quality Attributes Bit Index"
alias: [sr_aerosol_qa, sr_aerosol]
- name: 'radsat_qa'
dtype: int32
nodata: 1
resampling_method: nearest
src_varname: 'radsat_qa'
zlib: True
attrs:
long_name: "Radiometric Saturation Quality Attributes Bit Index"
alias: [radsat_qa]
- name: 'solar_azimuth'
dtype: int16
nodata: -32768
resampling_method: nearest
src_varname: 'solar_azimuth_band4'
zlib: True
attrs:
long_name: "Solar Azimuth Angle for Band 4"
alias: [solar_azimuth_band4]
- name: 'solar_zenith'
dtype: int16
nodata: -32768
resampling_method: nearest
src_varname: 'solar_zenith_band4'
zlib: True
attrs:
long_name: "Solar Zenith Angle for Band 4"
alias: [solar_zenith_band4]
- name: 'sensor_azimuth'
dtype: int16
nodata: -32768
resampling_method: nearest
src_varname: 'sensor_azimuth_band4'
zlib: True
attrs:
long_name: "Sensor Azimuth Angle for Band 4"
alias: [sensor_azimuth_band4]
- name: 'sensor_zenith'
dtype: int16
nodata: -32768
resampling_method: nearest
src_varname: 'sensor_zenith_band4'
zlib: True
attrs:
long_name: "Sensor Zenith Angle for Band 4"
alias: [sensor_zenith_band4]
Hi @davidedelerma,
I made a mistake in my earlier post. "Import Failed for Driver plugin "S3Driver", skipping." should not be happening; it means there was an import error when loading the S3Driver.
In my earlier message I misread it as "Driver plugin "S3Driver" failed requirements check, skipping", which is different.
It looks like you are on EC2 with a flavour of Ubuntu. What version of Ubuntu? I've been using 'Ubuntu 16.04.1 LTS' personally. Can you please post the output of pip freeze?
The following should give you the proper environment:
conda config --prepend channels conda-forge
conda config --prepend channels conda-forge/label/dev
conda create -n agdc python=3.6
conda env update -n agdc --file datacube-core/.travis/environment.yaml
source activate agdc
Greetings @davidedelerma
How did you go?
Hello all,
@davidedelerma and I have been working on it. We both use Ubuntu 16.04.03 LTS. We created a new environment following @petewa's commands from a previous comment. Using this environment, we ingested a single time slice of Landsat 8 data into an S3 bucket. There was no error in the ingestion. We loaded this data with use_threads=False and we did not have any problem: the data is correctly loaded, even though it is slow. Surprisingly, the instance we used had 4 vCPUs, and all 4 of them were being used in the ingestion process. On the other hand, we also tried loading the data with use_threads=True. In this case, the Jupyter notebook never finishes loading the data (at least within a reasonable amount of waiting time), and the consumption of the vCPUs is nearly 0.
In the same environment, we ingested two time slices of Landsat 8 data. We loaded these data using use_threads=False and use_threads=True, and in both cases we were able to load it. Afterwards, we ingested more data until we had in total 9 time slices of Landsat 8 data. We loaded a region of interest of these 9 time slices, due to limitations with memory, and we did not have any problem either.
We tested all of the above with both develop and csiro/execution-engine branches, where the two yield datacube version 1.5.1.
At this point, we ask ourselves: does use_threads actually parallelise the data access from S3? Since, when we used use_threads=False, we saw that all our vCPUs were being used.
Elaborating on the point 2, we did a small benchmark of the elapsed time when loading the ingested data for a region of interest for the 9 time slices, when ingested into S3 using the s3 driver and when ingested locally, also testing the develop and csiro/execution-engine branches. The results show the average time over 3 runs.
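The timings we report are averages over 3 runs; a minimal version of the harness we used looks like the following (load_roi is a placeholder standing in for the dc.load() call over the region of interest):

```python
import time

def load_roi():
    # Placeholder for dc.load(...) over the region of interest.
    time.sleep(0.01)

runs = []
for _ in range(3):
    start = time.perf_counter()
    load_roi()
    runs.append(time.perf_counter() - start)

mean_elapsed = sum(runs) / len(runs)  # average elapsed time over 3 runs
```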
| | Branch develop | Branch csiro/execution-engine |
|---|---|---|
| In S3 with s3 driver | use_threads=False: 33.24 s, use_threads=True: 25.29 s | use_threads=False: 83.08 s, use_threads=True: 72.64 s |
| In EC2 | use_threads=False: 6.71 s, use_threads=True: 6.45 s | use_threads=False: 6.63 s, use_threads=True: 6.22 s |
We expected some problem if we loaded the ingested data in EC2 using use_threads=True, given a previous comment. However, it works and it is faster! On the other hand, we are also surprised to see that the csiro/execution-engine branch is faster when loading the ingested data in EC2 but much slower when the data is in S3.
There is something else that may be worth sharing. The environment we were using before was the one created by following the installation guide for Open Data Cube on the www.opendatacube.org website: Data Cube Installation Guide. However, when using the environment created with the environment file datacube-core/.travis/environment.yaml, there is an error with the postgres database that we did not have before:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
Before, we did not have a db_hostname in the .datacube.conf file. To solve this error, we had to set either db_hostname: localhost or db_hostname: /var/run/postgresql in the .datacube.conf file.
One thing that comes to mind is latency and compute performance. There are a bunch of reasons, the main one being that AWS architects tend to steer us away from using S3 as a file system, e.g. s3fs.
use_threads controls the threading on retrieving multiple time slices and stitching datasets. This is currently a compromise. Having said that, the underlying S3IO library is threaded by default.
Regarding your benchmarks, with a 200x200 chunking size, I'm not surprised with the results you are seeing.
Try these parameters for S3 as a starting point and tweak from there.
Example LS5 ingest config. https://github.com/opendatacube/datacube-core/blob/develop/docs/config_samples/ingester/ls5_nbar_albers_s3.yaml
I aim for objects around ~10-20 MB in size. You can look in your S3 bucket to see this, post ingest.
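A quick back-of-the-envelope check (my own arithmetic, ignoring compression and assuming a single-band int16 chunk per object) shows why 200x200 chunks fall far short of that 10-20 MB target, while something like 2000x2000 gets close:

```python
def chunk_bytes(nx, ny, nt, itemsize):
    # Uncompressed size of one stored object: pixel count * bytes per pixel.
    return nx * ny * nt * itemsize

small = chunk_bytes(200, 200, 1, 2)     # int16 -> 80,000 B, about 0.08 MB
bigger = chunk_bytes(2000, 2000, 1, 2)  # int16 -> 8,000,000 B, about 8 MB
```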
use_threads should not make a difference when ingesting to EBS with the default netcdf driver. use_threads should make a big difference when ingesting to EBS with the s3-file driver.
The csiro/execution-engine branch tries to balance performance with memory consumption. It scales the number of threads w.r.t. the number of cores. It's something like: threads = ncores*2 for stitching/stacking; processes = ncores for pulling data from S3.
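As I read that scaling rule (this is a sketch of the formula only, not the branch's actual code), a 4-core instance would get:

```python
def worker_counts(ncores):
    # The rule of thumb quoted above: twice the cores for
    # stitching/stacking threads, one process per core for S3 pulls.
    return {"threads": ncores * 2, "processes": ncores}

counts = worker_counts(4)  # e.g. a 4-core instance
```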
Hello,
Thank you for your reply @petewa. I played a little bit with the suggested changes and with the two branches (develop and csiro/execution-engine), producing benchmarks to qualify and quantify the results. Since there are just too many results to post here, I give an overall impression of them. The comments below refer to loading times of 9 Landsat 8 time slices, sharing the comments for both branches (unless specified otherwise), on an Amazon EC2 t2.xlarge instance (vCPU: 4, memory: 16 GiB). Also, please note that when I refer to "tile size" I am referring to the ratio "tile size/resolution" (I did not change the resolution), and when I refer to a chunking parameter, the chunks are square (therefore a chunking parameter of 200 stands for 200x200).
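To make the "tile size/resolution" ratio concrete, here is the calculation (mine, using the values from the ingest config posted earlier in this thread):

```python
# Values from the ingest config posted earlier in the thread.
tile_size = 0.943231048326        # degrees per tile side
resolution = 0.000269494585236    # degrees per pixel

pixels_per_tile = tile_size / resolution  # about 3500 pixels per tile side
```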
- Changing the chunking parameter from 200 to 2000: the loading time is much faster (40%-85% faster!), especially if use_threads=True.
- Changing the tile size from 3500 to 8000: it is faster, but much slower than only changing the chunking parameter from 200 to 2000.
- Changing the tile size from 3500 to 8000 and the chunking parameter from 200 to 2000: it is faster, but the loading times are similar to only changing the chunking parameter, as if the tile size did not have an enhancing effect.
- use_threads with either the default NetCDF driver or the S3-file driver.
The way I ingest to the EBS with the S3-file driver is the following: I set driver: s3 in the storage: list, i.e.
storage:
driver: s3
crs: EPSG:4326
tile_size:
longitude: 0.943231048326
latitude: 0.943231048326
resolution:
longitude: 0.000269494585236
latitude: -0.000269494585236
chunking:
longitude: 200
latitude: 200
time: 1
dimension_order: ['time', 'latitude', 'longitude']
and then I run datacube -v ingest -c ~/ls8_test.yaml. If I run datacube --driver s3 -v ingest -c ~/ls8_test.yaml, I get the error datacube.utils.DatacubeException: Missing 'container' parameter, cannot write to storage.
I understand that, if I set the container parameter, the data will be ingested into the S3 bucket rather than into the EBS. Am I right?
- The branches develop and csiro/execution-engine are approximately equal timewise, except when the chunking parameter is set small, i.e. 200, regardless of the tile size. In these cases, the differences are like the ones in the table of my previous comment.
- The fastest loading is with use_threads=True and chunking set to 2000, regardless of the tile size. It is also quicker to load data chunked to 2000 with use_threads=False than data chunked to 200 with use_threads=True.
Now everything makes more sense. @petewa says in the previous comment that use_threads should make a big difference when ingesting to EBS with the S3-file driver, in contrast with the default NetCDF driver. However, I do not experience this behaviour, and it may come from my side. Did I ingest the files correctly to the EBS using the S3 driver? I only have NetCDF files in the ingestion directory specified for the S3-file driver; is that OK?
Thanks for benchmarking and providing such a nice summary @adriaat! I think @petewa may be travelling right now, so let me answer your question about s3-file ingestion: yes, you need to define the container field in the config file, the same way you would for normal s3 ingestion, the idea being that a container is a driver-independent location for data. What you put there is the absolute path where the data will be stored. Set this very carefully if running as root.
Then, you definitely need the --driver s3 flag during ingestion. By doing so, you should NOT see any NetCDF files in the ingestion directory as you do now (when the flag is not specified, it will default to NetCDF). That should also change your ingestion timings, hopefully for the better.
Hi @adriaat, I'm back in the office now.
Performance tips:
I noticed the way you ingest to EBS with the s3-file driver is incorrect, which is probably why you are still seeing NetCDF files.
The driver should be s3-test.
E.g.
container: '/home/ubuntu/data/output'
storage:
driver: s3-test
The container parameter must be set for s3 or s3-file ingest.
I did some performance tests for a first attempt at balancing performance vs memory. I'm sharing this as it's a good indication of real-life performance. As you can see, performance varies quite a bit depending on the instance type and threading parameters.
These numbers are using the example ingest yaml, which has not been tuned and results in a small object size of 1-5MB. Performance should be better with a slightly bigger object size around 10-20MB.
What you don't see from the numbers below is that they are pulling back all bands, not just one.
=================================
(time: 12, x: 421, y: 490) 30MB
nbar = dc.load(product='ls5_nbar_albers', x=(149.25, 149.35), y=(-35.25, -35.35), use_threads=True)
c4.large s3 4.2s
c4.xlarge s3 2.3s
c4.2xlarge s3 1.3s
c4.4xlarge s3 0.9s
c4.8xlarge s3 0.9s
Read performance from s3 varies from 0.9s to 4.2s.
=================================
(12, 840, 980) 119MB
nbar = dc.load(product='ls5_nbar_albers', x=(149.15, 149.35), y=(-35.15, -35.35))
netcdf 2.641545412540436
s3 2.600525255203247
c4.large s3 9.8s
c4.xlarge s3 5.5s
c4.2xlarge s3 2.5s
c4.4xlarge s3 1.7s
c4.8xlarge s3 1.4s
Read performance from s3 varies from 1.4s to 9.8s.
=================================
(time: 12, x: 1680, y: 1959) 500MB
nbar = dc.load(product='ls5_nbar_albers', x=(149.15, 149.55), y=(-35.15, -35.55), use_threads=True, loops=100)
netcdf 7.269922971725464
s3 4.752055556774139
s3-file 2.585359718799591
c4.large s3 22s
c4.xlarge s3 11.4s
c4.2xlarge s3 5.6s
c4.4xlarge s3 3.1s
c4.8xlarge s3 2.6s
Read performance from s3 varies from 2.6s to 22s. The performance variation is purely down to the number of IO streams possible.
=================================
(time: 12, x: 3778, y: 4408) 1.2GB
nbar = dc.load(product='ls5_nbar_albers', x=(149.05, 149.95), y=(-35.05, -35.95), use_threads=True)
c4.large s3 out of memory
c4.xlarge s3 out of memory
c4.2xlarge s3 11.5s
c4.4xlarge s3 6.6s
c4.8xlarge s3 5.2s
r4.16xlarge s3 7s
=================================
(12, 840, 980) 119MB
nbar = dc.load(product='ls5_nbar_albers', x=(149.15, 149.35), y=(-35.15, -35.35))
c4.xlarge 4 cores: 1.25GB 8/16 6.7s, 0.88GB 8/8 6.1s, 0.86GB 16/8 5.5s, 0.60GB 16/4 6.5s, 0.61GB 32/4 6.5s, 1.51GB 32/30 5.6s
c4.2xlarge 8 cores: 1.57GB 32/30 2.9s, 0.87GB 32/8 3.8s, 1.28GB 32/16 2.8s, 1.22GB 16/16 2.7s, 0.90GB 16/8 3.2s, 1.36GB 8/16 3.7s
Here you can see the different numbers of threads/cores and the effect they have on memory consumption (0.60 - 1.57GB) and read performance (2.7s - 6.7s) on c4.xlarge and c4.2xlarge.
(time: 12, x: 1680, y: 1959) 500MB
nbar = dc.load(product='ls5_nbar_albers', x=(149.15, 149.55), y=(-35.15, -35.55), use_threads=True, loops=100)
c4.2xlarge 8 cores: 1.66GB 8/16 6.8s, 1.6GB 16/16 5.6s, 1.21GB 16/8 6.3s, 1.19GB 32/8 6.3s, 1.62GB 32/16 5.6s
(time: 12, x: 3778, y: 4408) 1.2GB
nbar = dc.load(product='ls5_nbar_albers', x=(149.05, 149.95), y=(-35.05, -35.95), use_threads=True)
(thread counts given as 2xcores/1xcores)
c4.8xlarge 36 cores: 7.44GB 64/64 5.3s, 7.6GB 32/64 4.5s, 6.86GB 16/64 6s, 5.96GB 32/32 4.64s, 5.12GB 32/16 6.3s
c4.2xlarge 8 cores: 4.78GB 32/16 12s, 4.9GB 32/32 12s, 4.34GB 32/8 13s, 3.66GB 16/8 13s, 3.32GB 8/8 16s, 3.79GB 8/16 16s
Looks like this thread is finished for now, so I will close it. For those interested, we have a paper with performance benchmark testing for the S3 driver coming out in a few weeks; we'll share the results with the broader community. There have also been some merges and changes on the S3 array IO work to better integrate it with the computational environments for both HPC and AWS EC2 clusters. There are a few defects to resolve after the recent merges, but it should be updated soon on the develop branch.
Hi everyone, in the latest release of the develop branch (https://github.com/opendatacube/datacube-core/blob/develop/datacube/scripts/system.py) there's no longer a -s3 option in the datacube -v system init command. How can we get the three tables s3_dataset, s3_dataset_chunk and s3_dataset_mapping? Thanks in advance
Hi @palmoreck,
The develop branch altered the way configurations are done: see http://datacube-core.readthedocs.io/en/latest/user/config.html
This is one way to do it:
example .datacube.conf
[user]
#default_environment: datacube
default_environment: s3aio_env
[datacube]
db_hostname: <host_name>
db_database: <db_name>
db_username: <db_username>
[s3aio_env]
db_hostname: <host_name>
db_database: <db_name>
db_username: <db_username>
index_driver: s3aio_index
You can use default_environment to set the default environment you wish to use: datacube or s3aio_env in this example.
After editing .datacube.conf as above, you can do a datacube -v system init and it will pick up the default configuration set in .datacube.conf and create the relevant tables.
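Since .datacube.conf is plain INI, the environment lookup can be illustrated with the stdlib configparser (a sketch with placeholder values, not ODC's own config loader):

```python
import configparser

# A placeholder .datacube.conf in the format shown above; the hostnames
# and database names here are made up.
conf_text = """
[user]
default_environment: s3aio_env

[datacube]
db_hostname: localhost
db_database: datacube

[s3aio_env]
db_hostname: localhost
db_database: datacube
index_driver: s3aio_index
"""

cfg = configparser.ConfigParser()
cfg.read_string(conf_text)

# The [user] section names the environment; that section then supplies
# the index driver (falling back to the default driver if unset).
env = cfg["user"]["default_environment"]
driver = cfg[env].get("index_driver", "default")
```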
The s3 driver code in develop is a little stale, and is missing one important fix (csiro/develop-fix-drivers) and other improvements in csiro/execution-engine.
Great! @petewa thank you. I'll test it.
What are the major changes in the fixes/improvements in those branches? Thanks...
@palmoreck
csiro/develop-fix-drivers implements a thread-safety fix for the driver caches implemented in develop. It's OK if you do not do multi-threaded/multi-process operation; it depends on your application. This is a bandaid fix, and develop should be reworked slightly to support multi-threaded/multi-process operation properly.
csiro/execution-engine has general improvements, including geo-coordinates on array.save() and array.load() (this is not using the Datacube ingest mechanism), tweaked memory consumption vs performance, Datacube.save() (this basically does ingest without the need for yaml files) that works with Datacube.load(), and general refactoring.
thank you for the information @petewa 👍
Maybe I'm missing something because when I execute:
$datacube -E s3aio_env system check
I get:
/usr/local/lib/python3.5/dist-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
Version: 1.6rc1
Config files: /my-user/.datacube.conf
Host: <my-host>
Database: datacube_s3
User: <my-user>
Environment: None
Index Driver: s3aio_index
Valid connection:
2018-06-21 17:57:25,148 953 datacube.drivers.driver_cache WARNING Failed to resolve driver datacube.plugins.index::s3aio_index
Traceback (most recent call last):
File "/usr/local/bin/datacube", line 11, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 87, in augment_usage_errors
yield
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 192, in new_func
return f(parsed_config, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/datacube/scripts/system.py", line 91, in check
index = index_connect(local_config=local_config)
File "/usr/local/lib/python3.5/dist-packages/datacube/index/_api.py", line 38, in index_connect
driver_name, len(index_drivers()), ', '.join(index_drivers())
RuntimeError: No index driver found for 's3aio_index'. 1 available: default
Do you know what I could be missing?
My team and I are working with a Dockerfile for our application where we have some lines like:
...
RUN pip3 install --upgrade pip==9.0.3
RUN pip3 install dask distributed bokeh
##Install missing package for open datacube:
RUN pip3 install --upgrade python-dateutil
#Dependencies for datacube
RUN pip3 install numpy && pip3 install GDAL==$(gdal-config --version) --global-option=build_ext --global-option='-I/usr/include/gdal' && pip3 install rasterio==1.0b1 --no-binary rasterio
RUN pip3 install scipy cloudpickle sklearn lightgbm fiona django boto3 SharedArray pathos zstandard --no-binary fiona
#datacube:
RUN pip3 install datacube==1.6rc1
...
Thanks in advance
ok, i think i need to install datacube like:
pip3 install git+https://github.com/opendatacube/datacube-core.git@develop
so i have now:
$datacube --version
Open Data Cube core, version 1.6rc1+106.gef03ad2.dirty
$datacube -E s3aio_env system check
Version: 1.6rc1+106.gef03ad2.dirty
Config files: /my-user/.datacube.conf
Host: <my-host>
Database: datacube_s3
User: <my-user>
Environment: None
Index Driver: s3aio_index
Valid connection: Database not initialised:
No DB schema exists. Have you run init?
datacube system init
(I also installed psycopg2-binary)
@palmoreck you need to install with the [s3] feature flag:
pip3 install datacube[s3]
or for latest from github
pip3 install git+https://github.com/opendatacube/datacube-core.git@develop#egg=datacube[s3]
OK, thanks @Kirill888. @loicdtx is asking a question in Slack regarding GridWorkflow; I hope we can solve that. (That question still uses the 1.6rc1+106.gef03ad2.dirty version of datacube.)
I posted a response in Slack. I haven't seen that error before. I tested it with the csiro/execution-engine branch and list_cells works fine. For some reason grid_spec is None.
Hello, thanks @petewa
We are trying to install datacube branch csiro/execution-engine with index support for s3.
If we install it with:
$pip3 install git+https://github.com/opendatacube/datacube-core.git@csiro/execution-engine#egg=datacube[s3]
we get:
Collecting datacube[s3] from git+https://github.com/opendatacube/datacube-core.git@csiro/execution-engine#egg=datacube[s3]
Cloning https://github.com/opendatacube/datacube-core.git (to csiro/execution-engine) to /tmp/pip-build-h6nqnek6/datacube
error: Your local changes to the following files would be overwritten by checkout:
docs/config_samples/dataset_types/ls_usgs_sr_scene.yaml
Please, commit your changes or stash them before you can switch branches.
Aborting
Command "git checkout -q 8dac2e59066ef06ebe66850ec9e272b3c0c336da" failed with error code 1 in /tmp/pip-build-h6nqnek6/datacube
And if we install it like:
$git clone -b csiro/execution-engine https://github.com/opendatacube/datacube-core.git /my-user/datacube-core && pip3 install /my-user/datacube-core/.[s3]
datacube is installed, but when executing datacube system init --no-init-users we get no support for the s3 index driver:
2018-06-25 17:25:06,512 22 datacube.drivers.driver_cache WARNING Failed to resolve driver datacube.plugins.index::s3aio_index
Traceback (most recent call last):
File "/usr/local/bin/datacube", line 11, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 87, in augment_usage_errors
yield
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 192, in new_func
return f(parsed_config, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 217, in with_index
validate_connection=expect_initialised)
File "/usr/local/lib/python3.5/dist-packages/datacube/index/_api.py", line 38, in index_connect
driver_name, len(index_drivers()), ', '.join(index_drivers())
RuntimeError: No index driver found for 's3aio_index'. 1 available: default
How can we manage to install datacube from the csiro/execution-engine branch with s3 index driver support?
Thanks in advance
I just tried this out, and it seems to work. I left out the apt-get installs.
conda create -n dc36 python=3.6
source activate dc36
git clone git@github.com:opendatacube/datacube-core.git
cd datacube-core/
git checkout -f csiro/execution-engine
pip install '.[test,celery,s3]'
createdb -U <user_name> -h <host_name> <db_name>
datacube -v system init
There are some additions/changes to .datacube.conf
[user]
#default_environment: datacube
default_environment: s3aio_env
[datacube]
db_hostname: <host_name>
db_database: <db_name>
db_username: <user_name>
redis.host:
redis.port: 6379
redis.db: 0
redis.password:
redis_celery.host:
redis_celery.port: 6379
redis_celery.db: 1
redis_celery.password:
execution_engine.result_bucket: eetest2
execution_engine.use_s3: False
[s3aio_env]
db_hostname: <host_name>
db_database: <db_name>
db_username: <user_name>
index_driver: s3aio_index
redis.host:
redis.port: 6379
redis.db: 0
redis.password:
redis_celery.host:
redis_celery.port: 6379
redis_celery.db: 1
redis_celery.password:
execution_engine.result_bucket: eetest2
execution_engine.use_s3: False
The default_environment variable in .datacube.conf is what datacube system init picks up when initialising the postgres database.
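To make the lookup concrete, here is a self-contained sketch of how the default_environment entry in the [user] section selects which environment section (and hence which index_driver) is used. This only mimics the behaviour; datacube's own LocalConfig does the real parsing.

```python
import configparser

# A trimmed-down version of the .datacube.conf shown above.
CONF = """
[user]
default_environment: s3aio_env

[datacube]
index_driver: default

[s3aio_env]
index_driver: s3aio_index
"""

parser = configparser.ConfigParser()
parser.read_string(CONF)

# Look up which environment section is the default, then read its driver.
env = parser.get('user', 'default_environment')
driver = parser.get(env, 'index_driver', fallback='default')
print(env, driver)  # s3aio_env s3aio_index
```

Switching default_environment back to datacube would resolve index_driver to default, which is why an init run against the wrong environment never sees the s3 index.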
Hi,
Thank you for your answers, we really appreciate them. I don't want to be a headache or come across as negative, but I copy-pasted @petewa's lines into our Dockerfile with small changes (for example, cloning over HTTPS since I don't have SSH access, and using pip3):
RUN git clone https://github.com/opendatacube/datacube-core.git && cd datacube-core/ && git checkout -f csiro/execution-engine && pip3 install '.[test,celery,s3]'
(output of this RUN):
Successfully installed OWSLib-0.16.0 SharedArray-3.0.0 amqp-2.3.2 astroid-1.6.5 atomicwrites-1.1.5 billiard-3.5.0.3 boto3-1.4.3 cachetools-2.1.0 celery-4.2.0 cf-units-1.2.0 cffi-1.11.5 cftime-1.0.0 compliance-checker-4.0.1 coverage-4.5.1 cython-0.28.3 datacube-1.6rc1+145.g8dac2e5 dill-0.2.8.2 graphviz-0.8.3 hypothesis-3.64.0 isodate-0.6.0 isort-4.3.4 jsonschema-2.6.0 kombu-4.2.1 lazy-object-proxy-1.3.1 lxml-4.2.2 mccabe-0.6.1 mock-2.0.0 more-itertools-4.2.0 multiprocess-0.70.6.1 netcdf4-1.4.0 objgraph-3.4.0 pandas-0.23.1 pathos-0.2.2 pbr-4.0.4 pendulum-2.0.2 pluggy-0.6.0 pox-0.2.4 ppft-1.6.4.8 py-1.5.3 pycodestyle-2.4.0 pycparser-2.18 pygeoif-0.7 pylint-1.9.2 pypeg2-2.15.2 pyproj-1.9.5.1 pytest-3.6.2 pytest-cov-2.5.1 pytest-timeout-1.3.0 pytzdata-2018.5 redis-2.10.6 regex-2017.7.28 s3transfer-0.1.13 singledispatch-3.4.0.3 sqlalchemy-1.2.8 tabulate-0.8.2 vine-1.1.4 wrapt-1.10.11 xarray-0.10.7 zstandard-0.9.1
We used the same .datacube.conf (structure), but when doing the init, it says we don't have support for the s3 driver:
$ datacube system init
2018-06-26 19:07:49,098 32 datacube.drivers.driver_cache WARNING Failed to resolve driver datacube.plugins.index::s3aio_index
Traceback (most recent call last):
File "/usr/local/bin/datacube", line 11, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 87, in augment_usage_errors
yield
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 192, in new_func
return f(parsed_config, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 217, in with_index
validate_connection=expect_initialised)
File "/usr/local/lib/python3.5/dist-packages/datacube/index/_api.py", line 38, in index_connect
driver_name, len(index_drivers()), ', '.join(index_drivers())
RuntimeError: No index driver found for 's3aio_index'. 1 available: default
Maybe we're missing something, but we can't figure out what it is. Please see our apt-get installations:
FROM ubuntu:xenial
USER root
#see: https://github.com/Yelp/dumb-init/ for a justification of next line:
RUN apt-get update && apt-get install -y wget curl && wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v$(curl -s https://api.github.com/repos/Yelp/dumb-init/releases/latest| grep tag_name|sed -n 's/ ".*v\(.*\)",/\1/p')/dumb-init_$(curl -s https://api.github.com/repos/Yelp/dumb-init/releases/latest| grep tag_name|sed -n 's/ ".*v\(.*\)",/\1/p')_amd64 && chmod +x /usr/local/bin/dumb-init
#dependencies
RUN apt-get update && apt-get install -y \
openssh-server \
openssl \
sudo \
wget \
nano \
software-properties-common \
python-software-properties \
git \
vim \
vim-gtk \
htop \
build-essential \
libssl-dev \
libffi-dev \
cmake \
python3-dev \
python3-pip \
python3-setuptools \
ca-certificates \
postgresql-client \
awscli && \
pip3 install --upgrade pip==9.0.3
#Install spatial libraries
RUN add-apt-repository -y ppa:ubuntugis/ubuntugis-unstable && apt-get -qq update
RUN apt-get install -y \
netcdf-bin \
libnetcdf-dev \
ncview \
libproj-dev \
libgeos-dev \
gdal-bin \
libgdal-dev
##Install dask distributed
RUN pip3 install dask distributed --upgrade && pip3 install bokeh
##Install missing package for open datacube:
RUN pip3 install --upgrade python-dateutil
#Dependencies for datacube and app
RUN pip3 install numpy && pip3 install GDAL==$(gdal-config --version) --global-option=build_ext --global-option='-I/usr/include/gdal' && pip3 install rasterio==1.0b1 --no-binary rasterio
RUN pip3 install scipy cloudpickle sklearn lightgbm fiona django --no-binary fiona
RUN pip3 install --no-cache --no-binary :all: psycopg2
#Next for compliance-checker installation
RUN apt-get clean && apt-get update && apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
#datacube:
RUN git clone https://github.com/opendatacube/datacube-core.git && cd datacube-core/ && git checkout -f csiro/execution-engine && pip3 install '.[test,celery,s3]'
....(next lines involve installations for our app).....
Thanks for your support!
Hi @palmoreck,
I'm happy to help; we can have a WebEx hookup or similar and I can walk you through the process.
This might be the fastest way to resolve this. Please PM me on Slack: peterw
I can't see anything wrong at first glance. The error comes from the IndexDriverCache not having the s3aio_index entry. This is set in setup.py, i.e. https://github.com/opendatacube/datacube-core/blob/csiro/execution-engine/setup.py
Since you are cloning the repo fresh, this shouldn't be the issue.
There are some differences in the apt libraries and apt repository you are using.
Try using this as a basis: https://github.com/opendatacube/datacube-core/blob/csiro/execution-engine/Dockerfile
Failing that, is there another old datacube installed in pip/pip3? Run
pip uninstall datacube
pip3 uninstall datacube
before
pip3 install '.[test,celery,s3]'
Another thing to make sure of is that you have a .aws/credentials file in your home directory.
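One quick way to confirm whether the s3aio_index driver is actually visible to your interpreter is to list the entry points in the datacube.plugins.index group, which is the group named in the warning at the top of the traceback. A minimal sketch (the helper name is ours, not part of the datacube API):

```python
import pkg_resources

def list_index_drivers(group='datacube.plugins.index'):
    """Return the names of all entry points registered in the given group."""
    return sorted(ep.name for ep in pkg_resources.iter_entry_points(group))

if __name__ == '__main__':
    # On a working install with the [s3] extra, this should include both
    # 'default' and 's3aio_index'; if only 'default' appears, the extra was
    # not installed into the environment this interpreter is using.
    print(list_index_drivers())
```

Running this inside the container would quickly distinguish "wrong environment/stale install" from a genuine packaging problem.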
When running Open Data Cube in the cloud, I would like to keep datasets in Amazon S3 buckets without having to store them on my EC2 instance. I have seen that the new features were added in datacube-core release 1.5.2. I have read the little documentation there is on these features, but I am confused. Is there any documentation on what these features are capable of, or any example of how to use them?