Amazon S3 support - Githubissues

opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time

http://www.opendatacube.org

Apache License 2.0

512 stars 177 forks source link

Amazon S3 support #323

Closed adriaat closed 6 years ago

adriaat commented 6 years ago

When running Open Data Cube in the cloud, I would like to have datasets in Amazon S3 buckets without having to store them in my EC2 instance. I have seen that in datacube-core release 1.5.2 the new features

Support for AWS S3 array storage
Driver Manager support for NetCDF, S3, S3-file drivers

were added. I have read the few documentation there is on these features but I am confused. Is there any documentation on what these features are capable of or any example on how to use them?

woodcockr commented 6 years ago

These are new features being developed by the CSIRO contributors to ODC, @petewa and @rtaib . We are currently working on an AWS based ODC execution engine to support large scale parallel processing, which directly exploits the AWS S3 storage driver. Documentation is a little sparse at the moment but the S3 storage driver is a drop in replacement for the default file based driver. There's a optional flag on the necessary API calls (like datacube.load, the ingester, etc) which tells the system to use the S3 driver. The storage driver handling is documented here but I suspect you already found that. We are working on a technical paper about the driver which should be available in January.

I will let @petewa know about this thread if he hasn't already spotted it. Let us know your specific questions here and we'll respond.

petewa commented 6 years ago

Hi @adriaat,

As @woodcockr mentions, documentation is sparse and is something we need to improve.

The S3 driver extension serves 2 core purposes:

Allow ODC to ingest directly to S3 and have ODC operates as before while storing & retrieving data from S3 transparently.
High performance externally indexed ndarray storage back-end for scalable compute on the cloud.
- use S3 like raw main memory (key = memory address, object = contiguous block of binary data)

In more detail:

ODC ingest directly to S3 and you can use ODC as before.
- ODC able to store arrays distributed to S3 objects in a compute usable format.
ODC able to query/retrieve arrays direct from S3.
- ODC able to parallelise IO to/from S3 for Data Access/Storage.
S3 IO is done in parallel streams into a shared memory array without locking.
Data in S3 are nd mapped to 1D, so for a given nd data request, you know in advance where data is.
- For a given nd query, you can map out the required IO operations before hitting S3.
  - useful for keeping the compute pipeline full.
  - this feeds into data decomposition for compute more on that later.
- data is stored as chunked row-major and compressed.
  - data in nd space is close in 1d space.

Things in store for the future:

reduced memory consumption with optimised threads/process scaling for ec2 cluster
- already in csiro/execution-engine branch, not yet in develop
improved stability with multiple threads/process
- already in csiro/execution-engine branch, not yet in develop
lazy arrays for accessing data on s3 as though it's a local array.
- already in csiro/execution-engine branch, not yet in develop
moving from postgres index to a memory cached n-dimensional index for index and storage.
- Hilbert curve for mapping nd data to 1D
storing multiple chunks in s3 object.
ODC data access API to support nd block retrieval.
- currently only 2D block retrieval and stacking over time.

S3 Ingestion:

I am assuming you are already familiar with the ingest process using the default NetCDF driver.

I'll cover the differences when ingesting to S3:

You will need an AWS account and a .aws/credentials file:

[default]
aws_access_key_id = <Access key ID>
aws_secret_access_key = <Secret access key>

Initialise the database with an extra s3 flag. This creates the extra tables in the database for s3.

datacube -v system init -s3

Example ingest config yaml with extra s3 parameters

docs/config_samples/ingester/ls5_nbar_albers_s3.yaml

Settings which you will need to change/add:
container: bucket_name  (must exist prior to ingest)
storage:
  driver: s3 (select s3 driver)

Ingest with an extra s3 parameter -- driver s3. This tells ODC to use the s3 driver for ingest.
```
datacube --driver s3 -v ingest -c ~/yamls/ls5_nbar_albers_s3.yaml --executor multiproc 8
```

Usage:

Usage is the same except you now are able to use the use_threads option in datacube.load(). This will parallelise the data access from S3.

e.g. import datacube dc = datacube.Datacube() nbar = dc.load(product='ls5_nbar_albers', x=(149, 150), y=(-35, -36), use_threads=True)

davidedelerma commented 6 years ago

Hi,

Many thanks for your help. I am working with @adriaat on OpenDataCube and I am experiencing the same issue of configuring datacube with S3.

When I run the command: datacube -v system init --create-s3-tables I got the following message:

2017-11-17 17:52:21,845 3828 datacube INFO Running datacube command: /home/ubuntu/Datacube/datacube_env/bin/datacube -v system init -s3 2017-11-17 17:52:21,855 3828 DriverManager INFO Import Failed for Driver plugin "S3TestDriver", skipping. 2017-11-17 17:52:21,856 3828 DriverManager INFO Import Failed for Driver plugin "S3Driver", skipping. Initialising database... 2017-11-17 17:52:21,862 3828 datacube.index.postgres.tables._core INFO Ensuring user roles. 2017-11-17 17:52:21,865 3828 datacube.index.postgres.tables._core INFO Adding role grants. Updated. Checking indexes/views. 2017-11-17 17:52:21,871 3828 datacube.index.postgres._api INFO Checking dynamic views/indexes. (rebuild views=True, indexes=False) Done.

So the new tables in Postgres for S3 cannot be created. The aws credentials are properly set in .aws/credentials file I am working on the develop branch and the Datacube version is : 1.5.1+218.g5f8507c Same error happens in CSIRO branch.

Thanks, Davide

petewa commented 6 years ago

Hi Davide,

The import failed message is not worded very well. You can safely ignore it. Non default drivers are skipped during initialization because they are not required. In this particular case, they are skipped because the s3 tables are not present before initialization. After you execute the 'datacube -v system init -s3', the tables should be properly created and initialized.

You can confirm by checking the following tables are created:

s3_dataset
s3_dataset_chunk
s3_dataset_mapping

Thanks, Peter

davidedelerma commented 6 years ago

Hi @petewa ,

I double checked, but unfortunately there are not new tables into the database. The only tables in the agdc schema are:

dataset
dataset_location
dataset_source
dataset_type
metadata_type

Thanks, Davide

petewa commented 6 years ago

Hi @davidedelerma, could you please try the 'datacube -v system init -s3' command with a fresh database.

The init script may not be set up to create the tables with an already initialised agdc schema.

davidedelerma commented 6 years ago

Hi @petewa , Thanks. I dropped the database and initialized it again using the "s3" option. Now I have the three extra tables. Then I added a product (Landsat 8) and indexed 9 Landsat 8 L2 images referring to the same footprint. After that I modified the configuration file for ingestion using:

container: 'rheaodc'
storage:
  driver: s3

but when I run the command: datacube --driver s3 -v ingest -c ~/ls8_collections_sr_general_s3.yaml I get the error:

2017-11-21 13:12:41,883 1935 datacube INFO Running datacube command: /home/ubuntu/Datacube/datacube_env/bin/datacube --driver s3 -v ingest -c /home/ubuntu/ls8_collections_sr_general_s3.yaml
2017-11-21 13:12:41,897 1935 DriverManager INFO Import Failed for Driver plugin "S3TestDriver", skipping.
2017-11-21 13:12:41,898 1935 DriverManager INFO Import Failed for Driver plugin "S3Driver", skipping.
Traceback (most recent call last):
  File "/home/ubuntu/Datacube/datacube_env/bin/datacube", line 11, in <module>
    load_entry_point('datacube', 'console_scripts', 'datacube')()
  File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", lin                                                               e 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", lin                                                               e 697, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", lin                                                               e 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", lin                                                               e 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/Datacube/datacube_env/lib/python3.5/site-packages/click-6.7-py3.5.egg/click/core.py", lin                                                               e 535, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/Datacube/agdc-v2/datacube/ui/click.py", line 199, in new_func
    return f(parsed_config, *args, **kwargs)
  File "/home/ubuntu/Datacube/agdc-v2/datacube/ui/click.py", line 226, in with_driver_manager
    validate_connection=expect_initialised) as driver_manager:
  File "/home/ubuntu/Datacube/agdc-v2/datacube/drivers/manager.py", line 78, in __init__
    self.set_current_driver(default_driver_name or self._DEFAULT_DRIVER)
  File "/home/ubuntu/Datacube/agdc-v2/datacube/drivers/manager.py", line 198, in set_current_driver
    driver_name, ', '.join(self.__drivers.keys())))
ValueError: Default driver "s3" is not available in NetCDF CF

Thanks, Davide

davidedelerma commented 6 years ago

Here is the full configuration file for ingestion:

source_type: ls8_collections_sr_scene
output_type: ls8_lasrc_general_s3

description: Landsat 8 USGS Collection 1 Higher Level SR scene proessed using LaSRC. Resampled to 30m EPSG:4326 projection with a sub degree tile size.

location: '/datacube/ingested_data'
file_path_template: 'LS8_OLI_LaSRC/General/LS8_OLI_LaSRC_4326_{tile_index[0]}_{tile_index[1]}_{start_time}.nc'
global_attributes:
  title: CEOS Data Cube Landsat Surface Reflectance
  summary: Landsat 8 Operational Land Imager ARD prepared by NASA on behalf of CEOS.
  source: LaSRC surface reflectance product prepared using USGS Collection 1 data.
  institution: CEOS
  instrument: OLI_TIRS
  cdm_data_type: Grid
  keywords: AU/GA,NASA/GSFC/SED/ESD/LANDSAT,REFLECTANCE,ETM+,TM,OLI,EARTH SCIENCE
  keywords_vocabulary: GCMD
  platform: LANDSAT_8
  processing_level: L2
  product_version: '2.0.0'
  product_suite: USGS Landsat Collection 1
  project: CEOS
  coverage_content_type: physicalMeasurement
  references: http://dx.doi.org/10.3334/ORNLDAAC/1146
  license: https://creativecommons.org/licenses/by/4.0/
  naming_authority: gov.usgs
  acknowledgment: Landsat data is provided by the United States Geological Survey (USGS).

container: 'rheaodc'
storage:
  driver: s3

  crs: EPSG:4326
  tile_size:
          longitude: 0.943231048326
          latitude:  0.943231048326
  resolution:
          longitude: 0.000269494585236
          latitude: -0.000269494585236
  chunking:
      longitude: 200
      latitude: 200
      time: 1
  dimension_order: ['time', 'latitude', 'longitude']

measurements:
    - name: coastal_aerosol
      dtype: int16
      nodata: -9999
      resampling_method: nearest
      src_varname: 'sr_band1'
      zlib: True
      attrs:
          long_name: "Surface Reflectance 0.43-0.45 microns (Coastal Aerosol)"
          alias: "band_1"
    - name: blue
      dtype: int16
      nodata: -9999
      resampling_method: nearest
      src_varname: 'sr_band2'
      zlib: True
      attrs:
          long_name: "Surface Reflectance 0.45-0.51 microns (Blue)"
          alias: "band_2"
    - name: green
      dtype: int16
      nodata: -9999
      resampling_method: nearest
      src_varname: 'sr_band3'
      zlib: True
      attrs:
          long_name: "Surface Reflectance 0.53-0.59 microns (Green)"
          alias: "band_4"
    - name: red
      dtype: int16
      nodata: -9999
      resampling_method: nearest
      src_varname: 'sr_band4'
      zlib: True
      attrs:
          long_name: "Surface Reflectance 0.64-0.67 microns (Red)"
          alias: "band_4"
    - name: nir
      dtype: int16
      nodata: -9999
      resampling_method: nearest
      src_varname: 'sr_band5'
      zlib: True
      attrs:
          long_name: "Surface Reflectance 0.85-0.88 microns (Near Infrared)"
          alias: "band_5"
    - name: swir1
      dtype: int16
      nodata: -9999
      resampling_method: nearest
      src_varname: 'sr_band6'
      zlib: True
      attrs:
          long_name: "Surface Reflectance 1.57-1.65 microns (Short-wave Infrared)"
          alias: "band_6"
    - name: swir2
      dtype: int16
      nodata: -9999
      resampling_method: nearest
      src_varname: 'sr_band7'
      zlib: True
      attrs:
          long_name: "Surface Reflectance 2.11-2.29 microns (Short-wave Infrared)"
          alias: "band_7"
    - name: 'pixel_qa'
      dtype: int32
      nodata: 1
      resampling_method: nearest
      src_varname: 'pixel_qa'
      zlib: True
      attrs:
          long_name: "Pixel Quality Attributes Bit Index"
          alias: [pixel_qa]
    - name: 'aerosol_qa'
      dtype: uint8
      nodata: 0
      resampling_method: nearest
      src_varname: 'sr_aerosol'
      zlib: True
      attrs:
          long_name: "Aerosol Quality Attributes Bit Index"
          alias: [sr_aerosol_qa, sr_aerosol]
    - name: 'radsat_qa'
      dtype: int32
      nodata: 1
      resampling_method: nearest
      src_varname: 'radsat_qa'
      zlib: True
      attrs:
          long_name: "Radiometric Saturation Quality Attributes Bit Index"
          alias: [radsat_qa]
    - name: 'solar_azimuth'
      dtype: int16
      nodata: -32768
      resampling_method: nearest
      src_varname: 'solar_azimuth_band4'
      zlib: True
      attrs:
          long_name: "Solar Azimuth Angle for Band 4"
          alias: [solar_azimuth_band4]
    - name: 'solar_zenith'
      dtype: int16
      nodata: -32768
      resampling_method: nearest
      src_varname: 'solar_zenith_band4'
      zlib: True
      attrs:
          long_name: "Solar Zenith Angle for Band 4"
          alias: [solar_zenith_band4]
    - name: 'sensor_azimuth'
      dtype: int16
      nodata: -32768
      resampling_method: nearest
      src_varname: 'sensor_azimuth_band4'
      zlib: True
      attrs:
          long_name: "Sensor Azimuth Angle for Band 4"
          alias: [sensor_azimuth_band4]
    - name: 'sensor_zenith'
      dtype: int16
      nodata: -32768
      resampling_method: nearest
      src_varname: 'sensor_zenith_band4'
      zlib: True
      attrs:
          long_name: "Sensor Zenith Angle for Band 4"
          alias: [sensor_zenith_band4]

petewa commented 6 years ago

Hi @davidedelerma,

I made a mistake in my earlier post. "Import Failed for Driver plugin "S3Driver", skipping." Should not be happening. This means that there was an import error when loading the the S3Driver.

In relation to my earlier message. I misread it as "Driver plugin "S3Driver" failed requirements check, skipping" which is different.

It looks like you are on EC2 with a flavor of Ubuntu. What version of Ubuntu? I've been using 'Ubuntu 16.04.1 LTS' personally. Can you please post the output from pip freeze.

The following should give you the proper environment:

conda config --prepend channels conda-forge
conda config --prepend channels conda-forge/label/dev
conda create -n agdc python=3.6
conda env update -n agdc --file datacube-core/.travis/environment.yaml
source activate agdc

petewa commented 6 years ago

Greetings @davidedelerma

How did you go?

adriaat commented 6 years ago

Hello all,

@davidedelerma and I have been working on it. We both use Ubuntu 16.04.03 LTS. We created a new environment following @petewa commands from a previous comment. Using this environment, we ingested a single time slice of Landsat 8 data in a S3 bucket. There was no error in the ingestion. We loaded this data with use_threads=False and we did not have any problem: the data is correctly loaded, even though it is slow. Surprisingly, the instance we used had 4 vcpu, and the 4 of them were being used in the ingestion process. On the other hand, we also tried loading the data with use_threads=True. In this case, the jupyter notebook never finishes (at least for a reasonable amount of waiting time) loading the data, and the consumption of the vcpus is nearly 0.

In the same environment, we ingested two time slices of Landsat 8 data. We loaded these data using use_threads=False and use_threads=True, and in both cases we were able to load it. Afterwards, we ingested more data and we had in total 9 time slices of Landsat 8 data. We loaded a region of interest of these 9 time slices, due to limitations with memory, but we did not have any problem as well.

We tested all of the above with both develop and csiro/execution-engine branches, where the two yield datacube version 1.5.1.

At this point, we ask ourselves

What are the advantages of storing the ingested data as objects in S3 with respect to using S3 buckets as a file system for ingested data by means of software that allows to mount S3 buckets, e.g. s3fs-fuse?
How does use_threads actually paralellise the data access from S3? Since when we used use_threads=False we saw that our all vcpus were being used.

Elaborating on the point 2, we did a small benchmark of the elapsed time when loading the ingested data for a region of interest for the 9 time slices, when ingested in the S3 using the s3 driver and when ingested locally, and also testing the develop and the csiro/execution-engine branches. The results show the average time over 3 runs.

	Branch develop	Branch csiro/execution-engine
In S3 with s3 driver	`use_threads=False`: 33.24 s, `use_threads=True`: 25.29 s	`use_threads=False`: 83.08 s, `use_threads=True`: 72.64 s
In EC2	`use_threads=False`: 6.71 s, `use_threads=True`: 6.45 s	`use_threads=False`: 6.63 s, `use_threads=True`: 6.22 s

We expected some problem if we loaded the ingested data in EC2 using use_threads=True, given a previous comment. However, it works and it is faster! On the other hand, we are also surprised to see that the csiro/execution-engine branch is faster when loading the ingested data in EC2 but much slower when the data is in S3.

There is something else that may be worth sharing. The environment we were using before was the one created by following the installation guide for Open Data Cube in the www.opendatacube.org website: Data Cube Installation Guide. However, when using the environment created with the environment file datacube-core/.travis/environment.yaml, there is an error with the postgres database that before we did not have:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server: No such file or directory
    Is the server running locally and accepting
    connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

Before we did not have a db_hostname in the .datacube.conf file. To solve this error, we had to set either db_hostname: localhost or db_hostname: /var/run/postgresql in the .datacube.conf file.

petewa commented 6 years ago

One thing that comes to mind is latency and compute performance. There are a bunch of reasons, main one being AWS architects tends steer us away from using s3 as a file system e.g. s3fs
use_threads controls the threading on retrieving multiple time slices and stitching datasets. This is currently a compromise. Having said that, the underlying S3IO library is threaded by default.

Regarding your benchmarks, with a 200x200 chunking size, I'm not surprised with the results you are seeing.

Try these parameters for S3 as a starting point and tweak from there.

change chunking to 2000x2000
change tile size and resolution such that a tile is roughly 8000x8000 or more

Example LS5 ingest config. https://github.com/opendatacube/datacube-core/blob/develop/docs/config_samples/ingester/ls5_nbar_albers_s3.yaml

I aim for objects around ~10-20 MB in size. You can look in your S3 bucket to see this, post ingest.

use_threads should not make a difference when ingesting to EBS with the default netcdf driver. use_threads should make a big difference when ingesting to EBS with the s3-file driver.

The csiro/csiro-execution engine branch tries to balance performance with memory consumption. It scales the number of threads w.r.t to the number of cores. It's something like: threads = ncores*2 for stitching/stacking. processes = ncores for pulling data from s3.

adriaat commented 6 years ago

Hello,

Thank you for your reply @petewa. I played a little bit with the suggested changes and with the two branches (develop and csiro/execution-engine), producing benchmarks to qualify and quantify the results. Since there are just too many results to post them here, I give an overall impression of them. The comments below refer to loading times of 9 Lansat 8 time slices, sharing the comments for both branches (unless specified otherwise) and in an Amazon EC2 t2.xlarge instance (vcPU: 4, memory: 16 GiB). Also, please note that when I refer to "tile size" I am referring to the ratio "tile size/resolution" (I did not change the resolution), and when referred to a chunking parameter, the chunks are squared (therefore a chunking parameter of 200 stands for 200x200).

If only the chunking parameter is changed from the original 200 to 2000: the loading time is much faster (40%-85% faster!), specially if use_threads=True.
If only the tile size is changed from 3500 to 8000: it is faster, but much slower than only changing the chunking parameter from 200 to 2000.
If both the tile size is changed from 3500 to 8000 and the chunking parameter from 200 to 2000: it is faster, but the loading times are similar to only changing the chunking parameter, as if the tile size did not have an enhancing effect.
Having played with the above parameters, the largests files in the bucket are slighly larger than 5 MB. I guess other parameters should be used to obtain objects around 10-20 MB in size.
When ingesting to the EBS, there is not a clear difference with use_threads with either the default NetCDF driver or the S3-file driver. The way I ingest to the EBS with the S3-file driver is the following:
- For each ingestion configuration file (so different tile sizes and chunkings) I set driver: s3 in the storage: list, i.e.
```
storage:
driver: s3
crs: EPSG:4326
tile_size:
    longitude: 0.943231048326
    latitude:  0.943231048326
resolution:
    longitude: 0.000269494585236
    latitude: -0.000269494585236
chunking:
    longitude: 200
    latitude: 200
    time: 1
dimension_order: ['time', 'latitude', 'longitude']
```
- Ingest with the usual command, datacube -v ingest -c ~/ls8_test.yaml. If I run datacube --driver s3 -v ingest -c ~/ls8_test.yaml, I get the error datacube.utils.DatacubeException: Missing 'container' parameter, cannot write to storage. I understand that, if I set the container parameter, the data will be ingested to the S3 bucket rather than into the EBS. Am I right?
When ingesting to S3, the branches develop and csiro/execution-engine result approximately equal in timewise terms, except when the chunking parameter is set small, i.e. 200, regardless of the tile size. In these cases, the differences are like the ones in the table of my previous comment.
The fastest loading times from data ingested in the S3 bucket is given by the branch develop, with use_threads=True, and with chunking set to 2000, regardless of the tile size. It is also quicker to load data chunked to 2000 with use_threads=False than data chunked to 200 and use_threads=True.

Now everything makes more sense. @petewa says in the previous comment that use_threads should make a big difference when ingesting to EBS with the S3-file driver, in contrast with using the default NetCDF driver. However, I do not experience this behaviour and it may come from my side. Did I ingest the files correctly to the EBS using the S3 driver? I only have NetCDF files in the ingestion directory specified for the S3-file driver, is that OK?

rtaib commented 6 years ago

Thanks for benchmarking and providing such a nice summary @adriaat ! I think @petewa may be travelling right now, so let me answer your question about s3-file ingestion: yes, you need to define the container field in the config file, the same way you would for normal s3 ingestion, the idea being that a container is a driver-independent location for data. What you put there is the absolute path where the data will be stored. Set this very carefully if running as root. Then, you definitely need the --driver s3 flag during ingestion. By doing so, you should NOT see any NetCDF files in the ingestion directory as you do now (when the flag is not specified, it will default to NetCDF). That should also change your ingestion timings, hopefully for the better.

petewa commented 6 years ago

Hi @adriaat, I'm back in the office now.

Performance tips:

more cores generally means more performance for the s3 and s3-file driver
- I test with a c4.2xlarge, even more performance if you go bigger. e.g. c4.4xlarge, c4 8xlarge.
- I found t1/t2 instance performance to s3 to be on the low side.
Go big on tile size
Go bigger on chunking parameter to arrive at 10-20MB in size.
After ingest to S3, it takes around 1-2 hours for s3 performance to stabilise.

I noticed the way you ingest to EBS eith the s3-file driver is incorrect which is probably why you are still seeing netcdf files.

The driver should be s3-test.

E.g.

container: '/home/ubuntu/data/output' storage: driver: s3-test

The container parameter must be set for s3 or s3-file ingest.

I did some performance tests for a first attempt of balancing performance vs memory. I'm sharing this as it's a good indication of real-life performance. As you can see performance varies quite a bit depending on the instance type and threading parameters.

These numbers are using the example ingest yaml, which has not been tuned and results in a small object size of 1-5MB. Performance should be better with a slightly bigger object size around 10-20MB.

What you don't see from below is, it's pulling back all bands, not just 1.

=================================

(time: 12, x: 421, y: 490) 30MB nbar = dc.load(product='ls5_nbar_albers', x=(149.25, 149.35), y=(-35.25, -35.35), use_threads=True)

c4.large s3 4.2s c4.xlarge s3 2.3s c4.2xlarge s3 1.3s c4.4xlarge s3 0.9s c4.8xlarge s3 0.9s

read performance from s3 varies from 0.9s to 4.2s.

=================================

(12, 840, 980) 119MB nbar = dc.load(product='ls5_nbar_albers', x=(149.15, 149.35), y=(-35.15, -35.35)) netcdf 2.641545412540436 s3 2.600525255203247

c4.large s3 9.8s c4.xlarge s3 5.5s c4.2xlarge s3 2.5s c4.4xlarge s3 1.7s c4.8xlarge s3 1.4s

read performance from s3 varies from 1.4s to 9.8s.

=================================

(time: 12, x: 1680, y: 1959) 500MB nbar = dc.load(product='ls5_nbar_albers', x=(149.15, 149.55), y=(-35.15, -35.55), use_threads=True, loops=100) netcdf 7.269922971725464 s3 4.752055556774139 s3-file 2.585359718799591

c4.large s3 22s c4.xlarge s3 11.4s c4.2xlarge s3 5.6s c4.4xlarge s3 3.1s c4.8xlarge s3 2.6s

read performance from s3 varies from 2.6s to 22s.

The performance varies is purely in the number of IO streams possible.

=================================

(time: 12, x: 3778, y: 4408) 1.2GB nbar = dc.load(product='ls5_nbar_albers', x=(149.05, 149.95), y=(-35.05, -35.95), use_threads=True)

c4.large s3 out of memory c4.xlarge s3 out of memory c4.2xlarge s3 11.5s c4.4xlarge s3 6.6s c4.8xlarge s3 5.2s r4.16xlarge s3 7s

=================================

(12, 840, 980) 119MB nbar = dc.load(product='ls5_nbar_albers', x=(149.15, 149.35), y=(-35.15, -35.35))

c4.xlarge 4 cores 1.25GB 8/16 6.7s 0.88GB 8/8 6.1s 0.86GB 16/8 5.5s 0.60GB 16/4 6.5s 0.61GB 32/4 6.5s 1.51GB 32/30 5.6s

c4.2xlarge 8 cores 1.57GB 32/30 2.9s 0.87GB 32/8 3.8s 1.28GB 32/16 2.8s 1.22GB 16/16 2.7s 0.90GB 16/8 3.2s 1.36GB 8/16 3.7s

Here you can see the different amount of threads/cores and the effect it has on memory consumption (0.60 - 1.57GB) and read performance (2.7s - 6.7s) on c4.xlarge and c4.2xlarge.

(time: 12, x: 1680, y: 1959) 500MB nbar = dc.load(product='ls5_nbar_albers', x=(149.15, 149.55), y=(-35.15, -35.55), use_threads=True, loops=100)

c4.2xlarge 8 cores 1.66GB 8/16 6.8s 1.6GB 16/16 5.6s 1.21GB 16/8 6.3s 1.19GB 32/8 6.3s 1.62GB 32/16 5.6s

(time: 12, x: 3778, y: 4408) 1.2GB nbar = dc.load(product='ls5_nbar_albers', x=(149.05, 149.95), y=(-35.05, -35.95), use_threads=True)

2xcores/1xcores

c4.8xlarge 36 cores 7.44G 64/64 5.3s 7.6G 32/64 4.5s 6.86G 16/64 6s 5.96G 32/32 4.64s 5.12GB 32/16 6.3s

c4.2xlarge 8 cores 4.78G 32/16 12s 4.9G 32/32 12s 4.34G 32/8 13s 3.66G 16/8 13s 3.32G 8/8 16s 3.79G 8/16 16s

woodcockr commented 6 years ago

Looks like this thread is finished for now so I will close it. For those interested we have a paper with performance benchmark testing for the S3 driver coming out in a few weeks. We’ll share the results with the broader community. There’s also been some mergers and changes on the S3 array IO work to better integrate it with the computational environments for both HPC and AWS EC2 clusters. A few defects to resolve after recent mergers but should be updated soon on the develop branch.

palmoreck commented 6 years ago

Hi everyone, in the last release of branch develop: https://github.com/opendatacube/datacube-core/blob/develop/datacube/scripts/system.py there's no more -s3 option in the datacube -v system init command, how can we get the three tables s3_dataset , s3_dataset_chunk and s3_dataset_mapping ? Thanks in advance

petewa commented 6 years ago

Hi @palmoreck,

The develop branch altered the way configurations are done: see http://datacube-core.readthedocs.io/en/latest/user/config.html

This is one way to do it:

example .datacube.conf

[user]
#default_environment: datacube
default_environment: s3aio_env

[datacube]
db_hostname: <host_name>
db_database: <db_name>
db_username: <db_username>

[s3aio_env]
db_hostname: <host_name>
db_database: <db_name>
db_username: <db_username>

index_driver: s3aio_index

You can use the default_environment to set the default environment you wish to use datacube or s3aio_env in this example.

After editing .datacube.conf as above, you can do a datacube -v system init and it will pick up the default configuration set in .datacube.conf and create the relevant tables.

The s3 driver code in develop is a little stale, and is missing one important fix csiro/develop-fix-drivers and other improvements in csiro/execution-engine

palmoreck commented 6 years ago

great¡ @petewa thank you. I'll test it.

what are the major changes in the fixes/improvements in those branches? thanks...

petewa commented 6 years ago

@palmoreck

csiro/develop-fix-drivers implements a thread-safety fix for the driver caches implemented in develop. It's ok if you do not do multi-threaded/multi-process operation. It's dependent on your application. This is a bandaid fix, and develop should be reworked slightly to support multi-threaded/multi-process operation properly.

csiro/execution-engine has general improvements like, including geo-coordinates on array.save() and array.load() (this is not using the Datacube Ingest mechanism), tweaking memory consumption vs performance, Datacube.save() (This basically does ingest without the need for yaml files) that works with Datacube.load() and general refactoring.

palmoreck commented 6 years ago

thank you for the information @petewa 👍

palmoreck commented 6 years ago

Maybe I'm missing something because when I execute:

$datacube -E s3aio_env system check

I get:

/usr/local/lib/python3.5/dist-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
  """)
Version:       1.6rc1
Config files: /my-user/.datacube.conf
Host:         <my-host>
Database:      datacube_s3
User:          <my-user>
Environment:   None
Index Driver:  s3aio_index

Valid connection:   2018-06-21 17:57:25,148 953 datacube.drivers.driver_cache WARNING Failed to resolve driver datacube.plugins.index::s3aio_index
Traceback (most recent call last):
  File "/usr/local/bin/datacube", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 87, in augment_usage_errors
    yield
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 192, in new_func
    return f(parsed_config, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/datacube/scripts/system.py", line 91, in check
    index = index_connect(local_config=local_config)
  File "/usr/local/lib/python3.5/dist-packages/datacube/index/_api.py", line 38, in index_connect
    driver_name, len(index_drivers()), ', '.join(index_drivers())
RuntimeError: No index driver found for 's3aio_index'. 1 available: default

Do you know what I could be missing?

My team and I are working with a Dockerfile for our application where we have some lines like:

...
RUN pip3 install --upgrade pip==9.0.3
RUN pip3 install dask distributed bokeh 
##Install missing package for open datacube:
RUN pip3 install --upgrade python-dateutil
#Dependencies for  datacube
RUN pip3 install numpy && pip3 install GDAL==$(gdal-config --version) --global-option=build_ext --global-option='-I/usr/include/gdal' && pip3 install rasterio==1.0b1 --no-binary rasterio  
RUN pip3 install scipy cloudpickle sklearn lightgbm fiona django boto3 SharedArray pathos zstandard --no-binary fiona 
#datacube:
RUN pip3 install datacube==1.6rc1
...

Thanks in advance

palmoreck commented 6 years ago

ok, i think i need to install datacube like:

pip3 install git+https://github.com/opendatacube/datacube-core.git@develop

so i have now:

$datacube --version
Open Data Cube core, version 1.6rc1+106.gef03ad2.dirty

$datacube -E s3aio_env system check
Version:       1.6rc1+106.gef03ad2.dirty
Config files:  /my-user/.datacube.conf
Host:          <my-host>
Database:      datacube_s3
User:          <my-user>
Environment:   None
Index Driver:  s3aio_index

Valid connection:   Database not initialised:

No DB schema exists. Have you run init?
    datacube system init

(I also installed psycopg2-binary)

Kirill888 commented 6 years ago

@palmoreck need to install with [s3] feature flag

pip3 install datacube[s3]

or for latest from github

pip3 install git+https://github.com/opendatacube/datacube-core.git@develop#egg=datacube[s3]

palmoreck commented 6 years ago

ok, thanks @Kirill888 @loicdtx is asking a question in slack regarding GridWorkflow. I hope we can solve that. (That question still uses 1.6rc1+106.gef03ad2.dirty version of datacube)

petewa commented 6 years ago

I posted a response in slack. I haven't seen that error before. I tested it with the csiro/execution-engine branch and list_cells works fine. For some reason grid_spec is None.

palmoreck commented 6 years ago

Hello, thanks @petewa

We are trying to install datacube branch csiro/execution-engine with index support for s3.

If we install it with:

$pip3 install git+https://github.com/opendatacube/datacube-core.git@csiro/execution-engine#egg=datacube[s3]

we get:

Collecting datacube[s3] from git+https://github.com/opendatacube/datacube-core.git@csiro/execution-engine#egg=datacube[s3]
  Cloning https://github.com/opendatacube/datacube-core.git (to csiro/execution-engine) to /tmp/pip-build-h6nqnek6/datacube
error: Your local changes to the following files would be overwritten by checkout:
    docs/config_samples/dataset_types/ls_usgs_sr_scene.yaml
Please, commit your changes or stash them before you can switch branches.
Aborting
Command "git checkout -q 8dac2e59066ef06ebe66850ec9e272b3c0c336da" failed with error code 1 in /tmp/pip-build-h6nqnek6/datacube

And if we install it like:

$git clone -b csiro/execution-engine https://github.com/opendatacube/datacube-core.git /my-user/datacube-core && pip3 install /my-user/datacube-core/.[s3]

datacube is installed but when executing $datacube system init --no-init-users we get no support for s3 index driver :

2018-06-25 17:25:06,512 22 datacube.drivers.driver_cache WARNING Failed to resolve driver datacube.plugins.index::s3aio_index
Traceback (most recent call last):
  File "/usr/local/bin/datacube", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 87, in augment_usage_errors
    yield
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 192, in new_func
    return f(parsed_config, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 217, in with_index
    validate_connection=expect_initialised)
  File "/usr/local/lib/python3.5/dist-packages/datacube/index/_api.py", line 38, in index_connect
    driver_name, len(index_drivers()), ', '.join(index_drivers())
RuntimeError: No index driver found for 's3aio_index'. 1 available: default

How we can manage to install datacube from csiro/execution-engine branch with driver index s3 support?? Thanks in advance

petewa commented 6 years ago

I just tried this out, and it seems to work. I left out the apt-get installs.

conda create -n dc36 python=3.6
source activate dc36
git clone git@github.com:opendatacube/datacube-core.git
cd datacube-core/
git checkout -f csiro/execution-engine
pip install '.[test,celery,s3]'
createdb -U <user_name> -h <host_name> <db_name>
datacube -v system init

There are some additions/changes to .datacube.conf

[user]
#default_environment: datacube
default_environment: s3aio_env

[datacube]
db_hostname: <host_name>
db_database: <db_name>
db_username: <user_name>

redis.host:
redis.port: 6379
redis.db: 0
redis.password:

redis_celery.host:
redis_celery.port: 6379
redis_celery.db: 1
redis_celery.password:

execution_engine.result_bucket: eetest2
execution_engine.use_s3: False

[s3aio_env]
db_hostname: <host_name>
db_database: <db_name>
db_username: <user_name>
index_driver: s3aio_index

redis.host:
redis.port: 6379
redis.db: 0
redis.password:

redis_celery.host:
redis_celery.port: 6379
redis_celery.db: 1
redis_celery.password:

execution_engine.result_bucket: eetest2
execution_engine.use_s3: False

The default_environment variable in .datacube.conf is what datacube system init picks up when initialising the postgres database.

palmoreck commented 6 years ago

Hi,

Thank you for your answers, we really appreciate them. I don't want to be a headache nor being negative but I copy-pasted the lines of @petewa in our Dockerfile with little changes (for example for cloning as I dont' have ssh access and pip3):

RUN git clone https://github.com/opendatacube/datacube-core.git && cd datacube-core/ && git checkout -f csiro/execution-engine && pip3 install '.[test,celery,s3]'

(output of this RUN):

Successfully installed OWSLib-0.16.0 SharedArray-3.0.0 amqp-2.3.2 astroid-1.6.5 atomicwrites-1.1.5 billiard-3.5.0.3 boto3-1.4.3 cachetools-2.1.0 celery-4.2.0 cf-units-1.2.0 cffi-1.11.5 cftime-1.0.0 compliance-checker-4.0.1 coverage-4.5.1 cython-0.28.3 datacube-1.6rc1+145.g8dac2e5 dill-0.2.8.2 graphviz-0.8.3 hypothesis-3.64.0 isodate-0.6.0 isort-4.3.4 jsonschema-2.6.0 kombu-4.2.1 lazy-object-proxy-1.3.1 lxml-4.2.2 mccabe-0.6.1 mock-2.0.0 more-itertools-4.2.0 multiprocess-0.70.6.1 netcdf4-1.4.0 objgraph-3.4.0 pandas-0.23.1 pathos-0.2.2 pbr-4.0.4 pendulum-2.0.2 pluggy-0.6.0 pox-0.2.4 ppft-1.6.4.8 py-1.5.3 pycodestyle-2.4.0 pycparser-2.18 pygeoif-0.7 pylint-1.9.2 pypeg2-2.15.2 pyproj-1.9.5.1 pytest-3.6.2 pytest-cov-2.5.1 pytest-timeout-1.3.0 pytzdata-2018.5 redis-2.10.6 regex-2017.7.28 s3transfer-0.1.13 singledispatch-3.4.0.3 sqlalchemy-1.2.8 tabulate-0.8.2 vine-1.1.4 wrapt-1.10.11 xarray-0.10.7 zstandard-0.9.1

used the same .datacube.conf (structure) but when doing the init, it says we don't have support for s3 driver:

$ datacube system init
2018-06-26 19:07:49,098 32 datacube.drivers.driver_cache WARNING Failed to resolve driver datacube.plugins.index::s3aio_index
Traceback (most recent call last):
  File "/usr/local/bin/datacube", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 87, in augment_usage_errors
    yield
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 192, in new_func
    return f(parsed_config, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/datacube/ui/click.py", line 217, in with_index
    validate_connection=expect_initialised)
  File "/usr/local/lib/python3.5/dist-packages/datacube/index/_api.py", line 38, in index_connect
    driver_name, len(index_drivers()), ', '.join(index_drivers())
RuntimeError: No index driver found for 's3aio_index'. 1 available: default

maybe we're missing something but we can't figure out what it is. Please see our apt-get installations:

FROM ubuntu:xenial
USER root
#see: https://github.com/Yelp/dumb-init/ for a justification of next line:
RUN apt-get update && apt-get install -y wget curl && wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v$(curl -s https://api.github.com/repos/Yelp/dumb-init/releases/latest| grep tag_name|sed -n 's/  ".*v\(.*\)",/\1/p')/dumb-init_$(curl -s https://api.github.com/repos/Yelp/dumb-init/releases/latest| grep tag_name|sed -n 's/  ".*v\(.*\)",/\1/p')_amd64 && chmod +x /usr/local/bin/dumb-init

#dependencies
RUN apt-get update && apt-get install -y \
    openssh-server \
    openssl \
    sudo \
    wget \
    nano \
    software-properties-common \
    python-software-properties \
    git \
    vim \
    vim-gtk \
    htop \
    build-essential \
    libssl-dev \
    libffi-dev \
    cmake \
    python3-dev \
    python3-pip \
    python3-setuptools \
    ca-certificates \
    postgresql-client \
    awscli \
        pip3 install --upgrade pip==9.0.3

#Install spatial libraries
RUN add-apt-repository -y ppa:ubuntugis/ubuntugis-unstable && apt-get -qq update
RUN apt-get install -y \
    netcdf-bin \
    libnetcdf-dev \
    ncview \
    libproj-dev \
    libgeos-dev \
    gdal-bin \
    libgdal-dev

##Install dask distributed
RUN pip3 install dask distributed --upgrade && pip3 install bokeh

##Install missing package for open datacube:
RUN pip3 install --upgrade python-dateutil

#Dependencies for datacube and app
RUN pip3 install numpy && pip3 install GDAL==$(gdal-config --version) --global-option=build_ext --global-option='-I/usr/include/gdal' && pip3 install rasterio==1.0b1 --no-binary rasterio
RUN pip3 install scipy cloudpickle sklearn lightgbm fiona django --no-binary fiona
RUN pip3 install --no-cache --no-binary :all: psycopg2

#Next for compliance-checker installation
RUN apt-get clean && apt-get update && apt-get install -y locales
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8

#datacube:
RUN git clone https://github.com/opendatacube/datacube-core.git && cd datacube-core/ && git checkout -f csiro/execution-engine && pip3 install '.[test,celery,s3]'

....(next lines invole installations for our app).....

Thanks for your support!

petewa commented 6 years ago

Hi @palmoreck,

I'm happy to help, we can have a webex hookup or similar and I can walk you through the process. This might be the fastest way to resolve this. Please pm me on slack peterw

I can't see anything wrong at first glace. The error comes from the IndexDriverCache not having the s3aio_index. This is set in setup.py. i.e. https://github.com/opendatacube/datacube-core/blob/csiro/execution-engine/setup.py Since you are cloning the repo fresh, this shouldn't be the issue.

There are some differences in the apt libraries and apt-repository you are using. Try using this: https://github.com/opendatacube/datacube-core/blob/csiro/execution-engine/Dockerfile as a basis.

Failing that, is there another old datacube installed in pip/pip3? pip uninstall datacube pip3 uninstall datacube before pip3 install '.[test,celery,s3]'

Another thing to make sure is you have .aws/credentials file in your home directory.