qld-gov-au / ckanext-s3filestore

Use Amazon S3 as a filestore for CKAN
GNU Affero General Public License v3.0

Suggestions to improve the documentation. #112

Open vabatista opened 8 months ago

vabatista commented 8 months ago

Here are some suggestions to improve the documentation.

Our CKAN installation runs on AWS, with the main CKAN application in a Fargate/ECS container. The documentation was missing some instructions for this kind of installation, but we found that the following steps were enough:

  1. git clone s3filestorage repo
  2. pip install boto3 && python setup.py install
  3. add s3filestorage in ckan.plugin list

We encountered an issue with a self-signed certificate. Our "solution" was to edit the source code and insert verify=False in the get_s3_resource and get_s3_client methods. We will address this properly later.

Another issue was that our company policy prohibits ACLs on S3 buckets, so they are disabled. Consequently, updating the ACL after a file upload causes errors. We had to modify the source code to prevent this.
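To illustrate the kind of change involved, here is a minimal sketch, not the plugin's actual code. It assumes a hypothetical `acl` setting where `None` means the bucket has ACLs disabled; the function names are also made up for illustration. The only real API relied on is boto3's `put_object`, which accepts an optional `ACL` argument.

```python
def build_put_object_kwargs(bucket, key, body, acl=None):
    """Build keyword arguments for boto3's S3 ``put_object`` call.

    ``acl`` is a hypothetical setting: when it is None (or empty), no ACL
    argument is sent at all, which is what a bucket with ACLs disabled needs.
    """
    kwargs = {"Bucket": bucket, "Key": key, "Body": body}
    if acl:
        # Only request an ACL when one is explicitly configured.
        kwargs["ACL"] = acl
    return kwargs


def upload(s3_client, bucket, key, body, acl=None):
    """Upload ``body`` using an existing boto3 S3 client (hypothetical helper)."""
    s3_client.put_object(**build_put_object_kwargs(bucket, key, body, acl))
```

With `acl=None` the request carries no ACL, so nothing needs to be updated after the upload either.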

ThrawnCA commented 8 months ago

@vabatista It looks to me like the problem is that we only have instructions for installing from PyPI, not for installing this fork. Following the usual procedure for a source installation, boto3 would be installed from the requirements file.

I'll update that.

When you say s3filestorage, do you mean s3filestore?

As for self-signed certificates, I would suggest adding the certificate to the recognised authorities on your CKAN instance.
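If changing the system trust store is awkward (for example, in a container image), boto3 can also be pointed at a CA bundle directly instead of disabling verification. The sketch below is an illustration under assumptions, not the plugin's code: the helper names and the bundle path are hypothetical. It relies on boto3's `verify` argument accepting a path to a CA bundle, and on the `AWS_CA_BUNDLE` environment variable.

```python
import os


def resolve_verify(ca_bundle_path=None):
    """Return a value for boto3's ``verify`` argument.

    A string is treated as the path to a CA bundle (PEM file) to verify
    against; True means verify against the default trust store.
    Certificate verification is never switched off here.
    """
    path = ca_bundle_path or os.environ.get("AWS_CA_BUNDLE")
    return path if path else True


def make_s3_client(ca_bundle_path=None):
    """Create an S3 client that trusts the given CA bundle (hypothetical helper)."""
    import boto3  # imported lazily so resolve_verify() works without boto3

    return boto3.client("s3", verify=resolve_verify(ca_bundle_path))
```

For example, `make_s3_client("/etc/ssl/certs/internal-ca.pem")` would verify TLS against that (hypothetical) bundle rather than using `verify=False`.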

ThrawnCA commented 8 months ago

@vabatista How does the updated README look?

duttonw commented 8 months ago

Hi @vabatista

Could you provide us with a PR that adds an option to turn off the ACL updates where the S3 bucket has ACLs disabled at the account or bucket level?

Yes, we know about the ACL issue and that it goes against the CIS benchmark. We originally had this set to private, and that seemed fine, but then the resources started to end up in search results, and people were unhappy that the deep-linked public assets could not be clicked on after indexing.

We have not found a good solution for mixing private and public datasets, where an author wants to pull access to the connected resource objects without deleting them. One option is to set the S3 bucket to private and use CloudFront OAC/OAI, but then anyone who knows the object URL can download the objects, which is no good for private assets (secure, author-only viewing).

Do you have any ways forward on this matter?