gubuntu opened 2 years ago
Investigating options: there are a number of parameters to take into account when comparing CloudSQL (managed PostgreSQL instance), PostgreSQL running inside a VM instance, and docker-PostGIS. A few I can think of are:
Note: the number of users isn't expected to scale the way the data will, so scalability isn't taken as a comparison parameter.
Simplicity: the interaction model with managed CloudSQL is through the console, APIs, and CLI tools. For non-technical staff of WRO this is the simplest option, as the console makes it easy to interact with the database and provision resources. From a development side, CloudSQL is also easier to integrate with the parts of the WRO system that rely on other GCP services (GKE, BigQuery). Overall, CloudSQL is simpler than the other two.
Security: for privacy, this is a matter of the service-level agreement; for CloudSQL, Google ensures data encryption at rest and in transit. As we are hosting CKAN within the same region as CloudSQL, the connection can be made over a private IP, which adds a security layer. From a connection point of view, though, having Postgres installed in the same VM as CKAN can be more secure. For docker-PostGIS it depends on how we deploy it: there is no need to scale the admin DB, so Kubernetes is not an option here, which means we would connect a Docker container to CKAN in the same VM. On the other hand, having the database and the application in the same VM presents a single point of failure.
Extensibility: all three options can be extended with PostGIS, but both docker-PostGIS and Postgres running in a VM are more customizable than the managed service; this is the strongest case for those solutions.
Features: CloudSQL comes with a lot of features out of the box, most notably failover, automatic backups, autoscaling, and automatically applied patches and updates. The same can be achieved with the other two solutions with a bit of configuration.
Thus I think we should go with CloudSQL (managed service).
Investigating cloud storage with CKAN: CKAN stopped supporting file uploads to cloud storage providers in version 2.2 (Feb 2014) and uses the local CKAN “FileStore” instead. Files can be uploaded to local storage through the resource_create() and/or resource_update() actions. For cloud storage we can either extend CKAN or use a plugin (e.g. ckanext-cloudstorage); the suggested approach is to begin with the second option and loop back to the first. ckanext-cloudstorage supports all the cloud providers supported by the open-source libcloud Python package (https://libcloud.apache.org/), which includes GCP Cloud Storage, so I will work on testing this after going through issue#3 (https://github.com/kartoza/WRODataPlatform/issues/3), i.e. creating the metadata form.
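For orientation, uploading a file through the resource_create action is a multipart POST to the CKAN action API. The sketch below builds the request pieces; the instance URL, dataset name and token are placeholders, and the actual HTTP call is left commented so the snippet stays self-contained:

```python
# Sketch: calling CKAN's resource_create action to put a file into the
# FileStore. CKAN_URL, API_KEY and the dataset name are placeholders.
CKAN_URL = "http://130.211.222.159"  # placeholder instance URL
API_KEY = "<ckan-api-token>"         # placeholder sysadmin/editor token

action_url = f"{CKAN_URL}/api/3/action/resource_create"
payload = {
    "package_id": "my-dataset",  # dataset the new resource belongs to
    "name": "shapefile1.zip",    # display name of the resource
}

# The real call would be (requires the `requests` package):
# import requests
# requests.post(action_url,
#               data=payload,
#               headers={"Authorization": API_KEY},
#               files={"upload": open("shapefile1.zip", "rb")})
```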
based on meeting feedback today, please do attempt to implement storage into GCS folders.
A possible scenario: in the metadata form, fetch the list of available folders for the user to choose from. This then gets stored in the metadata but is also used during the upload to direct the file storage location.
The current implementation takes WRO themes into account for dynamic directory creation: as a user inputs a Dataset Topic Category (Agriculture, Biodiversity, etc.), that category gets created as a GCS directory, and resources are stored within it. Note: GCS doesn't hold directories in the technical sense; the folder structures and paths shown in GCS are visualizations of the stored objects' names (see https://cloud.google.com/storage/docs/folders).
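In other words, the "folder" a resource lands in is just a prefix on the object name. A minimal illustration (the function name is hypothetical, not the extension's code):

```python
def object_path(category: str, filename: str) -> str:
    """Build the GCS object name for a resource from its topic category.

    "Agriculture/soil.csv" is a single object; the GCS console renders the
    part before the "/" as a folder, but no directory object exists."""
    return f"{category}/{filename}"
```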
@Mohab25 please document behaviour in all practical situations of uploading files so that we have clear procedures about file naming conventions, how to avoid overwriting or inadvertently deleting files, differentiating between apparent duplicates (same 'folder' and file name but different object ID) etc.
@gubuntu for the uploading/deletion situation: ckanext-cloudstorage has custom behavior and uses the underlying apache-libcloud (https://libcloud.apache.org/) to control cloud objects (covering other vendors besides Google), while CKAN uses its own flow to upload/delete resources, so I had to re-write the upload/deletion logic in a custom extension in order to have a stable flow. We are now using the GCS Python client library to upload/delete objects from the bucket (https://cloud.google.com/storage/docs/uploads-downloads).
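A minimal sketch of that flow, mirroring the google-cloud-storage client API; the function names are illustrative, and the `client` is injected (normally `google.cloud.storage.Client()`) so the flow can be exercised without GCP credentials:

```python
def upload_object(client, bucket_name, object_name, fileobj):
    """Upload a CKAN resource file to the bucket.

    `client` follows the google-cloud-storage Client interface:
    client.bucket(name).blob(name) returns a Blob to upload into."""
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_file(fileobj)
    return object_name

def delete_object(client, bucket_name, object_name):
    """Remove the bucket object backing a deleted CKAN resource."""
    client.bucket(bucket_name).blob(object_name).delete()
```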
For name resolution, instead of generating a random string we now use the resource id from the database, which is guaranteed to be unique. The name mapping is [resourceName_id_resourceId.ext, for example shapefile1_id_3dsfsdfss.shp]. This way we keep a connection between the cloud object and the CKAN database, and can separate objects with the same name.
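That mapping can be sketched as a pair of pure functions (hypothetical helpers illustrating the naming scheme described above, not the extension's actual code):

```python
from pathlib import PurePosixPath

def object_name(resource_name: str, resource_id: str) -> str:
    """Map a CKAN resource to its GCS object name: <stem>_id_<resource id><ext>."""
    p = PurePosixPath(resource_name)
    return f"{p.stem}_id_{resource_id}{p.suffix}"

def resource_id_of(name: str) -> str:
    """Recover the CKAN resource id back out of a stored object name."""
    return PurePosixPath(name).stem.rsplit("_id_", 1)[1]
```

Two uploads of `shapefile1.shp` get distinct object names because their database ids differ, while the id embedded in the name keeps the link back to the CKAN record.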
@Mohab25 the above sounds good. Is there now a platform available that we can test the data upload facility? http://130.211.222.159/ is not linking to the ckan interface.
@mikev3003 hi Michael, I still don't have permission to create a private Google CloudSQL instance (see the highlight in the screenshot). I want to be able to create a private connection, and also to create a service account (https://cloud.google.com/iam/docs/service-accounts) in GCP to connect CKAN to your Cloud Storage instance.
@Mohab25 I have changed your role to 'owner' could you please check if you are able to do it now? Otherwise I suggest we jump on a call together and go through the permission options to get you what you need.
@mikev3003 Hello professor, OK, I've scheduled a meeting; see if that is a good time for you. I've also sent you an email with some credentials for the CKAN site admin; give http://130.211.222.159/ a check.
Hello professor, it seems you weren't able to attend the call. To summarize, I need a new CloudSQL (Postgres) instance with the following configs:
after that we can try to connect ckan with the new one.
@Mohab25 I've given @gubuntu owner rights - hopefully he can assist with the needed permissions
@Mohab25 data upload forms are looking good. Could you please assist with minor editing as indicated in these pdfs?(https://www.dropbox.com/t/y8xwr9Z1pkk3mZHo) Other comments:
@mikev3003 received, I'm going through these comments one by one. I can tell from the first that you didn't like the first letters capitalized; this is handled now.
@Mohab25 I do not have the "add datasets" option under the datasets tab. I registered using e-mail address cindy.viviers63@gmail.com, username cindels63. Pretty plz.
@Cindels63 hello, this is the intended behavior I was boring you with during the last meeting ^_^. Ask @mikev3003 to add you as an editor of one of the organizations; otherwise, if you want to be a super admin, I can arrange that.
@Mohab25 with this, my 2c and experience in uploading a typical dataset in hydrogeology.
• In terms of loading 3rd-party data, one cannot add an organisation which isn't already listed. Maybe one could add one if it isn't already registered, especially because it is a compulsory field to complete; also suggest adding the ‘Council for Geoscience’ and ‘Department of Water and Sanitation’.
• Recommend the data reference date should only come up if the data is not static (time-series), or at least not be compulsory.
• The dataset extends across RSA, so after receiving a geographic coordinate error for putting in my own coordinates, I just copied and pasted the example coordinates. Still got an error: Geographic location or bounding box coordinates: Error decoding JSON object: Extra data: line 1 column 6 - line 1 column 29 (char 5 - 28)
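For reference, an "Extra data" error from a JSON decoder usually means the field received more than one JSON value (e.g. bare space-separated numbers). Assuming the field follows the ckanext-spatial convention of a single GeoJSON geometry (the exact format the WRO form expects may differ), a valid RSA-wide bounding box could look like:

```python
import json

# Hypothetical bounding box covering South Africa, as a GeoJSON Polygon.
# Coordinates are [longitude, latitude] pairs, per the GeoJSON spec.
bbox = {
    "type": "Polygon",
    "coordinates": [[
        [16.45, -34.83],
        [32.89, -34.83],
        [32.89, -22.13],
        [16.45, -22.13],
        [16.45, -34.83],  # the ring must close on its first point
    ]],
}
spatial_value = json.dumps(bbox)  # the field takes exactly one JSON document
```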
@Mohab25 I entered a data description at every point, but noticed the dataset is described as not having a description - I just probably did something wrong.
Hello, I uploaded a CSV, also had issues with the bounding box coordinates.
Thank you @Cindels63 and @Christiaan34 for the feedback. As long as you're raising these issues at the early stages of development, chances are we can handle them early. @Cindels63 can you confirm that the description now appears in the datasets search page?
@Cindels63 @Christiaan34, I'm currently working on the geographic extent issues; I recall that MvdL also had issues with it.
@Cindels63 on your point "In terms of loading 3rd-party data – one cannot add an organisation ...": this is intended. You can only add data to organizations where you've been assigned the editor role, and you can't create an organization unless you are a super admin, or unless we temporarily grant registered users permission to create organizations. Normally, though, you want to restrict users' ability to create orgs and manage organization creation centrally. If you want to add the ‘Council for Geoscience’ and ‘Department of Water and Sanitation’, refer to @mikev3003 to create them.
Thank you @Mohab25 - I can confirm the descriptions are now visible.
@mikev3003 @Cindels63 @Christiaan34 good morning team, please try to re-upload data with geographic features; I need feedback on the basic geographic point and bounding box functionality. I think it works now (notice the help text with the field on how to input geographic coords and a bounding box). == Edit == also try the Filter by location functionality on the datasets page (screen below; the map on the left grows bigger once the edit button is clicked). You can draw a boundary to filter which datasets reside within that boundary.
@Mohab25 I tried to upload a land cover map but could only choose a single file for upload. As I understand it, a shapefile needs to be accompanied by several other associated files. It did now accept my GPS co-ordinates, so hopefully that's solved.
@mikev3003 yes, the shapefile must go with at least 2 other files (shx, dbf). An easy fix for this is to zip all the files and upload them at once (this is a general and common workflow). If it's crucial to upload in bulk, let me know and we will put that in the backlog.
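The zip step can be scripted; a small sketch (the helper name and optional .prj handling are my additions):

```python
import os
import zipfile

def zip_shapefile(base_path: str, out_path: str) -> str:
    """Bundle the mandatory shapefile components into one uploadable zip.

    `base_path` is the path without extension, e.g. "data/landcover".
    .shp, .shx and .dbf are required; .prj is included when present."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for ext in (".shp", ".shx", ".dbf", ".prj"):
            part = base_path + ext
            if ext == ".prj" and not os.path.exists(part):
                continue  # projection file is optional
            zf.write(part, arcname=os.path.basename(part))
    return out_path
```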
@Mohab25 ok thanks. If we upload as zip will we be able to view the shape file on a map in ckan when searching for data though? Or is the idea to only to show the general location of the dataset?
@mikev3003 extending CKAN to show detailed spatial data is possible. I would first need the GCP design to be finalized, then I will choose an approach, to prevent duplicated effort; temporarily we will be working with general locations of datasets.
A map view of a dataset, if it contains coordinates or geometry, is part of the plan, see #4
@mikev3003 addressing this: "Geographic location bounding box - have a button with a link to a tool where this can be obtained?" Note that after this one I'm going to link map-related issues to issue#4 linked above by @gubuntu.
I've extended CKAN to add a mapping tool which allows drawing boundaries and selecting geographic points. It can be used right now, but a few tweaks will come later.
@Christiaan34 addressing this: "Recommend data reference date should only come up if data is not static (time-series), or at least not be compulsory". Now the field only appears if the data is not static. @mikev3003 I've also noticed that if a user didn't check the "Is this author a contact person for the dataset?" checkbox, they could still skip giving a contact; now they are forced to either check it or give a contact person for the dataset. The same logic is used with the "Did the author / contact organization collect the data?" field. For the agreement, users are forced to agree.
@Mohab25
@mikev3003 I didn't quite catch 1.; I thought it was a descriptive statement with a (True/False) value indicating that the author/contact (already entered in the form, when the user input either an author or a contact) is the same as the collection organization. Should I re-input them again?
@Mohab25 for #1 the form seems to 'forget' that I indicated (true) that the author/contact organisation also collected the data
@mikev3003 update on the progress bar: after hours of research and testing different alternatives, nothing seems to suffice for our use case. I contacted a senior Google engineer who works on GCS; he had an outdated solution for this specific issue (https://github.com/GoogleCloudPlatform/storage-file-transfer-json-python/blob/master/chunked_transfer.py) and referred me to open a GitHub issue. It seems there is one already open (since 2019, still going) with the exact name "Allow tracking upload progress"; I just made a comment to refresh things and hope that in an upcoming release we get an indicator of upload progress (https://github.com/googleapis/python-storage/issues/27). Meanwhile I will change the upload style from stream to resumable upload, and will focus on the other issues in hand.
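For reference, with the google-cloud-storage client the switch to a resumable upload amounts to giving the Blob a `chunk_size`: a Blob with a non-None chunk_size uploads via the resumable protocol in chunks instead of a single request. A sketch, with the blob injected so it can be exercised without GCP credentials:

```python
CHUNK_SIZE = 5 * 1024 * 1024  # 5 MiB; GCS requires a multiple of 256 KiB

def upload_resumable(blob, local_path, chunk_size=CHUNK_SIZE):
    """Upload `local_path` in chunks so a large transfer can survive
    connection drops, rather than streaming it in one request.

    `blob` follows the google-cloud-storage Blob interface."""
    blob.chunk_size = chunk_size
    blob.upload_from_filename(local_path)
```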
@Mohab25 thanks for the feedback regarding the progress bar. I think the resumable upload system will work, and we can also add a message after the user clicks upload to explain how the system works and to please be patient. I tried uploading my big weather data file from Google Drive with a really strong internet connection but that didn't work either unfortunately.
@mikev3003 added a conditional check, now if the author/contact collected the data the form won't display the collector organization, also the same if the author is the same as the contact.
probably working tomorrow, if you want to meet regarding the large file upload
@Mohab25 I just tried the big file again but still getting the '502 Bad Gateway' message
@Mohab25 Any luck with the 1.7 GB file?
@Mohab25 Site functioning smoothly, but when I click on 'Explore' or 'Update Resource' for supplementary material I am getting an internal server error
There is a UI for uploading data sets to a GCP cloud storage back end
@mikev3003 @Christiaan34 is this satisfied, should we close it? @Jeremy-Prior I would appreciate a quick final round of testing for the above issues.
@Mohab25
@Mohab25 @vermeulendivan Here are the common file formats we expect to be uploaded and how we propose they should be handled in CKAN/GCP. Should be pretty straightforward I imagine?
The living document can be edited or commented on here: https://www.dropbox.com/s/518ray5selfzvcm/20220930_Upload%20file%20types%20and%20processing%20in%20CKAN.xlsx?dl=0
@Mohab25 After a dataset has been created, if the user goes back later and adds an additional data file, this file does not currently get added to the correct folder in GCP. For example, I added 'test_file' to Maize long term trial, and it created a new folder in GCP called maize-long-term-trial instead of placing the file in the folder originally created for the dataset. Where it went: Where it should have gone:
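One plausible cause, for illustration only: if later uploads re-derive the folder by slugifying the dataset title instead of reusing the prefix stored at creation time, the second upload lands in a new folder. The slugify below is a hypothetical reconstruction matching the duplicate "maize-long-term-trial" name, not the extension's actual code:

```python
import re

def slugify(title: str) -> str:
    """Lowercase the title and collapse runs of non-alphanumerics
    into hyphens, the pattern seen in the duplicated folder name."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
```

Storing the original prefix on the dataset and reusing it for every later upload would avoid the mismatch.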
@Mohab25 I tried to delete the 'weather' file in 'maize long term trial' in CKAN but got this message:
@Mohab25 I tried to upload a land cover map for the Olifants river as tiff file format (about 280MB), but receive the following error:
If I then go back to datasets, it appears that the file was uploaded:
But when following the link to GCP there is no file
@Christiaan34 I'm applying new rules to the bucket, to lowercase URLs and remove spaces and special characters; hold on for a moment.
@Mohab25 I'm seeing some duplication that we should keep an eye on
@Mohab25 just tried to upload the 1.6 GB file below but the system timed out https://console.cloud.google.com/storage/browser/wrc_wro_temp/Atlas%20of%20Agrohydrology_2008?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=wrc-wro
We need a UI for uploading data sets to GCP cloud storage backend
Install vanilla CKAN