kartoza / WRODataPlatform

WRC Water Research Observatory Data Platform

Data set upload mechanism #2

Open gubuntu opened 2 years ago

gubuntu commented 2 years ago

We need a UI for uploading data sets to GCP cloud storage backend

Install vanilla CKAN

Mohab25 commented 2 years ago

Investigating options: there are a number of parameters to take into account when comparing Cloud SQL (a managed PostgreSQL instance), PostgreSQL running inside a VM instance, and docker-PostGIS; a few I can think of are:

Note: the number of users isn't expected to scale to the level of the data, so scalability isn't used as a comparison parameter.

Simplicity: the interaction model with managed Cloud SQL is through the console, APIs, and CLI tools. For non-technical WRO staff this is the simplest option, since the console makes it easy to interact with the database and provision resources. From a development side, Cloud SQL is easier to integrate with the parts of the WRO system that rely on other GCP services (GKE, BigQuery). Overall, Cloud SQL is simpler than the other two.

Security: privacy is a matter of the service-level agreement. For Cloud SQL, Google ensures data encryption at rest and in transit, and since we are hosting CKAN in the same region as Cloud SQL, the connection can be made over a private IP, which adds a security layer. From a connection standpoint, having Postgres installed in the same VM as CKAN can be more secure. For docker-PostGIS it depends on how we deploy it: there is no need to scale the admin DB, so Kubernetes is not an option here, which means connecting a Docker container to CKAN in the same VM. On the other hand, putting the database and the application in the same VM creates a single point of failure.

Extensibility: all three options can be extended with PostGIS, but both docker-PostGIS and Postgres running in a VM are more customizable than the managed service; this is the strongest case for those solutions.

Features: Cloud SQL comes with a lot of features out of the box, most notably failover, automatic backups, and autoscaling, and patches and updates are applied automatically. The same can be achieved with the other two solutions with some configuration.

Thus I think we should go with Cloud SQL (the managed service).

Mohab25 commented 2 years ago

Investigating cloud storage with CKAN: CKAN stopped supporting file uploads to cloud storage providers in version 2.2 (Feb 2014) and uses the local CKAN "FileStore" instead. Files can be uploaded to local storage through the resource_create() and/or resource_update() actions. For cloud storage we can either extend CKAN or use a plugin (e.g. ckanext-cloudstorage); the suggested method is to begin with the second option and loop back to the first. ckanext-cloudstorage supports all the cloud providers supported by the open-source libcloud Python package (https://libcloud.apache.org/), which includes GCP Cloud Storage, so I will work on testing this after going through issue #3 (https://github.com/kartoza/WRODataPlatform/issues/3), i.e. creating the metadata form.
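If we go the ckanext-cloudstorage route, wiring it to GCS should mostly be a matter of CKAN ini configuration. A sketch of what that might look like (option names as I recall them from the extension's README, so verify against the installed version; the bucket name and credentials are placeholders):

```ini
ckan.plugins = ... cloudstorage

# libcloud provider id for Google Cloud Storage
ckanext.cloudstorage.driver = GOOGLE_STORAGE
ckanext.cloudstorage.container_name = wro-datasets
ckanext.cloudstorage.driver_options = {"key": "<service-account-email>", "secret": "<service-account-secret>"}
```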

gubuntu commented 2 years ago

based on meeting feedback today, please do attempt to implement storage into GCS folders.

A possible scenario: in the metadata form, fetch the list of available folders for the user to choose from. This then gets stored in the metadata but is also used during the upload to direct the file to the storage location.

Mohab25 commented 2 years ago

The current implementation takes WRO themes into account for dynamic directory creation: when a user inputs a Dataset Topic Category (Agriculture, Biodiversity, etc.), that category gets created as a GCS directory, and resources are stored within those directories. Note: GCS doesn't hold directories in the technical sense; the structures and paths shown in GCS are visualizations of the stored objects (see https://cloud.google.com/storage/docs/folders).
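To make the "folders are just prefixes" point concrete, here is a minimal sketch (hypothetical names; the theme string becomes a prefix in the object name, which the GCS console then renders as a folder):

```python
def theme_object_path(theme: str, filename: str) -> str:
    """Build the object name that the GCS console will render as theme/filename.
    There is no real directory: the slash is just part of the object name."""
    return f"{theme}/{filename}"

def upload_resource(bucket_name: str, theme: str, local_path: str, filename: str) -> None:
    """Upload under a theme prefix (requires google-cloud-storage and credentials)."""
    from google.cloud import storage  # deferred: only needed at upload time
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(theme_object_path(theme, filename))
    blob.upload_from_filename(local_path)
```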

gubuntu commented 2 years ago

@Mohab25 please document behaviour in all practical situations of uploading files so that we have clear procedures about file naming conventions, how to avoid overwriting or inadvertently deleting files, differentiating between apparent duplicates (same 'folder' and file name but different object ID) etc.

Mohab25 commented 2 years ago

@gubuntu for the upload/deletion situation: ckanext-cloudstorage has custom behavior and uses the underlying apache-libcloud (https://libcloud.apache.org/) to control cloud objects (covering other vendors besides Google), and CKAN uses a certain flow to upload/delete resources, so I had to re-write the upload/deletion logic in a custom extension in order to have a stable flow. We are now using the GCS Python client library to upload/delete objects from the bucket (https://cloud.google.com/storage/docs/uploads-downloads).

For name resolution, instead of generating a random string we now use the resource id from the database, which is guaranteed to be unique. The name mapping is resourceName_id_resourceId.ext, for example shapefile1_id_3dsfsdfss.shp. This way we keep a connection between the cloud object and the CKAN database, and can separate objects that share the same name.
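A minimal sketch of that naming rule (hypothetical helper; assumes the resourceName_id_resourceId.ext pattern described above):

```python
import os

def object_name_for(resource_name: str, resource_id: str) -> str:
    """Map a CKAN resource to a unique GCS object name.

    Embedding the resource id (unique in the CKAN database) keeps the link
    between the cloud object and the database row, and disambiguates files
    that share a display name.
    """
    stem, ext = os.path.splitext(resource_name)
    return f"{stem}_id_{resource_id}{ext}"
```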

mikev3003 commented 2 years ago

@Mohab25 the above sounds good. Is there now a platform available on which we can test the data upload facility? http://130.211.222.159/ is not linking to the CKAN interface.

Mohab25 commented 2 years ago

@mikev3003 hi Michael, I still don't have permission to create a private Google Cloud SQL instance (see the highlight in the screenshot). I want to be able to create a private connection, and I also want to be able to create a service account (https://cloud.google.com/iam/docs/service-accounts) in GCP to connect CKAN to your Cloud Storage instance.

Screenshot from 2022-05-18 09-41-17

mikev3003 commented 2 years ago

@Mohab25 I have changed your role to 'owner'. Could you please check if you are able to do it now? Otherwise I suggest we jump on a call together and go through the permission options to get you what you need.

Mohab25 commented 2 years ago

@mikev3003 Hello professor, OK, I've scheduled a meeting; see if the timing works for you. I've also sent you an email with some credentials for the CKAN site admin; give http://130.211.222.159/ a check.

Mohab25 commented 2 years ago

Hello professor, it seems you weren't able to attend the call. To summarize, I need a new Cloud SQL (Postgres) instance with the following configs:

After that we can try to connect CKAN to the new one.

mikev3003 commented 2 years ago

@Mohab25 I've given @gubuntu owner rights - hopefully he can assist with the needed permissions

mikev3003 commented 2 years ago

@Mohab25 data upload forms are looking good. Could you please assist with minor editing as indicated in these PDFs? (https://www.dropbox.com/t/y8xwr9Z1pkk3mZHo) Other comments:

  1. Will it be possible to save progress while the user fills out the form?
  2. 'Publication year' - for the whole team to think about what we want there.
  3. Geographic location bounding box - have a button with a link to a tool where this can be obtained? Also to convert between decimal degrees and degrees, minutes, seconds. Does CKAN have something like this available?
  4. Geographic location bounding box - MvdL inserted simple point co-ordinates and they were not accepted. Error message mentioning a JSON file was given.
  5. Should we have an option to include altitude? (for whole team to think about)
  6. Time series vs Static - MvdL uploaded both so suggest we add a 'Both' option
  7. Dataset title - is it possible to show how many characters have been used in real time?
  8. If there is an error in the form that needs to be fixed, it forgets whether you checked that the author is the contact person.
  9. For the upload form, perhaps the description is more about the file format(s) of data to be uploaded (for whole team to think about)
  10. Is it possible to have a progress bar for when data is being uploaded?
  11. It seemed to capture that I agreed to the data management plan as 'False' in the end.
  12. 'Name of data transfer format' is a bit confusing. What if a user has zipped different files with different formats?
  13. After searching for the dataset MvdL has just uploaded, it indicated 'This dataset has no description' but I did provide a description.
  14. Max/min vert extent - maybe we should include a drop-down list of units (km, m, cm, mm, etc.)? (for whole team to think about)

@Mohab25 @gubuntu these are my first set of comments; I'm meeting with the students tomorrow and will provide any additional ones then.

Mohab25 commented 2 years ago

@mikev3003 received, I'm going through these comments one by one. I can tell from the first that you didn't like the first letters capitalized; this is handled now.

Cindels63 commented 2 years ago

@Mohab25 I do not have the "add datasets" option under the datasets tab. I registered using e-mail address cindy.viviers63@gmail.com, username cindels63. Pretty plz.

Mohab25 commented 2 years ago

@Cindels63 hello, this is the intended behavior I was boring you about during the last meeting ^_^ ask @mikev3003 to add you as an editor of one of the organizations. Otherwise, if you want to be a super admin, I can manage that.

Cindels63 commented 2 years ago

@Mohab25 with this, my 2c and experience in uploading a typical dataset in hydrogeology.

• In terms of loading 3rd-party data: one cannot add an organisation which isn't already listed. Maybe one should be able to add one if it isn't already registered, especially because it is a compulsory field to complete; also suggest adding the 'Council for Geoscience' and 'Department of Water and Sanitation'.
• Recommend the data reference date should only come up if the data is not static (i.e. time-series), or at least not be compulsory.
• The dataset extends across RSA, so after receiving a geographic coordinate error for putting in my own coordinates, I just copied and pasted the example coordinates. Still got an error: Geographic location or bounding box coordinates: Error decoding JSON object: Extra data: line 1 column 6 - line 1 column 29 (char 5 - 28)

Cindels63 commented 2 years ago

@Mohab25 I entered a data description at every point, but noticed the dataset is described as not having a description - I probably just did something wrong.

No description 4 dataset detected

Christiaan34 commented 2 years ago

Hello, I uploaded a CSV, also had issues with the bounding box coordinates. image

Mohab25 commented 2 years ago

Thank you @Cindels63 and @Christiaan34 for the feedback; as long as you address these issues at the early stages of development, the chances are we can handle them early. @Cindels63 can you confirm that the description now appears on the datasets search page?

Screenshot from 2022-05-26 14-03-09

Mohab25 commented 2 years ago

@Cindels63 @Christiaan34, I'm currently working on the geographic extent issues; I recall that MvdL also had issues with them.

Mohab25 commented 2 years ago

@Cindels63 regarding your point "In terms of loading 3rd party data – one cannot add an origination ...": this is intended. You only add data to organizations for which you've been assigned the "editor" role. You can't create an organization unless you are a super admin, or unless we temporarily grant registered users permission to create organizations; but you normally want to restrict that ability and manage organization creation centrally. If you want to add the 'Council for Geoscience' and 'Department of Water and Sanitation', refer to @mikev3003 to create these.

Cindels63 commented 2 years ago

> thank you @Cindels63 and @Christiaan34 for the feed back, as long as you addressing these issues at early stages of development, chances that we can handle them early, @Cindels63 can you confirm that the description now appears in the datasets search page
>
> Screenshot from 2022-05-26 14-03-09

Thank you @Mohab25 - I can confirm the descriptions are now visible.

Mohab25 commented 2 years ago

@mikev3003 @Cindels63 @Christiaan34 good morning team. Please try to re-upload data with geographic features; I need feedback on the basic geographic point and bounding box functionality. I think it works now (notice the help text with the field on how to input geographic coords and a bounding box). == Edit == Also try the "Filter by location" functionality on the datasets page (screen below; the map on the left grows bigger once the edit button is clicked). You can draw a boundary to filter for the datasets that reside within that boundary.

Screenshot from 2022-05-27 11-21-38

mikev3003 commented 2 years ago

@Mohab25 I tried to upload a land cover map but could only choose a single file for upload. As I understand it, a shapefile needs to be accompanied by several other associated files. It did now accept my GPS co-ordinates, so hopefully that's solved.

Mohab25 commented 2 years ago

@mikev3003 yes, the shapefile must go with at least 2 other files (shx, dbf). An easy fix for this is to zip all the files and upload them at once (this is a general and common workflow). If it's crucial to upload in bulk, let me know and we will put that in the backlog.
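For users preparing uploads, the zip-the-sidecars step can be scripted; a small sketch (hypothetical helper, assuming the usual .shp/.shx/.dbf trio plus optional .prj/.cpg sidecars):

```python
import zipfile
from pathlib import Path

def zip_shapefile(shp_path: str) -> str:
    """Bundle a shapefile with its sidecars into one zip so it can be
    uploaded as a single CKAN resource."""
    shp = Path(shp_path)
    zip_path = shp.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for ext in (".shp", ".shx", ".dbf", ".prj", ".cpg"):
            part = shp.with_suffix(ext)
            if part.exists():  # .prj and .cpg are optional
                zf.write(part, arcname=part.name)
    return str(zip_path)
```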

mikev3003 commented 2 years ago

@Mohab25 ok thanks. If we upload as a zip, will we still be able to view the shapefile on a map in CKAN when searching for data? Or is the idea only to show the general location of the dataset?

Mohab25 commented 2 years ago

@mikev3003 extending CKAN to show detailed spatial data is possible. I would first need the GCP design to be finalized, and then I'll choose an approach; this is to prevent duplicated effort. Temporarily we will work with the general locations of datasets.

gubuntu commented 2 years ago

A map view of a data set, if it contains coordinates or geometry, is part of the plan; see #4

Mohab25 commented 2 years ago

@mikev3003 addressing this: "Geographic location bounding box - have a button with a link to a tool where this can be obtained?" Note that after this one I'm going to link map-related issues to issue #4 linked above by @gubuntu.

I've extended CKAN to add a mapping tool which allows drawing boundaries and selecting geographic points. It can be used right now, but a few tweaks will come later.

map tool

Mohab25 commented 2 years ago

@Christiaan34 addressing this: "Recommend data reference date should only come up if data is not static (time-series), or at least not be compulsory". Now, if the data is static the field won't appear; otherwise it will. @mikev3003 I've also noticed that if a user didn't check the "Is this author a contact person for the dataset?" checkbox, they could still skip giving a contact; now they are forced to either check it or give a contact person for the dataset. The same logic is used with the "Did the author / contact organization collect the data?" field. For the agreement, users are forced to agree.

mikev3003 commented 1 year ago

@Mohab25

  1. When user checks 'Did the author/contact organisation collect the data?', it seems that the contact organisation is not captured in the dataset information.
  2. On the upload page: 'Is data supplementary?' - please change to 'Supplementary material?'
  3. Is it possible to edit the landing page picture? Capture1
  4. Can you please add a new input 'Recommended citation' on the Create Dataset page immediately below 'Data description'? Thanks!
Mohab25 commented 1 year ago

@mikev3003 I didn't quite catch 1. I thought it was a descriptive statement with a (True/False) value to indicate that the author/contact (who has already been input in the form, when the user entered either an author or a contact) is the same as the collecting organization; should I re-input them again?

  1. Of course this can be overridden or removed; do you have something in mind? We can start theming after the next sprint and see how the UI can be improved.
  2. What should the help text state? (The help text is the phrase under the input.)

mikev3003 commented 1 year ago

@Mohab25 for #1 the form seems to 'forget' that I indicated (true) that the author/contact organisation also collected the data Capture2

4. Preferred citation that users should use for this dataset.

Mohab25 commented 1 year ago

@mikev3003 update on the progress bar: after hours of research and testing different alternatives, it seems nothing suffices for our use case. I've contacted a senior Google engineer who works on GCS; he had an outdated solution for this specific issue (https://github.com/GoogleCloudPlatform/storage-file-transfer-json-python/blob/master/chunked_transfer.py) and referred me to open a GitHub issue, and it seems there is one already open (since 2019 and still going) with the exact name "Allow tracking upload progress". I just left a comment to refresh things, and hopefully in upcoming releases we can get an indicator of upload progress (https://github.com/googleapis/python-storage/issues/27). Meanwhile, I will change the upload style from streamed to resumable upload, and will focus on the other issues at hand.
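With the google-cloud-storage client, switching from a single streamed request to a resumable upload is mainly a matter of setting a chunk size on the blob (chunk sizes must be multiples of 256 KiB). A sketch with hypothetical names:

```python
CHUNK_UNIT = 256 * 1024  # GCS requires chunk sizes in multiples of 256 KiB

def resumable_chunk_size(target_bytes: int) -> int:
    """Round the requested chunk size up to the nearest 256 KiB multiple."""
    return max(CHUNK_UNIT, ((target_bytes + CHUNK_UNIT - 1) // CHUNK_UNIT) * CHUNK_UNIT)

def upload_resumable(bucket_name: str, object_name: str, local_path: str) -> None:
    """Upload with the resumable protocol (requires google-cloud-storage
    and credentials). Setting chunk_size makes the client send the file in
    resumable chunks instead of one streamed request, so a dropped
    connection does not restart the whole transfer."""
    from google.cloud import storage  # deferred: only needed at upload time
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    blob.chunk_size = resumable_chunk_size(10 * 1024 * 1024)  # ~10 MiB chunks
    blob.upload_from_filename(local_path)
```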

mikev3003 commented 1 year ago

@Mohab25 thanks for the feedback regarding the progress bar. I think the resumable upload system will work, and we can also add a message after the user clicks upload to explain how the system works and to please be patient. I tried uploading my big weather data file from Google Drive with a really strong internet connection but that didn't work either unfortunately.

Mohab25 commented 1 year ago

@mikev3003 added a conditional check: now if the author/contact collected the data, the form won't display the collector organization; the same applies if the author is the same as the contact. Screenshot from 2022-07-15 20-09-07

I'm probably working tomorrow if you want to meet regarding the large file upload.

mikev3003 commented 1 year ago

@Mohab25 I just tried the big file again but still getting the '502 Bad Gateway' message

mikev3003 commented 1 year ago

@Mohab25 Any luck with the 1.7 GB file?

mikev3003 commented 1 year ago

@Mohab25 Site functioning smoothly, but when I click on 'Explore' or 'Update Resource' for supplementary material I am getting an internal server error Capture

Jeremy-Prior commented 1 year ago

There is a UI for uploading data sets to a GCP cloud storage back end

Mohab25 commented 1 year ago

@mikev3003 @Christiaan34 is this satisfied? Should we close it? @Jeremy-Prior I would appreciate a quick final round of testing for the above issues.

Jeremy-Prior commented 1 year ago

@Mohab25

mikev3003 commented 1 year ago

@Mohab25 @vermeulendivan Here are the common file formats we expect to be uploaded and how we propose they should be handled in CKAN/GCP. Should be pretty straightforward I imagine?

3OctCapture4

The living document can be edited or commented on here: https://www.dropbox.com/s/518ray5selfzvcm/20220930_Upload%20file%20types%20and%20processing%20in%20CKAN.xlsx?dl=0

mikev3003 commented 1 year ago

@Mohab25 after a dataset has been created, if the user goes back later and adds an additional data file, this file does not currently get added to the correct folder in GCP. For example, I added 'test_file' to Maize long term trial, and it then created a new folder in GCP called maize-long-term-trial instead of placing it in the folder originally created for the dataset. Where it went: 3OctCapture5 Where it should have gone: 3OctCapture6
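One plausible diagnosis is that the folder prefix is re-derived from the dataset title (slugified) on later uploads instead of being reused from the first upload. A hypothetical sketch of the reuse logic, assuming resource URLs embed the folder path:

```python
def storage_prefix_for(dataset: dict) -> str:
    """Reuse the folder prefix of the dataset's first stored object, if any,
    so that later uploads land in the folder created at dataset-creation
    time; fall back to the dataset slug only for the first resource.
    (Hypothetical helper; assumes resource URLs of the form .../<folder>/<file>.)"""
    for res in dataset.get("resources", []):
        url = res.get("url") or ""
        if "/" in url:
            # e.g. ".../Maize long term trial/weather.csv" -> "Maize long term trial"
            return url.rsplit("/", 2)[-2]
    return dataset["name"]  # CKAN's slug, e.g. "maize-long-term-trial"
```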

mikev3003 commented 1 year ago

@Mohab25 I tried to delete the 'weather' file in 'maize long term trial' in CKAN but got this message: 3OctCapture7

Christiaan34 commented 1 year ago

@Mohab25 I tried to upload a land cover map for the Olifants river as tiff file format (about 280MB), but receive the following error: OLC

If I then go back to datasets, it appears that the file was uploaded: OLC1

But when following the link to GCP there is no file OLC2

Mohab25 commented 1 year ago

@Christiaan34 I'm applying new rules to the bucket, to lowercase URLs and remove spaces and special characters; hold on for a moment.
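A sketch of that sanitization rule (hypothetical helper; lowercases, strips accents, and collapses spaces and special characters to hyphens while keeping dots, underscores, and path separators):

```python
import re
import unicodedata

def sanitize_object_name(name: str) -> str:
    """Make a bucket path URL-safe: lowercase, ASCII-only, no spaces or
    special characters (a sketch of the rule described above)."""
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    name = name.lower()
    name = re.sub(r"[^a-z0-9._/-]+", "-", name)  # keep extensions and path separators
    return re.sub(r"-{2,}", "-", name).strip("-")
```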

mikev3003 commented 1 year ago

@Mohab25 I'm seeing some duplication that we should keep an eye on 7OctCapture1 7OctCapture2

mikev3003 commented 1 year ago

@Mohab25 just tried to upload the 1.6 GB file below but the system timed out https://console.cloud.google.com/storage/browser/wrc_wro_temp/Atlas%20of%20Agrohydrology_2008?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=wrc-wro