kartoza / WRODataPlatform

WRC Water Research Observatory Data Platform

CKAN must index data in GCP that was not loaded through CKAN #13

Open gubuntu opened 2 years ago

gubuntu commented 2 years ago

If someone uploads data through CKAN, CKAN will be aware of it.

If someone uploads a file into Google Cloud Storage via the GCP console, CKAN needs to be made aware of it and index it and prompt the user to capture metadata for it.

Mohab25 commented 2 years ago

@gubuntu I've experimented with this using two approaches (Pub/Sub messages and Cloud Functions). I was able to automatically create a link to the Cloud Storage data from CKAN, as a CKAN resource, using its API. Comparing the two options cost-wise, Cloud Functions is the better solution: we only need to keep a reference to the cloud objects, not the objects themselves, so only a light message holding the object's path and name is needed.
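For illustration, a minimal sketch of what such a Cloud Function could look like, not the code used in the experiment. It assumes a background function triggered by `google.storage.object.finalize`, whose event carries the `bucket` and `name` of the new object, and it calls CKAN's `resource_create` action. The CKAN URL, API token and target dataset id are placeholders, and the `storage.googleapis.com` link only resolves for callers who have read access to the object.

```python
import json
import urllib.parse
import urllib.request

# Placeholders (assumptions, not values from this project).
CKAN_URL = "https://ckan.example.org"
CKAN_API_TOKEN = "replace-me"
DEFAULT_DATASET_ID = "gcp-uploads"


def build_resource_payload(event, dataset_id=DEFAULT_DATASET_ID):
    """Turn a Cloud Storage object event into a resource_create payload."""
    bucket = event["bucket"]
    object_name = event["name"]  # full object path, e.g. "folder/file.csv"
    # Only a reference to the object is stored in CKAN, not the data itself.
    url = "https://storage.googleapis.com/{}/{}".format(
        bucket, urllib.parse.quote(object_name)
    )
    return {
        "package_id": dataset_id,
        "url": url,
        "name": object_name.rsplit("/", 1)[-1],
    }


def build_request(payload):
    """Build the HTTP request for CKAN's resource_create action."""
    return urllib.request.Request(
        CKAN_URL + "/api/3/action/resource_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": CKAN_API_TOKEN,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def on_object_finalized(event, context):
    """Entry point: runs once for each object written to the bucket."""
    request = build_request(build_resource_payload(event))
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["success"]
```

Keeping the payload construction separate from the network call makes it possible to check the generated link without a running CKAN instance.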

gubuntu commented 2 years ago

So, a Cloud Function makes a call to the CKAN API when a resource is uploaded to the bucket, and that call either creates a new dataset record in CKAN or adds the resource to an existing dataset, depending on the 'folder' where you upload it?

What about capturing the metadata for a new dataset created this way?
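One way the folder-based routing asked about here could work is sketched below, purely as an assumption about a possible design. The top-level 'folder' of the object path is turned into a CKAN dataset name. If no dataset with that name exists yet, a `package_create` call is planned first, with the dataset kept private and marked with a hypothetical `metadata_status` extra so that a curator can later be prompted to complete the metadata. The organisation id, the extra's key and the policy of ignoring files at the bucket root are all illustrative. In practice the set of existing datasets would come from CKAN itself, for example via `package_show`.

```python
import re


def dataset_name_for(object_name):
    """Derive a CKAN dataset name from the object's top-level 'folder'.

    Returns None for objects stored at the bucket root.
    """
    if "/" not in object_name:
        return None
    folder = object_name.split("/", 1)[0].lower()
    # CKAN dataset names allow lowercase letters, digits, '-' and '_'.
    slug = re.sub(r"[^a-z0-9_-]+", "-", folder).strip("-")[:100]
    return slug or None


def plan_ckan_calls(object_name, resource_url, existing_datasets,
                    owner_org="hypothetical-org"):
    """Return the (action, payload) pairs to send to the CKAN action API."""
    dataset = dataset_name_for(object_name)
    if dataset is None:
        return []  # hypothetical policy: ignore files outside any folder
    calls = []
    if dataset not in existing_datasets:
        calls.append(("package_create", {
            "name": dataset,
            "owner_org": owner_org,  # private datasets need an organisation
            "private": True,
            # Hypothetical flag marking datasets that still need metadata.
            "extras": [{"key": "metadata_status", "value": "pending"}],
        }))
    calls.append(("resource_create", {
        "package_id": dataset,  # dataset name used as the identifier
        "url": resource_url,
        "name": object_name.rsplit("/", 1)[-1],
    }))
    return calls
```

A search on the `metadata_status` extra could then drive a "metadata required" list in the CKAN interface, although that part is outside this sketch.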

Jeremy-Prior commented 2 years ago

We (@ThiashaV and I) uploaded test datasets to GCP and discovered that if a dataset is added to a folder that was created through CKAN, it is reflected in BigQuery but not in CKAN.

We also discovered that if a dataset is uploaded to a new folder on GCP, it is not reflected in BigQuery.

We were not prompted to enter metadata for any dataset uploaded to GCP.