Closed u10313335 closed 5 months ago
Thanks for submitting your first pull request! You are awesome! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct.
You can meet the other Jovyans by joining our Discourse forum. There is also a intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:
The one question for me is wether the unit should be 'dataset' or 'dataset at a particular version'. This is illustrated by the current unit test failing because the version of the package has changed. The equivalent of sorts in git land is 'git repo url + hash', where the hash is counted as the 'ref'. If no 'ref' is provided, we default to using the latest (or whatever
HEAD
) points to. Does CKAN need something similar?
Thank you @yuvipanda for your review! As a newcomer to Project Jupyter, it really encouraged me.
Regard to the question referring to the unit, there are indeed "activities" that work like the versions for datasets. For example:
https://demo.ckan.org/dataset/activity/sample-dataset-1
However, activities are often unavailable as they can be hidden by admins. In fact, the above demo site of CKAN also restricts the access to activities. Another example is the data.gov.
Therefore, from my point of view, it's much safer to take the 'dataset' rather than the 'dataset at a particular version' as the unit for the CKAN provider.
However, activities are often unavailable as they can be hidden by admins. In fact, the above demo site of CKAN also restricts the access to activities. Another example is the data.gov.
Does this just disable listing previous versions, or does it completely prevent access to previous versions? Do you know why admins chose to prevent this, since they're reducing the utility of their CKAN repository as a reproducible data source?
However, activities are often unavailable as they can be hidden by admins. In fact, the above demo site of CKAN also restricts the access to activities. Another example is the data.gov.
Does this just disable listing previous versions, or does it completely prevent access to previous versions? Do you know why admins chose to prevent this, since they're reducing the utility of their CKAN repository as a reproducible data source?
When the ckan.auth.public_activity_stream_detail
is set to True, it will completely block both list and access to previous versions for the public (without login). The reason was explained in the changelog:
A full history of dataset changes is now displayed in the Activity Stream to admins, and optionally to the public. By default this is enabled for new installs, but disabled for sites which upgrade (just in case the history is sensitive).
AFAIK, most of sites were upgraded from the old versions of CKAN.
Or I think we could still support the versions (urls with activity_id), e.g.:
https://demo.ckan.org/dataset/sample-dataset-1?activity_id=0da6b0a5-9fe5-4c34-9545-9079903a8449
And go to the latest one when the version is missing, e.g.:
https://demo.ckan.org/dataset/sample-dataset-1
I think we could still support the versions (urls with activity_id), e.g.: https://demo.ckan.org/dataset/sample-dataset-1?activity_id=0da6b0a5-9fe5-4c34-9545-9079903a8449 And go to the latest one when the version is missing
That sounds reasonable to me- support a link to a fixed version where possible, fallback to the latest if not. Are there any disadvantages to doing this?
That sounds reasonable to me- support a link to a fixed version where possible, fallback to the latest if not. Are there any disadvantages to doing this?
I don't see any disadvantages in this moment. I will try to handle this in the following days.
The content provider has been updated to support the versions ("activities" in CKAN) and fallback to the latest. There are two types of URL patterns to the activities:
https://SITE_URL/dataset/DATASET_NAME?activity_id=ACTIVITY_ID (for CKAN version 2.9), e.g., https://demo.ckan.org/dataset/sample-dataset-1?activity_id=0da6b0a5-9fe5-4c34-9545-9079903a8449.
https://SITE_URL/dataset/DATASET_NAME/history/ACTIVITY_ID (for CKAN version 2.10). I can't find any CKAN 2.10 sites which opens up access to activities for the public (we are still using the CKAN 2.9...) — sorry about that. But it has been tested with an internal CKAN 2.10 site.
Also note that the metadata_modified
timestamp is still used to determine the "versions" in the repo2docker since the metadata_modified
is available across all CKAN versions.
Hi @yuvipanda @manics, if there is any need for further improvement, please let me know. Thanks!
@u10313335 apologies for the delayed response. I think this is good to go, but I wanted to cleanup some of the URL handling a little bit. I can't seem to push to your branch, nor make a PR to your branch for some reason, so I've pushed my changes to https://github.com/yuvipanda/repo2docker/tree/ckan. Also here's a patch file, which can be applied via git am
if needed:
From 49613412583def1330cd8b7d6e42664728367945 Mon Sep 17 00:00:00 2001
From: YuviPanda <yuvipanda@gmail.com>
Date: Mon, 10 Jun 2024 15:49:40 -0700
Subject: [PATCH 1/2] Use urlencode to construct query strings
---
repo2docker/contentproviders/ckan.py | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/repo2docker/contentproviders/ckan.py b/repo2docker/contentproviders/ckan.py
index 9131b01..c1e31a5 100644
--- a/repo2docker/contentproviders/ckan.py
+++ b/repo2docker/contentproviders/ckan.py
@@ -1,6 +1,6 @@
from datetime import datetime, timedelta, timezone
from os import path
-from urllib.parse import parse_qs, urlparse
+from urllib.parse import parse_qs, urlparse, urlencode
from requests import Session
@@ -86,11 +86,10 @@ class CKAN(ContentProvider):
# handle the activites
if activity_id:
fetch_url = (
- f"{spec['api_url']}activity_data_show?"
- f"id={activity_id}&object_type=package"
+ f"{spec['api_url']}activity_data_show?" + urlencode({"id": activity_id, "object_type": "package"})
)
else:
- fetch_url = f"{spec['api_url']}package_show?id={dataset_id}"
+ fetch_url = f"{spec['api_url']}package_show?" + urlencode({"id": dataset_id})
resp = self.urlopen(
fetch_url,
--
2.43.2
From 45fb7b908527630370b4ff1a99c89c5afafea2dc Mon Sep 17 00:00:00 2001
From: YuviPanda <yuvipanda@gmail.com>
Date: Mon, 10 Jun 2024 17:13:22 -0700
Subject: [PATCH 2/2] Cleanup URL parsing mechanisms
---
repo2docker/contentproviders/ckan.py | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/repo2docker/contentproviders/ckan.py b/repo2docker/contentproviders/ckan.py
index c1e31a5..9c74afd 100644
--- a/repo2docker/contentproviders/ckan.py
+++ b/repo2docker/contentproviders/ckan.py
@@ -43,28 +43,32 @@ class CKAN(ContentProvider):
if not parsed_url.netloc:
return None
- url_parts_1 = parsed_url.path.split("/history/")
- url_parts_2 = url_parts_1[0].split("/")
- if url_parts_2[-2] == "dataset":
- self.dataset_id = url_parts_2[-1]
- else:
+ if "/dataset/" not in parsed_url.path:
+ # Not actually a dataset
return None
- api_url_path = "/api/3/action/"
+ # CKAN may be under a URL prefix, and we should accomodate that
+ url_prefix, dataset_url = parsed_url.path.split("/dataset/")
+
+ dataset_url_parts = dataset_url.split("/")
+ self.dataset_id = dataset_url_parts[0]
+
api_url = parsed_url._replace(
- path="/".join(url_parts_2[:-2]) + api_url_path, query=""
+ path=f"{url_prefix}/api/3/action/", query=""
).geturl()
status_show_url = f"{api_url}status_show"
resp = self.urlopen(status_show_url)
if resp.status_code == 200:
- # handle the activites
+ # Activity ID may be present either as a query parameter, activity_id
+ # or as part of the URL, under `/history/<activity-id>`. If `/history/`
+ # is present, that takes precedence over `activity_id`
activity_id = None
- if parse_qs(parsed_url.query).get("activity_id") is not None:
+ if "history" in dataset_url_parts:
+ activity_id = dataset_url_parts[dataset_url_parts.index("history") + 1]
+ elif parse_qs(parsed_url.query).get("activity_id") is not None:
activity_id = parse_qs(parsed_url.query).get("activity_id")[0]
- if len(url_parts_1) == 2:
- activity_id = url_parts_1[-1]
self.version = self._fetch_version(api_url)
return {
--
2.43.2
If these changes look good to you, apply those and I'll merge it? If not you can also select the 'allow edits from maintainers' button and I'll push these in too.
Thanks @yuvipanda again for the reviews and suggestions! I have applied your commits.
I've somehow missed your comment - sorry about that, @u10313335.
Tests are now unfortunately failing due to https://github.com/jupyterhub/repo2docker/issues/1354. I'm running all these tests locally - if they pass, I'll merge this PR (as it's already been waiting so long!) before figuring out how to fix that.
Thank you for your patience.
I was able to successfully run these tests locally, so I'm going to merge this one. I'll try to come back to the associated binderhub PR soon as well.
Thank you so much for your patience as we worked through this, @u10313335! Hope to continue to see PRs from you, and I think our reviewing speeds should be better now.
This PR add support for CKAN datasets as content provider for repo2docker, e.g.:
It has been tested on some CKAN repositories such as:
We also adopt this PR in our Binder service (as the
CKAN dataset
repository): https://binder.depositar.io/, e.g.: