Closed NightJar closed 5 years ago
As per https://github.com/silverstripe/silverstripe-ckan-registry/issues/71#issuecomment-454974806 the PR for this issue will also add the table headers, as they come parts-in-parcel with the data.
@camfindlay We've run into an unexpected issue with this ... issue.
The "remove duplicates" option we've created in the CMS is not usable as expected.
The CKAN API gives reference to a distinct
modifier to queries, which we had expected to be able to use akin to how it does in SQL.
However, in reality it seems to work counter-intuitively. e.g. the Family Services in Auckland set has 145 records. Selecting "Provider Name", "Provider Region", "Provider, District" [sic] (the column names have line breaks in them) we get a whole bunch from the same Provider ID, meaning lots of duplicate rows.
Thought solution: Use the distinct
option! :D
However, after using this we get...
145 results, including multiple groups of the same - Oh no! D:
Introducing ProviderID to the query shows that they are indeed all the same provider.
So we thought to introduce a search term: Provider+Name=Kelly
We get 3 results.
Using distinct, we get 1 results... but a total of 3!
This is ... unfortunate. What should we do?
We thought to switch to the CKAN SQL endpoint which seems to behave more expectly (being plain SQL).
This works great, until your SQL becomes more intense, and the Incapsula WAF blocks the request
Which is interesting... select
statements are fine until you introduce a subquery or method like count(*)
!
One issue is that the total doesn't account for any "DISTINCT" option - maybe that's a CKAN issue, or "working as intended"?
We can probably work around this by doing two queries when loading data but it's a bit kludgey.
Okay just to summarise:
distinct=1
- Result gives us one record, total indicates three recordsThis seems to be by design. I guess that was our mistake. I have a quote from me earlier saying that I can do it, but it does not seem possible now... There's a couple of solutions I can think of from the top of my head:
It's unclear if this is by design or not. Either way it means we can't show pagination correctly
We could solve these problem by delving into the datastore_search_sql
endpoint, but the CKAN docs make this clear that it probably has limited support per CKAN instance. In our testing with data.govt.nz we've had Incapsula blocking some requests. We could probably solve this issue with some relatively complex SQL queries...:
SELECT count(*), "Provider Name" FROM (SELECT DISTINCT "ProviderID", "Provider Name" FROM "496d2b91-2e9a-4e72-9c07-cf41141aefa6" WHERE "Provider Name" ILIKE "Kelly") as q LIMIT 30
I'd prefer to avoid this though if possible - I think it reduces the compatibility with an arbitrary CKAN.
I have a feeling the count information is buggy and we should hit up the ckan project about that. Let's put that aside for now, you can also have the result return with no total (which actually speeds up the request), see include_count=false
.
On to the DISTINCT issue...
Here's something you can look over, there might be a solve/workable solution for this in some other standalone work going on got a family services directory front end at https://github.com/ServiceInnovationLab/fsd-spike however note they only show results when you select a filter so they are doing a global distinct. Read on before chasing that rabbit...
One of the things I've noticed playing with the queries to CKAN is that that DISTINCT seems to apply to all the fields you include in fields
query string. I did try moving the order of the items in this incase the first field listed is the column being used to apply the DISTINCT, not so much.
Example the above work from the innovation lab makes this query:
https://catalogue.data.govt.nz/api/3/action/datastore_search?distinct=true&resource_id=35de6bf8-b254-4025-89f5-da9eb6adf9a0&fields=FSD_ID,PROVIDER_CLASSIFICATION,LONGITUDE,LATITUDE,PROVIDER_NAME,PUBLISHED_CONTACT_EMAIL_1,PUBLISHED_PHONE_1,PROVIDER_CONTACT_AVAILABILITY,ORGANISATION_PURPOSE,PHYSICAL_ADDRESS,PROVIDER_WEBSITE_1&filters=%7B%22LEVEL_1_CATEGORY%22:%22Basic%20Needs%22%7D
Which returns as expected, a unique set of rows. Note the filter
option here.
Now see what happens if I remove the filter
and add a sort
on the FSD_ID (to prove distinct still applies). ANSWER: unique rows as we expect.
https://catalogue.data.govt.nz/api/3/action/datastore_search?distinct=true&resource_id=35de6bf8-b254-4025-89f5-da9eb6adf9a0&fields=FSD_ID,PROVIDER_CLASSIFICATION,LONGITUDE,LATITUDE,PROVIDER_NAME,PUBLISHED_CONTACT_EMAIL_1,PUBLISHED_PHONE_1,PROVIDER_CONTACT_AVAILABILITY,ORGANISATION_PURPOSE,PHYSICAL_ADDRESS,PROVIDER_WEBSITE_1&sort=FSD_ID
Now let's add in our category columns to the fields
string...
https://catalogue.data.govt.nz/api/3/action/datastore_search?distinct=true&resource_id=35de6bf8-b254-4025-89f5-da9eb6adf9a0&fields=FSD_ID,LEVEL_1_CATEGORY,LEVEL_2_CATEGORY,PROVIDER_CLASSIFICATION,LONGITUDE,LATITUDE,PROVIDER_NAME,PUBLISHED_CONTACT_EMAIL_1,PUBLISHED_PHONE_1,PROVIDER_CONTACT_AVAILABILITY,ORGANISATION_PURPOSE,PHYSICAL_ADDRESS,PROVIDER_WEBSITE_1&sort=FSD_ID
BOOM - duplicate rows based on the items included in our defined fields
string.
Both LEVEL_1_CATEGORY and LEVEL_2_CATEGORY have unique values, distinct
is actually still working... but now we have some new columns in fields
making a greater number of distinct rows to display.
This seems to prove that distinct
simply applies to all columns you specify (in our case it will be applied to all columns set in the "Display in results" checkboxes.
Let's return to that filter
option...
If we both include the LEVEL_1_CATEGORY
field in fields
AND a filter
on LEVEL_1_CATEGORY (which in our case would be the same as applying a dropdown filter)... you get the desired end result, unique rows filtered and including the name of the category in results. You can also at the same time provide a q
fulltext search query in the same request and it's respected which gives us our text search + dropdown combo.
Example:
https://catalogue.data.govt.nz/api/3/action/datastore_search?distinct=true&resource_id=35de6bf8-b254-4025-89f5-da9eb6adf9a0&fields=FSD_ID,LEVEL_1_CATEGORY,PROVIDER_CLASSIFICATION,LONGITUDE,LATITUDE,PROVIDER_NAME,PUBLISHED_CONTACT_EMAIL_1,PUBLISHED_PHONE_1,PROVIDER_CONTACT_AVAILABILITY,ORGANISATION_PURPOSE,PHYSICAL_ADDRESS,PROVIDER_WEBSITE_1&filters=%7B%22LEVEL_1_CATEGORY%22:%22Basic%20Needs%22%7D&q=Pathways&sort=FSD_ID
For our purposes it points at a slight change in the UX.
Distinct would need to be turned on at the dataset resource level ("[ ] This dataset contains duplicate rows for categories" or something) and as long as you don't also include those category columns in the results display that are also going to be used for a category based dropdown filter, you'll get a usable register search with expected results.
I think we might have to take this approach to get across the line. It's a minimal change to the UX (drop per column distinct and add to the dataset setup).
The FSD is one of the most edge case datasets, most others will be using single rows with delimiter separated categories in a single columns so we wont have this issue (however it's the FSD that will be a big user of this module so we'll need this workable solve!).
We can solve the end CMS users config-ing for this edge case in docs (eg. "if your dataset is like this then these rules apply, turn on this checkbox, exclude category columns from your search results etc etc"), alter the UX for a single distinct applying at the dataset level. I think thats in min viable solve and I'm good with that.
We can also raise an issue on the CKAN core project to perhaps add a per column distinct option (but that'll be for both a future version of CKAN and this module). We'll also raise a bug on CKAN re: the total calculation respecting distinct which is clearly a problem.
Just a note on returning the count... the purpose of users using this module is for them to find information about items in the registers they have interest in (then drill down for more detail), not "how many of these items are there?" so the count itself is less important (and if they are data included, our reference link to the actual data means they can get the data and slice/dice it however they like). I'd be ok dropping display of the count if the dataset has been setup in the CMS with this new proposed dataset level distinct option (or at a pinch dropping result count feature in Griddle altogether if that's more consistent).
Looks like the count issue is actually solved https://github.com/ckan/ckan/pull/4236 (we just need to apply the patch on our instance). In light of that... i'd say assume the count returns correct for Griddle display purposes...
Looks like the count issue is actually solved ckan/ckan#4236 (we just need to apply the patch on our instance). In light of that... i'd say assume the count returns correct for Griddle display purposes...
Okay we can run with that. For pagination until that fix is applied the number of records in the result may not match the number specified and you might get duplicates between pages - but I guess that'll be fine.
The other option is that we are able to run a single count query in the first instance - this might be good as we can indicate we don't want a count for other queries:
https://catalogue.data.govt.nz/api/3/action/datastore_search_sql?sql=SELECT%20count(DISTINCT%20%22ProviderID%22)%20FROM%20%22496d2b91-2e9a-4e72-9c07-cf41141aefa6%22
Distinct would need to be turned on at the dataset resource level ("[ ] This dataset contains duplicate rows for categories" or something) and as long as you don't also include those category columns in the results display that are also going to be used for a category based dropdown filter, you'll get a usable register search with expected results.
Yep I agree this is a good stop-gap that I also touched on:
We move the distinct option from "per column" to just be "per results". This actually might make it less confusing for users. This just means that there will never be two results in the table where all the columns are the same value. This should cover the real-life use case? Might be something to think about from a UX point of view
I'll run it past Sacha on Tuesday. FWIW, I think if you end up raising anything against CKAN you should mention that the current behaviour is very mystifying. It seems to apply DISTINCT
to all columns that you are "SELECTing" - not really very common behaviour. The only other idea here is that we could perform two API calls for each page - "SELECT" ProviderID with our filters with the "distinct" option on, then "SELECT again filtering on our previously requested ProviderIDs.
Doing a little code delving - it looks like the order of the columns should matter and that you were on to something. I can't seem to generate a request that does what I expect it to do though:
https://github.com/ckan/ckan/blob/master/ckanext/datastore/backend/postgres.py#L1252-L1256
Never mind I'm getting confused - it does distinct on every column with that syntax. I'm thinking about the DISTINCT ON (column)
syntax.
Once #71 is complete use the CMS configuration and the JS API from #70 to populate the table with data from the appropriate resource.
This should include managing the pagination on the frontend too. Made through a static config