silverstripe / silverstripe-ckan-registry

BSD 3-Clause "New" or "Revised" License
6 stars 3 forks source link

Populate registry table with data #72

Closed NightJar closed 5 years ago

NightJar commented 5 years ago

Once #71 is complete use the CMS configuration and the JS API from #70 to populate the table with data from the appropriate resource.

This should include managing the pagination on the frontend too. Made through a static config

NightJar commented 5 years ago

As per https://github.com/silverstripe/silverstripe-ckan-registry/issues/71#issuecomment-454974806 the PR for this issue will also add the table headers, as they come parts-in-parcel with the data.

NightJar commented 5 years ago

@camfindlay We've run into an unexpected issue with this ... issue.

The "remove duplicates" option we've created in the CMS is not usable as expected. The CKAN API gives reference to a distinct modifier to queries, which we had expected to be able to use akin to how it does in SQL.

However, in reality it seems to work counter-intuitively. e.g. the Family Services in Auckland set has 145 records. Selecting "Provider Name", "Provider Region", "Provider, District" [sic] (the column names have line breaks in them) we get a whole bunch from the same Provider ID, meaning lots of duplicate rows.

Thought solution: Use the distinct option! :D However, after using this we get... 145 results, including multiple groups of the same - Oh no! D:

Introducing ProviderID to the query shows that they are indeed all the same provider.

So we thought to introduce a search term: Provider+Name=Kelly We get 3 results. Using distinct, we get 1 results... but a total of 3!

This is ... unfortunate. What should we do?

NightJar commented 5 years ago

We thought to switch to the CKAN SQL endpoint which seems to behave more expectly (being plain SQL).

This works great, until your SQL becomes more intense, and the Incapsula WAF blocks the request Which is interesting... select statements are fine until you introduce a subquery or method like count(*)!

ScopeyNZ commented 5 years ago

One issue is that the total doesn't account for any "DISTINCT" option - maybe that's a CKAN issue, or "working as intended"?

We can probably work around this by doing two queries when loading data but it's a bit kludgey.

ScopeyNZ commented 5 years ago

Okay just to summarise:

Problems

Problem 1: Can't distinct on columns

This seems to be by design. I guess that was our mistake. I have a quote from me earlier saying that I can do it, but it does not seem possible now... There's a couple of solutions I can think of from the top of my head:

Problem 2: Distinct option doesn't give the right total

It's unclear if this is by design or not. Either way it means we can't show pagination correctly


We could solve these problem by delving into the datastore_search_sql endpoint, but the CKAN docs make this clear that it probably has limited support per CKAN instance. In our testing with data.govt.nz we've had Incapsula blocking some requests. We could probably solve this issue with some relatively complex SQL queries...:

SELECT count(*), "Provider Name" FROM (SELECT DISTINCT "ProviderID", "Provider Name" FROM "496d2b91-2e9a-4e72-9c07-cf41141aefa6" WHERE "Provider Name" ILIKE "Kelly") as q LIMIT 30

I'd prefer to avoid this though if possible - I think it reduces the compatibility with an arbitrary CKAN.

camfindlay commented 5 years ago

I have a feeling the count information is buggy and we should hit up the ckan project about that. Let's put that aside for now, you can also have the result return with no total (which actually speeds up the request), see include_count=false.

On to the DISTINCT issue...

Here's something you can look over, there might be a solve/workable solution for this in some other standalone work going on got a family services directory front end at https://github.com/ServiceInnovationLab/fsd-spike however note they only show results when you select a filter so they are doing a global distinct. Read on before chasing that rabbit...

One of the things I've noticed playing with the queries to CKAN is that that DISTINCT seems to apply to all the fields you include in fields query string. I did try moving the order of the items in this incase the first field listed is the column being used to apply the DISTINCT, not so much.

Example the above work from the innovation lab makes this query:

https://catalogue.data.govt.nz/api/3/action/datastore_search?distinct=true&resource_id=35de6bf8-b254-4025-89f5-da9eb6adf9a0&fields=FSD_ID,PROVIDER_CLASSIFICATION,LONGITUDE,LATITUDE,PROVIDER_NAME,PUBLISHED_CONTACT_EMAIL_1,PUBLISHED_PHONE_1,PROVIDER_CONTACT_AVAILABILITY,ORGANISATION_PURPOSE,PHYSICAL_ADDRESS,PROVIDER_WEBSITE_1&filters=%7B%22LEVEL_1_CATEGORY%22:%22Basic%20Needs%22%7D

Which returns as expected, a unique set of rows. Note the filter option here.

Now see what happens if I remove the filter and add a sort on the FSD_ID (to prove distinct still applies). ANSWER: unique rows as we expect.

https://catalogue.data.govt.nz/api/3/action/datastore_search?distinct=true&resource_id=35de6bf8-b254-4025-89f5-da9eb6adf9a0&fields=FSD_ID,PROVIDER_CLASSIFICATION,LONGITUDE,LATITUDE,PROVIDER_NAME,PUBLISHED_CONTACT_EMAIL_1,PUBLISHED_PHONE_1,PROVIDER_CONTACT_AVAILABILITY,ORGANISATION_PURPOSE,PHYSICAL_ADDRESS,PROVIDER_WEBSITE_1&sort=FSD_ID

Now let's add in our category columns to the fields string...

https://catalogue.data.govt.nz/api/3/action/datastore_search?distinct=true&resource_id=35de6bf8-b254-4025-89f5-da9eb6adf9a0&fields=FSD_ID,LEVEL_1_CATEGORY,LEVEL_2_CATEGORY,PROVIDER_CLASSIFICATION,LONGITUDE,LATITUDE,PROVIDER_NAME,PUBLISHED_CONTACT_EMAIL_1,PUBLISHED_PHONE_1,PROVIDER_CONTACT_AVAILABILITY,ORGANISATION_PURPOSE,PHYSICAL_ADDRESS,PROVIDER_WEBSITE_1&sort=FSD_ID

BOOM - duplicate rows based on the items included in our defined fields string.

Both LEVEL_1_CATEGORY and LEVEL_2_CATEGORY have unique values, distinct is actually still working... but now we have some new columns in fields making a greater number of distinct rows to display.

This seems to prove that distinct simply applies to all columns you specify (in our case it will be applied to all columns set in the "Display in results" checkboxes.

Let's return to that filter option...

If we both include the LEVEL_1_CATEGORY field in fields AND a filter on LEVEL_1_CATEGORY (which in our case would be the same as applying a dropdown filter)... you get the desired end result, unique rows filtered and including the name of the category in results. You can also at the same time provide a q fulltext search query in the same request and it's respected which gives us our text search + dropdown combo.

Example:

https://catalogue.data.govt.nz/api/3/action/datastore_search?distinct=true&resource_id=35de6bf8-b254-4025-89f5-da9eb6adf9a0&fields=FSD_ID,LEVEL_1_CATEGORY,PROVIDER_CLASSIFICATION,LONGITUDE,LATITUDE,PROVIDER_NAME,PUBLISHED_CONTACT_EMAIL_1,PUBLISHED_PHONE_1,PROVIDER_CONTACT_AVAILABILITY,ORGANISATION_PURPOSE,PHYSICAL_ADDRESS,PROVIDER_WEBSITE_1&filters=%7B%22LEVEL_1_CATEGORY%22:%22Basic%20Needs%22%7D&q=Pathways&sort=FSD_ID

For our purposes it points at a slight change in the UX.

Distinct would need to be turned on at the dataset resource level ("[ ] This dataset contains duplicate rows for categories" or something) and as long as you don't also include those category columns in the results display that are also going to be used for a category based dropdown filter, you'll get a usable register search with expected results.

I think we might have to take this approach to get across the line. It's a minimal change to the UX (drop per column distinct and add to the dataset setup).

The FSD is one of the most edge case datasets, most others will be using single rows with delimiter separated categories in a single columns so we wont have this issue (however it's the FSD that will be a big user of this module so we'll need this workable solve!).

We can solve the end CMS users config-ing for this edge case in docs (eg. "if your dataset is like this then these rules apply, turn on this checkbox, exclude category columns from your search results etc etc"), alter the UX for a single distinct applying at the dataset level. I think thats in min viable solve and I'm good with that.

We can also raise an issue on the CKAN core project to perhaps add a per column distinct option (but that'll be for both a future version of CKAN and this module). We'll also raise a bug on CKAN re: the total calculation respecting distinct which is clearly a problem.

Just a note on returning the count... the purpose of users using this module is for them to find information about items in the registers they have interest in (then drill down for more detail), not "how many of these items are there?" so the count itself is less important (and if they are data included, our reference link to the actual data means they can get the data and slice/dice it however they like). I'd be ok dropping display of the count if the dataset has been setup in the CMS with this new proposed dataset level distinct option (or at a pinch dropping result count feature in Griddle altogether if that's more consistent).

camfindlay commented 5 years ago

Looks like the count issue is actually solved https://github.com/ckan/ckan/pull/4236 (we just need to apply the patch on our instance). In light of that... i'd say assume the count returns correct for Griddle display purposes...

ScopeyNZ commented 5 years ago

Looks like the count issue is actually solved ckan/ckan#4236 (we just need to apply the patch on our instance). In light of that... i'd say assume the count returns correct for Griddle display purposes...

Okay we can run with that. For pagination until that fix is applied the number of records in the result may not match the number specified and you might get duplicates between pages - but I guess that'll be fine.

The other option is that we are able to run a single count query in the first instance - this might be good as we can indicate we don't want a count for other queries:

https://catalogue.data.govt.nz/api/3/action/datastore_search_sql?sql=SELECT%20count(DISTINCT%20%22ProviderID%22)%20FROM%20%22496d2b91-2e9a-4e72-9c07-cf41141aefa6%22

Distinct would need to be turned on at the dataset resource level ("[ ] This dataset contains duplicate rows for categories" or something) and as long as you don't also include those category columns in the results display that are also going to be used for a category based dropdown filter, you'll get a usable register search with expected results.

Yep I agree this is a good stop-gap that I also touched on:

We move the distinct option from "per column" to just be "per results". This actually might make it less confusing for users. This just means that there will never be two results in the table where all the columns are the same value. This should cover the real-life use case? Might be something to think about from a UX point of view

I'll run it past Sacha on Tuesday. FWIW, I think if you end up raising anything against CKAN you should mention that the current behaviour is very mystifying. It seems to apply DISTINCT to all columns that you are "SELECTing" - not really very common behaviour. The only other idea here is that we could perform two API calls for each page - "SELECT" ProviderID with our filters with the "distinct" option on, then "SELECT again filtering on our previously requested ProviderIDs.

ScopeyNZ commented 5 years ago

Doing a little code delving - it looks like the order of the columns should matter and that you were on to something. I can't seem to generate a request that does what I expect it to do though:

https://github.com/ckan/ckan/blob/master/ckanext/datastore/backend/postgres.py#L1252-L1256

ScopeyNZ commented 5 years ago

Never mind I'm getting confused - it does distinct on every column with that syntax. I'm thinking about the DISTINCT ON (column) syntax.