User story: As an analyst I want to query data from a specific publisher

duncandewhurst commented 6 years ago

For example, I want to understand which item classification schemes are used in the data published by NSW, Australia.

I can do that by using a WHERE clause with the publisher's ocid prefix:

SELECT
  items -> 'classification' ->> 'scheme' AS scheme,
  COUNT(*) AS totalCount
FROM
  data
CROSS JOIN
  jsonb_array_elements(data -> 'tender' -> 'items') AS items
WHERE
  data ->> 'ocid' LIKE 'ocds-43qwtd%'
GROUP BY
  scheme
ORDER BY
  totalCount DESC;

However that approach has two issues:

If there are multiple loads of the publisher's data in the database (can that happen?) there will be double counting
If some releases from the publisher have a missing or incorrect ocid prefix they will not be counted in the results

To restrict my results to a specific load of a specific publisher's data (and to catch all data from that load) I need to do lots of joins:

SELECT
  items -> 'classification' ->> 'scheme' AS scheme,
  COUNT(*) AS totalCount
FROM
  data
CROSS JOIN
  jsonb_array_elements(data -> 'tender' -> 'items') AS items
JOIN
  release ON data.id = release.data_id
JOIN
  source_session_file_status ON release.source_session_file_status_id = source_session_file_status.id
JOIN
  source_session ON source_session_file_status.source_session_id = source_session.id
WHERE
  source_id = 'australia' AND data_version = '2018-06-19-15-11-16'
GROUP BY
  scheme
ORDER BY
  totalCount DESC;

Is there an easier way?

odscjames commented 6 years ago

If there are multiple loads of the publisher's data in the database (can that happen?

Yes, that can happen.

I can't think of a way easier than the joins immediately but I'll have a think.

robredpath commented 6 years ago

I can't think of a way easier than the joins immediately but I'll have a think.

SELECT into a temporary table?

odscjames commented 6 years ago

Could be very very big. Maybe Database Views would make it easy for people to work with while still keeping decent performance?

kindly commented 6 years ago

I do not think you can do this any more simply.

In the original schema design discussion it was argued that the release/record tables should have more de-normalization happen i.e to have source_id, data_version, sample and maybe a few other useful variables copied over when creating the table to mitigate too many joins.

I am happy to write a migration and update the code to do this if people are happy with this.

odscjames commented 6 years ago

I'm not sure that's needed. Even if we do need to de-normalize, we only need to de-normalize the source_session_id value to achieve the user requirement specified here. I'd like to try views before we try making DB structure changes.

jpmckinney commented 6 years ago

Instead of querying with direct SQL, what if analysts could open a Python shell and query using the ORM? One of the purposes of ORMs is to simplify the construction of queries.

If we pursue that, we could add a shell subcommand (similar to Django), to drop people into a Python shell (with perhaps some default classes already loaded for querying). (We can also add a dbshell subcommand to load a PostgreSQL session without specifying the database, user, etc.)

odscjames commented 5 years ago

Closing as moved to https://github.com/open-contracting/kingfisher-process/issues/46

open-contracting-archive / kingfisher-vagrant

User story: As an analyst I want to query data from a specific publisher #131