duncandewhurst closed this issue 5 years ago
If there are multiple loads of the publisher's data in the database (can that happen?)
Yes, that can happen.
I can't immediately think of an easier way than the joins, but I'll have a think.
SELECT into a temporary table?
That could be very, very big. Maybe database views would make it easy for people to work with while still keeping decent performance?
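To make the view idea concrete, here is a minimal sketch. It runs against an in-memory SQLite database so that it is self-contained; the `CREATE VIEW` statement itself is standard SQL and should carry over to PostgreSQL. The table and column names, the `release_with_session` view, the `example_source` source and the `ocds-xxxxxx` prefix are all assumptions made for illustration, not the actual Kingfisher schema or a real publisher prefix.

```python
import sqlite3

# Hypothetical, simplified schema. Table and column names are assumptions
# for illustration only, not the real Kingfisher tables.
SCHEMA = """
CREATE TABLE source_session (
    id INTEGER PRIMARY KEY,
    source_id TEXT,
    data_version TEXT
);
CREATE TABLE source_session_file_status (
    id INTEGER PRIMARY KEY,
    source_session_id INTEGER REFERENCES source_session (id)
);
CREATE TABLE release (
    id INTEGER PRIMARY KEY,
    source_session_file_status_id INTEGER
        REFERENCES source_session_file_status (id),
    ocid TEXT
);
-- The view hides the joins, so a query can filter on one load directly.
CREATE VIEW release_with_session AS
SELECT r.id, r.ocid, ss.id AS source_session_id, ss.source_id, ss.data_version
FROM release r
JOIN source_session_file_status fs ON fs.id = r.source_session_file_status_id
JOIN source_session ss ON ss.id = fs.source_session_id;
"""


def build_demo_db():
    """Create an in-memory database holding two loads of the same source."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO source_session VALUES (?, ?, ?)",
        [(1, "example_source", "load-1"), (2, "example_source", "load-2")],
    )
    conn.executemany(
        "INSERT INTO source_session_file_status VALUES (?, ?)",
        [(10, 1), (20, 2)],
    )
    # Each load has two releases with the expected (made-up) prefix and one
    # release whose ocid prefix is wrong.
    conn.executemany(
        "INSERT INTO release VALUES (?, ?, ?)",
        [
            (100, 10, "ocds-xxxxxx-1"),
            (101, 10, "ocds-xxxxxx-2"),
            (102, 10, "wrong-prefix-3"),
            (200, 20, "ocds-xxxxxx-1"),
            (201, 20, "ocds-xxxxxx-2"),
            (202, 20, "wrong-prefix-3"),
        ],
    )
    return conn


def count_by_prefix(conn, prefix):
    """Count releases by ocid prefix, across every load in the database."""
    row = conn.execute(
        "SELECT count(*) FROM release WHERE ocid LIKE ?", (prefix + "%",)
    ).fetchone()
    return row[0]


def count_by_session(conn, source_session_id):
    """Count releases in one load, using the view instead of explicit joins."""
    row = conn.execute(
        "SELECT count(*) FROM release_with_session WHERE source_session_id = ?",
        (source_session_id,),
    ).fetchone()
    return row[0]
```

With this demo data, `count_by_prefix(conn, "ocds-xxxxxx")` counts releases from both loads and skips the ones with a wrong prefix, while `count_by_session(conn, 1)` counts exactly the releases in the first load. Whether a view performs well enough on the real tables would still need to be checked.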
I do not think you can do this any more simply.
In the original schema design discussion, it was argued that the release/record tables should be more de-normalized, i.e. that source_id, data_version, sample and maybe a few other useful variables should be copied over when creating the table, to avoid too many joins.
I am happy to write a migration and update the code to do this if people agree.
I'm not sure that's needed. Even if we do need to de-normalize, we only need to de-normalize the source_session_id value to achieve the user requirement specified here. I'd like to try views before we try making DB structure changes.
Instead of querying with direct SQL, what if analysts could open a Python shell and query using the ORM? One of the purposes of ORMs is to simplify the construction of queries.
If we pursue that, we could add a shell subcommand (similar to Django) to drop people into a Python shell (perhaps with some default classes already loaded for querying). We could also add a dbshell subcommand to open a PostgreSQL session without specifying the database, user, etc.
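As a rough sketch of what the two subcommands could look like, using only the standard library: the program name, the connection defaults and the `namespace` argument below are hypothetical, and a real implementation would read the connection settings from the tool's own configuration and preload its ORM classes.

```python
import argparse
import code
import os

# Hypothetical connection settings; a real command would load these from
# the application's configuration rather than hard-coding them.
DEFAULT_DB = {
    "host": "localhost",
    "port": 5432,
    "user": "example_user",
    "dbname": "example_db",
}


def psql_argv(db):
    """Build the psql command line for the configured database."""
    return [
        "psql",
        "-h", db["host"],
        "-p", str(db["port"]),
        "-U", db["user"],
        "-d", db["dbname"],
    ]


def build_parser():
    """Create a parser with 'shell' and 'dbshell' subcommands."""
    parser = argparse.ArgumentParser(prog="example-cli")
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("shell", help="open a Python shell with preloaded names")
    subparsers.add_parser("dbshell", help="open psql with the configured settings")
    return parser


def main(argv=None, namespace=None):
    """Dispatch to the chosen subcommand.

    'namespace' is a dict of names to preload into the Python shell, for
    example ORM model classes and a database session (not defined here).
    """
    args = build_parser().parse_args(argv)
    if args.command == "shell":
        code.interact(banner="Example query shell", local=dict(namespace or {}))
    elif args.command == "dbshell":
        # Replace the current process with psql, so the user lands in a
        # normal PostgreSQL session.
        os.execvp("psql", psql_argv(DEFAULT_DB))
```

Calling `main(["shell"], namespace={...})` would open an interactive interpreter with the given names available, and `main(["dbshell"])` would hand over to `psql` using the configured host, port, user and database.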
Closing as moved to https://github.com/open-contracting/kingfisher-process/issues/46
For example, I want to understand which item classification schemes are used in the data published by NSW, Australia.
I can do that by using a WHERE clause with the publisher's ocid prefix:
However, that approach has two issues:
If there are multiple loads of the publisher's data in the database (can that happen?), there will be double counting.
If some releases from the publisher have a missing or incorrect ocid prefix, they will not be counted in the results.
To restrict my results to a specific load of a specific publisher's data (and to catch all data from that load), I need to do lots of joins:
Is there an easier way?