open-contracting / ocdskit

A suite of command-line tools for working with OCDS data
https://ocdskit.readthedocs.io
BSD 3-Clause "New" or "Revised" License
17 stars 7 forks source link

tabulate: Code review / Alternative tools #75

Closed jpmckinney closed 3 years ago

jpmckinney commented 5 years ago

Unless we still use it, we might alternately want to delete it, as kingfisher-process is the preferred option for loading OCDS data into a database.

jpmckinney commented 5 years ago

@duncandewhurst @yolile Is there any impact to deleting the tabulate command from OCDS Kit?

duncandewhurst commented 5 years ago

I'm not aware of anyone using it right now, but we have shared it via helpdesk feedback recently and via blogs and the tools directory previously.

If I remember correctly, tabulate fully flattens the data into a relational database, suitable for analysis by anyone with basic SQL skills, whereas kingfisher stores each release as a JSON blob, analysis of which which requires knowledge of postgresql JSON operators.

So I think they serve slightly different audiences

jpmckinney commented 5 years ago

Yeah – the question is whether there is a real audience for it.

I'll keep it for now. If the code review raises a lot of concerns, I might switch to recommending flatten-tool to convert to CSV and then csvkit's csvsql to load the CSV into a database.

jpmckinney commented 4 years ago

We should compare to https://github.com/better/jsonschema2db. If that other repo does some or all of what we need, then we can reduce the scope of the tabulate command.

jpmckinney commented 3 years ago

@duncandewhurst @yolile

I am not sure the effort to improve and maintain this command is worth it.

  1. There are a number of issues (#142 #174 #175).
  2. No one is known to be using it (per above comments).
  3. With https://ocds-downloads.opendata.coop, users can download a SQLite or PostgreSQL dump, which has some additional useful columns and foreign keys.

So, I am in favor of removing it, and removing any mentions in blogs, support materials, etc.

yolile commented 3 years ago

Maybe we can hold this until the new flatten tool and/or the data registry are ready? In CRM-7149 Duncan prepared a training material that uses this command, and Camila used it too in other materials and training as it is easier to use, for some cases than the flatten tool or kingfisher-process + summarize.

jpmckinney commented 3 years ago

Can you link to Camila's resources?

The new Flatten Tool and Data Registry do not have SQL outputs.

For the general case, Flatten Tool (new or old) + csvsql is a two-step process to get data into a database.

We need to be really honest about how useful a given tool is. There is not the time or resources to support every tool that shows some minimal level of use. (Use in only training events is not enough.)

Camilamila commented 3 years ago

Hi James! I used it recently to write this tutorial, i found it simpler to use for users who want to transform the json into a database format and work with the data in R or python, instead of having them to use kingfisher.

duncandewhurst commented 3 years ago

I can easily replace tabulate with https://ocds-downloads.opendata.coop/ in the tutorial I prepared, which will also significantly shorten the sections on accessing and preparing data, so this sounds good to me.

jpmckinney commented 3 years ago

Definitely no regular user should use Kingfisher Process :) It is a tool for continuous, massive, high-performance operations, not one-time operations. But I think there are simple, two-step alternatives:

  1. Convert JSON data to tabular format using Spoonbill or Flatten Tool
  2. Load the CSVs into a database using SQL COPY, csvsql, pandas, etc.

For your RPubs tutorial, SQLite directly supports importing CSVs. There are also many Python packages that can import CSV to SQLite. sqlite-utils seems most popular and offers a CLI.

jpmckinney commented 3 years ago

I can easily replace tabulate with https://ocds-downloads.opendata.coop/ in the tutorial I prepared, which will also significantly shorten the sections on accessing and preparing data, so this sounds good to me.

Yes, since the purpose of the training is to analyze data, please go ahead and simplify those sections.

I'll remove this command. It will still be available in pre-1.0 versions if people really want it.

I've updated all notebooks (Kloop, FollowTaxes, AusTender) to install ocdskit<1 so nothing breaks. I've removed the related bullet point from OCDS Kit's resource on the OCP website.

I've also created a CRM issue to update the OCDS Kit Tutorial, next time it is revised.

jpmckinney commented 3 years ago

Closed by b5a480c