open-contracting / standard

Documentation of the Open Contracting Data Standard (OCDS)
http://standard.open-contracting.org/

Featuring bulk spreadsheet and JSON downloads more prominently in the standard documentation #871

duncandewhurst opened this issue 5 years ago (status: Open)

duncandewhurst commented 5 years ago

Many users ultimately wish to work with OCDS data in spreadsheet format, but they struggle to download JSON data from an API and combine it into a single release package that can be converted to spreadsheet format.

Providing bulk spreadsheet downloads at source can help to address this need, so we should consider how to better emphasise this in the OCDS documentation.

Separately, we can also consider how to highlight the importance of bulk spreadsheet data more in templates, implementation resources and feedback to publishers.

cc @LindseyAM @juanpane @romifz @sdopoku

jpmckinney commented 5 years ago

Some considerations are:

  1. Many of the same users, in my experience, don't want to or can't use spreadsheets with 20+ tabs
  2. Likewise, those users don't want to or can't use spreadsheets with hundreds of columns
  3. An Excel worksheet is limited to 1,048,576 rows

Regarding Excel limitations, publishers can presumably export Excel files in batches of 1M releases at a time. However, this can lead to an unworkable number of spreadsheets for these users.
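
For illustration, a minimal sketch of that batching approach, assuming the releases have already been flattened into one row each (the rows iterable, field names and file names are hypothetical):

import csv
from itertools import islice

EXCEL_ROW_LIMIT = 1_048_576  # maximum rows per Excel worksheet

def write_batches(rows, fieldnames, prefix="releases"):
    # Write an iterable of flat dicts (one per release) to numbered CSV files,
    # each small enough to open in Excel: header row + data rows stay under the limit.
    rows = iter(rows)
    batch_size = EXCEL_ROW_LIMIT - 1
    batch_number = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        batch_number += 1
        with open(f"{prefix}-{batch_number:03d}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(batch)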

Regarding an appropriate spreadsheet format, I suppose the ideal would be an interface through which the user can (1) filter the dataset; (2) select the fields they want to export (identified using local terms to avoid / postpone having to learn OCDS terms); (3) identify which groups of fields are okay to split into separate sheets; and (4) download the spreadsheet that meets their needs. However, that's a lot to ask each publisher to provide…
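
As a rough illustration of that interface's core logic (not an existing tool; the field labels, filters and sheet groupings below are hypothetical placeholders, and the data is assumed to be already flattened into a pandas DataFrame):

import pandas as pd

def export_spreadsheet(df, local_labels, filters, selected_fields, sheet_groups,
                       path="export.xlsx"):
    # df: flattened compiled releases, one row per process (assumed input)
    # local_labels: e.g. {"tender/value/amount": "Monto del proceso"}
    # filters: e.g. {"tender/procurementMethod": "open"}
    # selected_fields: the OCDS paths the user wants to export
    # sheet_groups: e.g. {"Montos": [...], "Fechas": [...]}, one list of fields per sheet

    # (1) filter the dataset
    for column, value in filters.items():
        df = df[df[column] == value]

    # (2) keep only the selected fields, shown under local labels
    df = df[selected_fields].rename(columns=local_labels)

    # (3) split agreed groups of fields into separate sheets and (4) write the file
    with pd.ExcelWriter(path) as writer:
        for sheet_name, fields in sheet_groups.items():
            labels = [local_labels.get(f, f) for f in fields]
            columns = [c for c in labels if c in df.columns]
            df[columns].to_excel(writer, sheet_name=sheet_name, index=False)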

A market-oriented solution would be: a company creates the service described above; that company prescribes to publishers (1) how to publish bulk data for import into its systems and (2) how to annotate OCDS terms with local terms (e.g. make a local translation of the OCDS schema and codelists); finally, that company charges the publisher a fee for making their data more accessible.

A self-service solution would be: the software for the service is authored; instructions are given to publishers on how to deploy, translate and adapt the service, including (1) how to connect or import their OCDS data into it and (2) how to annotate OCDS terms as before. Each publisher then operates the service. It's not clear who pays for the creation, maintenance and improvement of the software.

For all options, I'm not confident that publishers will be willing to pay for the service, deploy the service, or build their own version…

This seems like a hard problem to solve. Bulk spreadsheet downloads (without the above interfaces) might solve the problem for some users, but I'm concerned that they will still be unusable by most users we're trying to help with this issue (too many sheets, too many columns, too many rows).

jpmckinney commented 5 years ago

An alternative is for the spreadsheets to not contain all the OCDS data, for example: only compiled releases, containing only prioritized fields (determined in some way). This won't satisfy all users, but I think it will satisfy more than spreadsheets with all the data.
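
For illustration, a minimal sketch of that subset export, assuming compiled releases parsed from JSON into Python dicts and a hypothetical list of prioritized field paths:

import csv

# Hypothetical priority list; a real one would be agreed with data users.
PRIORITY_FIELDS = [
    "ocid",
    "tender/title",
    "tender/value/amount",
    "tender/value/currency",
    "awards/0/date",
    "awards/0/value/amount",
]

def get_path(release, path):
    # Follow a slash-separated path (e.g. "awards/0/date") through nested
    # dicts and lists, returning None if any step is missing.
    value = release
    for part in path.split("/"):
        if isinstance(value, list):
            try:
                value = value[int(part)]
            except (ValueError, IndexError):
                return None
        elif isinstance(value, dict):
            value = value.get(part)
        else:
            return None
    return value

def write_priority_csv(compiled_releases, path="priority-fields.csv"):
    # One row per compiled release, one column per prioritized field.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(PRIORITY_FIELDS)
        for release in compiled_releases:
            writer.writerow([get_path(release, p) for p in PRIORITY_FIELDS])

The result is a single, narrow table; users who need the full data can still fall back to the JSON.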

LindseyAM commented 5 years ago

Thanks James.

Honduras is nice because they seem to offer both JSON and CSV bulk download options. UK is so confusing if you don't know how to use APIs.

It would be interesting to document how our current publishers are serving up their data and what features make it more or less beneficial. Not for all publishers necessarily but perhaps to highlight 'good practice'.

Curious if @yolile & @romifz have thoughts on what guidance we should put in the documentation to help governments make the data more easily 'usable'?

yolile commented 5 years ago

@jpmckinney another solution could be for publishers to publish complete Excel files per year, as Honduras did, and then users just remove the columns and filter out the rows that they don't want to use.

Curious if @yolile & @romifz have thoughts on what guidance we should put in the documentation to help governments make the data more easily 'usable'?

@LindseyAM I think that maybe we can describe the different uses that the data can have and, based on those, recommend publication options, something like:

romifz commented 5 years ago

I think James' ideas are really useful, and we could include the 'compiled releases to Excel/CSV' option as a recommendation for publishers (it shouldn't be difficult for most of them to implement). I even think that removing historical data may be beneficial for users who only want to do basic analysis or calculate basic statistics on the data.

Also, Honduras' Excel/CSV publication works because they use sensible criteria to split the data (year + source system), so maybe we can also recommend similar criteria to avoid exceeding Excel's size limitations.
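
For illustration, a minimal sketch of that splitting rule, assuming one row per compiled release and a hypothetical source-system field:

from collections import defaultdict

EXCEL_ROW_LIMIT = 1_048_576

def partition_releases(compiled_releases, source_field="sourceSystem"):
    # Group compiled releases by (year, source system), as in the Honduras
    # publication, so each group can become its own Excel/CSV file.
    groups = defaultdict(list)
    for release in compiled_releases:
        year = (release.get("date") or "")[:4] or "unknown"
        source = release.get(source_field, "unknown")  # hypothetical field name
        groups[(year, source)].append(release)

    # Flag groups that would not fit in a single Excel worksheet; those would
    # need a finer split (e.g. year + month).
    too_big = [key for key, items in groups.items() if len(items) >= EXCEL_ROW_LIMIT]
    return groups, too_big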

yolile commented 5 years ago

I think James' ideas are really useful, and we could include the 'compiled releases to Excel/CSV' option as a recommendation for publishers (it shouldn't be difficult for most of them to implement). I even think that removing historical data may be beneficial for users who only want to do basic analysis or calculate basic statistics on the data.

I think it depends, again, on the use case. Some users may want to use the historical data to see how the amounts, dates, etc. of a process changed over time, while others just do basic analysis. So the recommendation can again indicate which option is useful for which use case.

jpmckinney commented 5 years ago

@LindseyAM wrote:

It would be interesting to document how our current publishers are serving up their data and what features make it more or less beneficial. Not for all publishers necessarily but perhaps to highlight 'good practice'.

The problem isn't that we don't know what is good practice; we already have our own guidance on bulk downloads, and there is much more out there about open data bulk downloads in general that we can integrate or link to. The problem is that many publishers don't follow our guidance or the guidance offered by others.

Going back to @duncandewhurst's issue description, I think we can, at a minimum, emphasize and repeat the need to follow these practices. As part of this, we can include good examples from existing publishers. The developers involved in Kingfisher spiders (including @romifz and @yolile in this issue) can share which publishers were easier or harder to write spiders for.

We can also improve our guidance by adding examples of how different practices serve different use cases, as @yolile suggested.

However, unless we improve the rate at which publishers follow our existing guidance (which would already resolve many issues), I don't think we should spend much time adding even more practices.

That is my suggestion for this issue.


That said, my earlier comments are still relevant to large datasets (in number of releases and/or number of fields); many approaches become unusable beyond a certain size. HonduCompras 1 has only 370,297 processes, 426,463 releases (less than 75,000 per year), and 173 fields. HonduCompras' interface works because it has a small number of releases per year and a small number of fields. If it had millions of releases and hundreds more fields, it would run into the UX issues I mentioned.

With respect to @LindseyAM's issue, UK Contracts Finder has even less data, so there is no reason they can't offer the same interface as HonduCompras. UK does have bulk CSV downloads, but they are split by day (example), which is good for users who use it daily, but not for others. Contracts Finder also doesn't link to its API documentation, which is an obvious issue.

Datasets with millions of releases are: Colombia, Digiwhist Romania, Digiwhist France, Ukraine (when it was available). Uruguay (historical) and Digiwhist Portugal are approaching 1 million.


Redash queries I used:

-- Number of compiled releases (one per process) in collection 450
SELECT COUNT(*)
FROM compiled_release_with_collection
WHERE collection_id = 450;

-- Compiled releases per year (first four characters of the release date)
SELECT COUNT(*), LEFT(data.data->>'date', 4) AS d
FROM compiled_release_with_collection
INNER JOIN data ON data.id = data_id
WHERE collection_id = 450
GROUP BY d;

-- Number of releases in collection 448
SELECT COUNT(*)
FROM release_with_collection
WHERE collection_id = 448;

-- Releases per year (first four characters of the release date)
SELECT COUNT(*), LEFT(data.data->>'date', 4) AS d
FROM release_with_collection
INNER JOIN data ON data.id = data_id
WHERE collection_id = 448
GROUP BY d;

-- Number of distinct field paths in collection 450
SELECT COUNT(DISTINCT(path)) FROM views.field_counts WHERE collection_id = 450;

LindseyAM commented 5 years ago

Thanks! Just to clarify: when I was musing on highlighting which OCDS publishers are following good practice, it wasn't so that we would know what good practice is; it was so that we could call out good and bad practice among our publishers -- but I acknowledge that the standard documentation isn't the place to do that, so I was probably musing in the wrong place.

jpmckinney commented 5 years ago

We can provide examples in the guidance part of the documentation. There's a risk that links break, etc. but if we arrive at our desired destination of being able to update guidance in quick iterations, then that risk shouldn't be an issue.