rOpenGov / pxweb

R tools to access PX-WEB API
http://ropengov.github.io/pxweb

Address RDA Working Group on Dynamic Data Citation (WGDC) recommendations #266

Open pitkant opened 1 year ago

pitkant commented 1 year ago

Executive summary:

Background information:

The Research Data Alliance Data Citation WG has listed 14 recommendations on reproducibly subsetting datasets and on how to cite, share and re-use these subsets.

While data retrieved from PxWeb APIs may not be as dynamic as other kinds of data, it still changes occasionally (see the stat.fi news page), and there are some nice recommendations that could at least be acknowledged and, if possible, also implemented.

Here is a list of the recommendations:

| Task | Status | Viability |
|---|---|---|
| R1 Data Versioning | | Data versioning is not supported by PxWeb |
| R2 Timestamping | | Timestamping dataset changes so that querying past datasets would be possible is not supported by PxWeb |
| R3 Query Store Facilities | | Some PxWeb database websites have a "Save your query" menu, but it does not include all the data that WGDC recommends it should have |
| R4 Query Uniqueness | | pxweb interactive constructs queries in a normalised form; it could also calculate an MD5 hash of the query |
| R5 Stable Sorting | | Dataset sorting is determined by the sorting of the raw data on the server |
| R6 Result Set Verification | | Fixity key for downloaded datasets; could be done with digest(dataset, algo = "md5") |
| R7 Query Timestamping | Done | Could also refer to the dataset's date of last update |
| R8 Query PID | | Assign a DOI, ARK, or similar PID to a unique query |
| R9 Store the Query | Done | Refers to R3 "facilities", but the query is printed by pxweb and can be put into article appendices |
| R10 Automated Citation Texts | Done | |
| R11 Landing Page | | Currently the citation links to the .px dataset; proper landing pages with documentation might not be available for all databases. Stat.fi has a "Statistics homepage" for most (all?) datasets / topics |
| R12 Machine Actionability | | Link to metadata landing page or JSON file |
| R13 Technology Migration | | Responsibility of API / db maintainers |
| R14 Migration Verification | | Compare fixity (hash) information of queries and outputs and see if they are identical |

Recommendations are grouped as follows: R1-3 "Preparing the Data and the Query Store", R4-10 "Persistently Identifying Specific Data Sets", R11-12 "Resolving PIDs and Retrieving the Data" and R13-14 "Upon modifications to the Data Infrastructure".

Especially interesting, in my opinion, would be to integrate the calculation of query and downloaded dataset hashes (R4, R6) and to store them somewhere alongside the other citation data.
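To make the idea concrete, here is a minimal sketch in Python of what computing both hashes and storing them next to the citation data could look like. The function names, the variable codes in the query and the data rows are all hypothetical and only serve as illustration; this is not how pxweb does it today. Note also that these hashes would not match the output of digest(dataset, algo = "md5") in R, since digest hashes a serialised R object by default.

```python
import hashlib
import json
from datetime import datetime, timezone


def query_md5(query: dict) -> str:
    """MD5 of a query serialised as compact JSON with sorted object keys."""
    canonical = json.dumps(query, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def dataset_md5(rows) -> str:
    """MD5 over the rows in the order given; reordering rows changes the hash."""
    digest = hashlib.md5()
    for row in rows:
        digest.update(json.dumps(list(row), ensure_ascii=False).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def citation_record(url: str, query: dict, rows, accessed=None) -> dict:
    """Bundle the fixity information with the usual citation fields."""
    accessed = accessed or datetime.now(timezone.utc)
    return {
        "url": url,
        "accessed": accessed.isoformat(),
        "query_md5": query_md5(query),
        "dataset_md5": dataset_md5(rows),
    }


# Hypothetical query and rows: the variable codes and values are made up.
example_query = {
    "query": [
        {"code": "Year", "selection": {"filter": "item", "values": ["2021", "2022"]}},
    ],
    "response": {"format": "json"},
}
example_rows = [("2021", 100), ("2022", 110)]

record = citation_record(
    "https://statfin.stat.fi/PXWeb/api/v1/en/StatFin/ava/statfin_ava_pxt_12ad.px",
    example_query,
    example_rows,
)
print(json.dumps(record, indent=2))
```

The record could then be written to a file next to the downloaded data or added to the note field of the citation.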

Additionally, R12 could be somewhat achieved by changing the URL in the following citation

  @Misc{,
    title = {Foreign languages selected by upper secondary level students by Year, Area, Gender, Level of education and Information},
    author = {{Statistics Finland}},
    organization = {Statistics Finland},
    address = {Helsinki, Finland},
    year = {2023},
    url = {https://statfin.stat.fi/PXWeb/api/v1/en/StatFin/ava/statfin_ava_pxt_12ad.px},
    note = {[Data accessed 2023-06-14 14:20:20.456548 using pxweb R package 0.16.3]},
  }

to simply https://stat.fi/en/statistics/ava, which is the closest equivalent to a landing page. I'm not sure if this URL is accessible from the API, but it is listed at least in a separate CSV file: https://statfin.stat.fi/database/StatFin/StatFin_rap.csv
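As a rough illustration of the URL swap, the Python sketch below derives the English landing page from the API URL and replaces the url field in the BibTeX entry. It assumes that the folder name before the .px file (here "ava") is also the topic code used on stat.fi, which is a pattern I inferred from this one example and not a documented rule; both function names are hypothetical. Looking the URL up in the CSV file would be the more reliable route.

```python
import re
from urllib.parse import urlparse


def english_landing_page(api_url: str) -> str:
    """Guess the stat.fi landing page from a StatFin API table URL.

    Assumption: the path segment just before the .px file is the topic code.
    """
    segments = [s for s in urlparse(api_url).path.split("/") if s]
    if len(segments) < 2 or not segments[-1].endswith(".px"):
        raise ValueError(f"Not a .px table URL: {api_url}")
    topic = segments[-2]
    return f"https://stat.fi/en/statistics/{topic}"


def replace_bibtex_url(bibtex: str, new_url: str) -> str:
    """Replace the contents of the first url = {...} field in a BibTeX entry."""
    return re.sub(r"url = \{[^}]*\}", lambda _: "url = {" + new_url + "}",
                  bibtex, count=1)


api_url = "https://statfin.stat.fi/PXWeb/api/v1/en/StatFin/ava/statfin_ava_pxt_12ad.px"
entry = "@Misc{,\n  year = {2023},\n  url = {" + api_url + "},\n}"

print(replace_bibtex_url(entry, english_landing_page(api_url)))
```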

R4 and R5 are more or less done if you use pxweb_interactive(), as the order in which items are printed is deterministic. If the order of the query printout or of the dataset items changes in any way, the MD5 hashes change as well.
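The order sensitivity is easy to demonstrate. The Python sketch below hashes two queries that differ only in the order of their selection clauses and then hashes them again after sorting the clauses by variable code. The normalisation step rests on the assumption that clause order does not affect the result returned by the API, which I have not verified; the function names and variable codes are hypothetical.

```python
import hashlib
import json


def md5_of(obj) -> str:
    """MD5 of an object serialised as compact JSON with sorted object keys."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def normalise_query(query: dict) -> dict:
    """Return a copy of the query with its selection clauses sorted by code.

    The order of the values inside each clause is left untouched, because
    it may well determine the order of the returned data.
    """
    clauses = sorted(query["query"], key=lambda clause: clause["code"])
    return {**query, "query": clauses}


# Two hypothetical queries with the same clauses in a different order.
year = {"code": "Year", "selection": {"filter": "item", "values": ["2021", "2022"]}}
area = {"code": "Area", "selection": {"filter": "item", "values": ["SSS"]}}
query_a = {"query": [year, area], "response": {"format": "json"}}
query_b = {"query": [area, year], "response": {"format": "json"}}

print(md5_of(query_a) == md5_of(query_b))
print(md5_of(normalise_query(query_a)) == md5_of(normalise_query(query_b)))
```

The first comparison is False and the second is True, so hashing only makes sense as a uniqueness check once the query has been brought into one agreed form.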

The recommendations are, I think, most useful for PxWeb database maintainers and PxWeb developers at SCB, but we could do our own part by thinking about solutions to them.

MansMeg commented 1 year ago

These are really good ideas!

Landing pages are good, but we should request that as a feature from the PxWeb people because it is not part of the API. We want to avoid handling information specific to individual APIs. Long term, I think we should probably remove the API catalogue and just refer to the pxweb list of available APIs.

pitkant commented 1 year ago

By "pxweb list of available APIs", do you mean this list: https://www.scb.se/en/services/statistical-programs-for-px-files/px-web/pxweb-examples/ ?

As I mentioned in #254, there are some broken APIs listed there (Taiwan, Örebro kommun) and there are several APIs that are not listed there. Therefore SCB's list does not seem to be a definitive one.


I compared the same .px and .json files downloaded from the stat.fi example page and noticed that the .px files actually include more metadata than the .json files. An example of this is the statistics homepage:

CONTACT[en]("Enterprise openings (No.)")="<A HREF='https://stat.fi/en/statistics/aly' TARGET=_blank>Statistics' homepage</A>";

and a note that may or may not be of interest to the data user:

NOTE[en]="<A HREF='https://stat.fi/en/statistics/documentation/aly' TARGET=_blank>Documentation of statistics</A>##.. not applicable#.. not applicable###Due to a methodological change in the source data, the number of enterprise closures and the size of the stock "
"of enterprises have not been published for the last three quarters of 2017. # #The stock of enterprises cannot be aggregated over time periods.";

which is essentially the same information that is displayed on the PxWeb database web interface.

The .px file format seems to be relatively simple, and reading it would probably be easy to implement, especially if it is only used to extract certain types of metadata that are not included in .json files. While this has traditionally been left out of the scope of this package, I think adding the possibility of downloading more metadata in the form of .px files would be useful. Additionally, there are some reports of JSON-stat / JSON-stat 2 output being erroneous compared to .px output (statisticssweden/PxWeb#387).
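To give an idea of the effort involved, here is a minimal Python sketch that pulls text-valued keywords such as CONTACT and NOTE out of .px content and extracts the links from them. It only handles statements whose value is one or more quoted strings (continuation lines are joined), so it is nowhere near a full .px parser; the function names are hypothetical and the sample is a shortened version of the lines quoted above.

```python
import re

# KEYWORD, optional [lang], optional ("subkey"), then quoted strings up to ';'.
STATEMENT = re.compile(
    r'^(?P<keyword>[A-Z][A-Z0-9-]*)'
    r'(?:\[(?P<lang>[A-Za-z-]+)\])?'
    r'(?:\("(?P<subkey>[^"]*)"\))?'
    r'\s*=\s*'
    r'(?P<value>(?:"[^"]*"\s*)+);',
    re.MULTILINE,
)


def parse_text_keywords(px_text: str) -> list:
    """Return keyword, language, subkey and joined text for each statement."""
    entries = []
    for match in STATEMENT.finditer(px_text):
        parts = re.findall(r'"([^"]*)"', match.group("value"))
        entries.append({
            "keyword": match.group("keyword"),
            "lang": match.group("lang"),
            "subkey": match.group("subkey"),
            "text": "".join(parts),
        })
    return entries


def extract_links(text: str) -> list:
    """Collect the targets of HREF='...' attributes embedded in the text."""
    return re.findall(r"HREF='([^']+)'", text)


SAMPLE = '''CONTACT[en]("Enterprise openings (No.)")="<A HREF='https://stat.fi/en/statistics/aly' TARGET=_blank>Statistics' homepage</A>";
NOTE[en]="<A HREF='https://stat.fi/en/statistics/documentation/aly' TARGET=_blank>Documentation of statistics</A>##.. not applicable "
"of enterprises.";
'''

for entry in parse_text_keywords(SAMPLE):
    print(entry["keyword"], entry["lang"], extract_links(entry["text"]))
```

Values that are comma-separated lists (for example VALUES or CODES) or unquoted numbers are skipped by this pattern and would need separate handling.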

The JSON-stat format allows for an extension property that can contain anything and, interestingly enough, at least the stat.fi JSON file has several extension properties. The extension property could also be used for storing a link to the statistics documentation (landing page) and any notes related to the dataset.
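A minimal Python sketch of what that could look like on the consumer side is below. The "citation" key and its fields are my own invention, not something stat.fi or the JSON-stat specification defines, and the dataset is a made-up two-value example.

```python
import copy
import json


def with_citation_extension(dataset: dict, landing_page: str, notes) -> dict:
    """Return a copy of a JSON-stat dataset with citation info under extension.

    Existing extension properties are kept; "citation" is a hypothetical key.
    """
    enriched = copy.deepcopy(dataset)
    extension = enriched.setdefault("extension", {})
    extension["citation"] = {"landing_page": landing_page, "notes": list(notes)}
    return enriched


# Made-up minimal JSON-stat 2.0 dataset, for illustration only.
dataset = {
    "version": "2.0",
    "class": "dataset",
    "id": ["Year"],
    "size": [2],
    "dimension": {"Year": {"category": {"index": ["2021", "2022"]}}},
    "value": [1, 2],
    "extension": {"px": {"tableid": "example"}},
}

enriched = with_citation_extension(
    dataset,
    "https://stat.fi/en/statistics/aly",
    ["The stock of enterprises cannot be aggregated over time periods."],
)
print(json.dumps(enriched["extension"], indent=2))
```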

EDIT: Actually it seems that PxApi 2.0 is coming out (at least to beta testing) in Autumn 2023 so maybe some of these changes will be implemented then: https://www.scb.se/en/services/open-data-api/pxapi-2.0/