ropensci / software-review

rOpenSci Software Peer Review.

dataset: Create Data Frames that are Easier to Exchange and Reuse #553

Open antaldaniel opened 1 year ago

antaldaniel commented 1 year ago

Submitting Author Name: Daniel Antal
Submitting Author Github Handle: @antaldaniel
Repository: https://github.com/dataobservatory-eu/dataset/
Version submitted: 0.1.7
Submission type: Standard
Editor: @annakrystalli
Reviewers: @msperlin, @romanflury

Due date for @msperlin: 2022-09-19
Due date for @romanflury: 2022-09-21

Archive: TBD
Version accepted: TBD
Language: en

Package: dataset
Title: Create Data Frames that are Easier to Exchange and Reuse
Date: 2022-08-19
Version: 0.1.7.3
Authors@R: 
    person(given = "Daniel", family = "Antal", 
           email = "daniel.antal@dataobservatory.eu", 
           role = c("aut", "cre"),
           comment = c(ORCID = "0000-0001-7513-6760")
           )
Description: The aim of the 'dataset' package is to make tidy datasets easier to release, 
    exchange and reuse. It organizes and formats data frame 'R' objects into well-referenced, 
    well-described, interoperable datasets into release and reuse ready form. A subjective 
    interpretation of the  W3C  DataSet recommendation and the datacube model  <https://www.w3.org/TR/vocab-data-cube/>, 
    which is also used in the global Statistical Data and Metadata eXchange standards, 
    the application of the connected Dublin Core <https://www.dublincore.org/specifications/dublin-core/dcmi-terms/> 
    and DataCite <https://support.datacite.org/docs/datacite-metadata-schema-44/> standards 
    preferred by European open science repositories to improve the findability, accessibility,
    interoperability and reusability of the datasets.
License: GPL (>= 3)
URL: https://github.com/dataobservatory-eu/dataset
BugReports: https://github.com/dataobservatory-eu/dataset/issues
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
Depends: 
    R (>= 2.10)
LazyData: true
Imports: 
    assertthat,
    ISOcodes,
    utils
Suggests: 
    covr,
    declared,
    dplyr,
    eurostat,
    here,
    kableExtra,
    knitr,
    rdflib,
    readxl,
    rmarkdown,
    spelling,
    statcodelists,
    testthat (>= 3.0.0),
    tidyr
VignetteBuilder: knitr
Config/testthat/edition: 3
Language: en-US

You can find the package website on dataset.dataobservatory.eu. The article Motivation: Make Tidy Datasets Easier to Release Exchange and Reuse will eventually be condensed into a JOSS paper. It has a major development dilemma.

Scope

This package is intended to give a common foundation to the rOpenGov reproducible research packages. It mainly serves communities that want to reuse statistical data (using the SDMX statistical (meta)data exchange sources, like Eurostat, IMF, World Bank, OECD...) or release new datasets from primary social sciences data that can be integrated into an SDMX compatible API or placed on a knowledge graph. Our main aim is to provide a clear publication workflow to the European open science repository Zenodo, and clear serialization strategies for RDF applications.

The dataset package aims for a higher level of reproducibility, and does not detach the metadata from the R object's attributes (it is intended for use in other reproducible research packages that will directly record provenance and other transactional metadata into the attributes). We aim to bind together dataspice and dataset by creating export functions to csv files that contain the same metadata that dataspice records. Generally, dataspice seems to be better suited to raw, observational data, while dataset is better suited to statistically processed data.
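To illustrate what keeping metadata in an R object's attributes can look like, here is a minimal base R sketch. It is not the dataset package's API: the helper names `add_metadata()` and `get_metadata()`, the attribute name `"dataset_meta"`, and the toy values are all hypothetical and chosen only for illustration.

```r
# Hypothetical helpers, NOT part of the 'dataset' package API.
# They store descriptive metadata in an attribute of the data frame itself,
# so the metadata travels with the object instead of living in a side file.
add_metadata <- function(df, title, creator, source = NULL) {
  stopifnot(is.data.frame(df))
  attr(df, "dataset_meta") <- list(
    title   = title,
    creator = creator,
    source  = source
  )
  df
}

# Read the metadata back; returns NULL if none was recorded.
get_metadata <- function(df) {
  attr(df, "dataset_meta", exact = TRUE)
}

# Toy data with made-up values, for illustration only.
toy <- data.frame(geo = c("AT", "BE"), value = c(1.5, 2.5))
toy <- add_metadata(toy, title = "Toy example", creator = "Jane Doe")

get_metadata(toy)$title
get_metadata(toy)$creator
```

A plain attribute like this is the simplest possible design; a dedicated S3 class on top of it would be one way to control how the metadata is printed and validated.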

The intended use of dataset is to start correctly recording referential, structural and provenance metadata retrieved by various reproducible science packages that interact with statistical data (such as the rOpenGov packages eurostat and iotables, or the oecd package).

Neither dataset nor dataspice is very suitable for documenting social sciences survey data, which are usually held in datasets. Our aim is to connect dataset, declared and DDIwR to create such datasets with DDI codebook metadata. They will create a stable new foundation for the retroharmonize package to create new, well-documented and harmonized statistical datasets from the observational datasets of social sciences surveys.

The zen4R package provides reproducible export functionality to the Zenodo open science repository. Interacting with zen4R may be intimidating for the casual R user as it uses R6 classes. Our aim is to provide an export function that completely wraps the workings of zen4R when releasing the dataset.

In our experience, while the tidy data standards make reuse more efficient by eliminating unnecessary data processing steps before analysis or placement in a relational database, the application of DataSet definition and the datacube model with the information science metadata standards make reuse more efficient with exchanging and combining the data with other data in different datasets.

Yes

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

ropensci-review-bot commented 1 year ago

:calendar: @romanflury you have 2 days left before the due date for your review (2022-09-21).

melvidoni commented 1 year ago

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/553#issuecomment-1250959076 time 8

ropensci-review-bot commented 1 year ago

Logged review for romanflury (hours: 8)

melvidoni commented 1 year ago

Dear all,

First of all, thanks @msperlin and @romanflury for your thoughtful reviews.

@antaldaniel I took the discussion of this package to the Associate Editors, and we arrived at the following decisions:

Regarding package names, as long as a package name is not offensive, you are allowed to name it whatever you want. Nevertheless, `dataset` is a very broad term that also refers to a type of data. The latter may make it especially difficult for less experienced R users, and thus a more descriptive/creative name could possibly bridge this gap.

I personally have a formal background in software engineering (meaning, I am a software engineer). At first, I thought your idea of 'dummy functions' was a nice workaround to create 'interfaces' (in the sense of object-oriented programming), something I understand R does not provide (yet). Nevertheless, the peer review process is not equipped to develop a consensus on common standards across packages.

You established that this is "a metadata package that is in an early development page, still has development questions open". Moreover, you have also mentioned other packages you used as inspiration multiple times. Do note that it is undesirable to look at packages that have been onboarded a long time ago, to justify their not meeting goals at submission. This is mostly because the onboarding rules evolved over time, and the standards we have now, may not resemble those of "early-stage-rOpenSci".

Therefore, the final decision was to pause the review process to give @antaldaniel time to contact the authors of other packages. Do note that such feedback would be outside of this review. Until then, the review will be marked as on hold in a following comment.

melvidoni commented 1 year ago

@ropensci-review-bot put on hold

ropensci-review-bot commented 1 year ago

Submission on hold!

antaldaniel commented 1 year ago

Thank you very much for the comments and the useful experience.

I created the following milestones based on this review.

  1. First CRAN release - all minor issues raised in this review are included here. I solved the smaller, non-conceptual issues and added @msperlin as a reviewer contributor to 0.1.8.
  2. rOpenGov review

As several functions from heavily used rOpenGov packages are moving to dataset, this will have to be done this year. It will establish basic usability, and it will at least lay out a blueprint for integration with rOpenGov packages, which can be useful for any data retrieval packages in rOpenSci, too.

For the enhanced usability of the packages, I created these rOpenSci relevant milestones, which are meant to conceptually review how the packages can work together.

  1. Dataset on knowledge graphs
  2. Review the integration with zen4R and dataspice.
  3. Dataset and open science releases

I expect this to be finished by the end of Q1 2023, when a draft concept paper will close this rather exploratory development phase.

We'd like to exchange data with statistical agencies and knowledge graphs on a large scale in the second half of 2023; I think that is when a first paper draft is due and the package will reach a more mature shape.

msperlin commented 1 year ago

Hi @antaldaniel,

Thanks. I appreciate the chance to contribute to making datasets better.

As for the future, it seems you still have some structural decisions to make and I'm sure you'll sort it out with time.

best,

melvidoni commented 1 year ago

Hello all, and thank you. @antaldaniel we will keep this on hold until Q1, once you have completed the package. Please, let us know by then.

annakrystalli commented 1 year ago

@ropensci-review-bot assign @annakrystalli as editor

ropensci-review-bot commented 1 year ago

Assigned! @annakrystalli is now the editor

antaldaniel commented 1 year ago

Hi @annakrystalli , just wanted to give a short update. The small changes suggested in this thread were implemented, and the early version of the package was released on CRAN. I am devising a 2-year development plan for the package and have a clear overview of planned milestones. When done, I will contact the other mentioned package owners/maintainers with this plan. With the main developers, who are not software engineers, but statisticians with statistical software development expertise, we will have a kick-off meeting in the last week of January.

annakrystalli commented 1 year ago

Ok great! Thanks for the update @antaldaniel

annakrystalli commented 1 year ago

Hello @antaldaniel ! Was wondering whether you had any updates on progress on the package?

antaldaniel commented 1 year ago

Hi @annakrystalli , there has been very little change, only in documentation; I have secured development funding and will publish a more detailed development concept and look for paid and volunteer contributors in the coming weeks. I would like to ask you what would be an excellent way to do so; apart from adding this as a vignette to this early-stage package, would it be possible to raise attention by a blog post or something similar?

annakrystalli commented 1 year ago

Great to hear you have secured development funding! You are always welcome to advertise on the rOpenSci slack, especially in the #jobs channel. Blog posts are always a good idea but the rOpenSci blog is reserved for promoting packages once they have completed review so wouldn't be appropriate at this stage.

antaldaniel commented 7 months ago

After a very long time, here is a conceptual working paper on the development with far more detailed specification than before, and some code ideas:

Making Datasets Truly Interoperable in R is a working paper to accompany the development of the package.

The working paper can be referenced with:

DOI

I am also looking for volunteer and potentially paid contributors to the package.

The source file is usually more recent: `dataset-working-paper.qmd`

annakrystalli commented 7 months ago

Thank you for the update @antaldaniel !

Good to hear you are making progress with the plans. Ultimately I feel the package will still remain on hold until it has been developed enough to be considered, if not ready, pretty close to release. That's when feedback from reviewers will be most useful, and it is also more aligned with what reviewers are expected to contribute their views on.

Let us know when you feel you have reached that stage!

antaldaniel commented 6 months ago

@annakrystalli I think that the review would be useful now, because I am implementing this working paper Making Datasets Truly Interoperable now. I just sent a new version to CRAN, but there is still room to review. Also, if somebody wants to get involved in the development, I do have a public grant for it and could take on a co-developer.

The new version (which is an entire rewrite since the first review) is on the dataset.dataobservatory.eu/ website with the connecting GitHub repo. I see a problem, though, with your CI attached to the package: it throws errors which to me look like configuration errors and not real errors, as the package builds fine on appveyor and r_hub.

ldecicco-USGS commented 4 months ago

@ropensci-review-bot check package

ropensci-review-bot commented 4 months ago

Thanks, about to send the query.

ropensci-review-bot commented 4 months ago

:rocket:

The following problem was found in your submission template:

:wave:

ropensci-review-bot commented 4 months ago

Checks for dataset (v0.3.1)

git hash: b1dca41e

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with :eyes: may be optionally addressed.)

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

|type       |package       | ncalls|
|:----------|:-------------|------:|
|internal   |base          |    312|
|internal   |dataset       |    178|
|internal   |graphics      |      6|
|imports    |assertthat    |     22|
|imports    |utils         |     11|
|imports    |stats         |     10|
|imports    |ISOcodes      |     NA|
|suggests   |dataspice     |     NA|
|suggests   |covr          |     NA|
|suggests   |declared      |     NA|
|suggests   |dplyr         |     NA|
|suggests   |eurostat      |     NA|
|suggests   |here          |     NA|
|suggests   |kableExtra    |     NA|
|suggests   |knitr         |     NA|
|suggests   |rdflib        |     NA|
|suggests   |readxl        |     NA|
|suggests   |rmarkdown     |     NA|
|suggests   |spelling      |     NA|
|suggests   |statcodelists |     NA|
|suggests   |testthat      |     NA|
|suggests   |tidyr         |     NA|
|suggests   |tibble        |     NA|
|suggests   |nycflights13  |     NA|
|suggests   |tsibble       |     NA|
|suggests   |data.table    |     NA|
|linking_to |NA            |     NA|

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats()', and examining the 'external_calls' table.

base

as.character (40), ifelse (40), is.null (38), list (30), c (16), data.frame (14), names (10), lapply (8), attr (7), paste0 (7), inherits (6), class (5), col (5), drop (4), invisible (4), seq_along (4), which (4), as.POSIXct (3), character (3), date (3), for (3), format (3), length (3), ncol (3), Sys.time (3), unlist (3), vapply (3), all (2), args (2), as.data.frame (2), as.numeric (2), dim (2), paste (2), rbind (2), round (2), substitute (2), t (2), url (2), with (2), apply (1), as.Date (1), cbind (1), comment (1), do.call (1), environment (1), get (1), if (1), max (1), nchar (1), new.env (1), range (1), rep (1), substr (1), switch (1), Sys.Date (1)

dataset

dataset_bibentry (28), dataset_title (10), dataset (8), rights (8), subject (8), creator (7), description (6), publisher (6), identifier (5), language (5), new_Subject (5), provenance (5), xsd_convert (5), DataStructure (4), convert_column (3), publication_year (3), as_bibentry (2), as_dublincore (2), dots_number (2), geolocation (2), get_type (2), getdata (2), idcol_find (2), is_person (2), is.dataset (2), provenance_add (2), related_item_identifier (2), size (2), subject_create (2), version (2), as_datacite (1), as_dataset (1), as_dataset.data.frame (1), datacite (1), dataset_download (1), dataset_download_csv (1), dataset_prov (1), dataset_title_create (1), dataset_to_triples (1), dataset_ttl_write (1), datasource_get (1), datasource_set (1), DataStructure_update (1), describe (1), describe.dataset (1), dublincore (1), get_prefix (1), get_resource_identifier (1), head.dataset (1), id_to_column (1), initialise_dsd (1), is.datacite (1), is.datacite.datacite (1), is.dublincore (1), is.dublincore.dublincore (1), is.subject (1), new_datacite (1), new_dataset (1), new_dublincore (1), old_function (1), print.dataset (1), related_item (1), set_var_labels (1), set_var_labels.dataset (1)

assertthat

assert_that (22)

utils

bibentry (3), data (2), person (2), citation (1), object.size (1), read.csv (1), tail (1)

stats

df (5), var (3), ar (1), family (1)

graphics

title (6)

**NOTE:** Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 38 files) and
- 1 authors
- 12 vignettes
- 3 internal data files
- 4 imported packages
- 81 exported functions (median 7 lines of code)
- 117 non-exported functions in R (median 13 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages

The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html)

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |    38|       92.7|           |
|files_vignettes          |    12|       99.6|           |
|files_tests              |    37|       98.6|           |
|loc_R                    |  1621|       79.9|           |
|loc_vignettes            |   805|       87.5|           |
|loc_tests                |   567|       77.3|           |
|num_vignettes            |    12|       99.9|TRUE       |
|data_size_total          |  3007|       64.7|           |
|data_size_median         |   578|       61.1|           |
|n_fns_r                  |   198|       89.7|           |
|n_fns_r_exported         |    81|       93.6|           |
|n_fns_r_not_exported     |   117|       86.6|           |
|n_fns_per_file_r         |     3|       55.0|           |
|num_params_per_fn        |     3|       33.6|           |
|loc_per_fn_r             |    11|       32.3|           |
|loc_per_fn_r_exp         |     7|       13.5|           |
|loc_per_fn_r_not_exp     |    13|       42.7|           |
|rel_whitespace_R         |    25|       85.3|           |
|rel_whitespace_vignettes |    36|       91.1|           |
|rel_whitespace_tests     |    28|       81.2|           |
|doclines_per_fn_exp      |    38|       47.0|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |   128|       83.0|           |

---

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

#### 3a. Continuous Integration Badges

[![pkgcheck](https://github.com/dataobservatory-eu/dataset/workflows/pkgcheck/badge.svg)](https://github.com/dataobservatory-eu/dataset/actions)
[![R-CMD-check.yaml](https://github.com/dataobservatory-eu/dataset/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/dataobservatory-eu/dataset/actions)

**GitHub Workflow Results**

|         id|name          |conclusion |sha    | run_number|date       |
|----------:|:-------------|:----------|:------|----------:|:----------|
| 7677839674|pkgcheck      |failure    |b1dca4 |        126|2024-01-27 |
| 7677839676|R-CMD-check   |failure    |b1dca4 |         46|2024-01-27 |
| 7677839673|test-coverage |failure    |b1dca4 |        129|2024-01-27 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following check_fail:

1. no_description_date

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 78.97

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following function have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
[[.dataset | 17

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 417 potential issues:

message | number of times
--- | ---
Avoid 1:length(...) expressions, use seq_len. | 1
Avoid 1:ncol(...) expressions, use seq_len. | 2
Avoid 1:nrow(...) expressions, use seq_len. | 3
Avoid library() and require() calls in packages | 23
Lines should not be more than 80 characters. | 384
unexpected symbol | 2
Use <-, not =, for assignment. | 2
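The lintr messages about `1:length(...)`, `1:ncol(...)` and `1:nrow(...)` point at a common edge case: when the object is empty, `1:0` counts down instead of producing an empty sequence, so a loop runs twice rather than not at all. A minimal base R sketch follows; the two function names are made up for illustration and do not come from the package.

```r
# Hypothetical helpers, for illustration only (not taken from the package).

# Counts loop iterations using the 1:length(x) idiom that lintr flags.
count_iterations_colon <- function(x) {
  n <- 0L
  for (i in 1:length(x)) n <- n + 1L  # 1:0 expands to c(1, 0) for empty x
  n
}

# Same loop written with seq_along(), which is empty for empty input.
count_iterations_seq <- function(x) {
  n <- 0L
  for (i in seq_along(x)) n <- n + 1L
  n
}

count_iterations_colon(character(0))  # 2: the loop body runs for i = 1 and i = 0
count_iterations_seq(character(0))    # 0: the loop body is skipped

# For non-empty input both forms agree.
count_iterations_colon(letters[1:3])  # 3
count_iterations_seq(letters[1:3])    # 3
```

The same reasoning applies to `seq_len(nrow(df))` and `seq_len(ncol(df))` as replacements for `1:nrow(df)` and `1:ncol(df)`.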


4. Other Checks

Details of other checks (click to open)

:heavy_multiplication_x: The following 12 function names are duplicated in other packages:

- `dataset` from assemblerr, febr, robis
- `describe` from AzureVision, Bolstad2, describer, dlookr, explore, Hmisc, iBreakDown, ingredients, lambda.r, MSbox, onewaytests, prettyR, psych, psych, psyntur, questionr, radiant.data, RCPA3, Rlab, scan, scorecard, sylly, tidycomm
- `description` from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
- `identifier` from Ramble
- `is.dataset` from crunch
- `language` from sylly, wakefield
- `provenance` from provenance
- `set_var_labels` from xpose
- `size` from acrt, BaseSet, container, crmPack, CVXR, datastructures, deal, disto, easyVerification, EFA.MRFA, flifo, gdalcubes, gWidgets2, hrt, iemisc, InDisc, kernlab, matlab2r, multiverse, optimbase, PopED, pracma, ramify, rEMM, rmonad, simplegraph, siren, tcltk2, UComp, unival, vampyr
- `subject` from DGM, emayili, gmailr, sendgridr
- `var_labels` from formatters, sjlabelled
- `version` from BiocManager, garma, geoknife, mice, R6DS, rerddap, rsyncrosim, shiny.info, SMFilter


Package Versions

|package  |version  |
|:--------|:--------|
|pkgstats |0.1.3.11 |
|pkgcheck |0.1.2.15 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

ldecicco-USGS commented 4 months ago

Hi @antaldaniel Since you mentioned "The new version (which is an entire rewrite since the first review) ", we're going to treat this as a new submission and get new reviewers. Can you work on the 2 outstanding issues above while I look for a new editor?

Thanks @annakrystalli for the initial work!

antaldaniel commented 4 months ago

@ldecicco-USGS thank you for the heads-up, and indeed, I will fix those issues.

ldecicco-USGS commented 3 months ago

Let me know when you've updated the package (or go ahead and rerun the "bot" command to check the package). Once we've got that taken care of I'll assign a new editor. Thanks!