ropensci / software-review

rOpenSci Software Peer Review.

dataset: Create Data Frames that are Easier to Exchange and Reuse #553

Open antaldaniel opened 2 years ago

antaldaniel commented 2 years ago

Submitting Author Name: Daniel Antal
Submitting Author Github Handle: @antaldaniel
Repository: https://github.com/dataobservatory-eu/dataset/
Version submitted: 0.1.7
Submission type: Standard
Editor: @annakrystalli
Reviewers: @msperlin, @romanflury

Due date for @msperlin: 2022-09-19
Due date for @romanflury: 2022-09-21

Archive: TBD
Version accepted: TBD
Language: en

Package: dataset
Title: Create Data Frames that are Easier to Exchange and Reuse
Date: 2022-08-19
Version: 0.1.7.3
Authors@R: 
    person(given = "Daniel", family = "Antal", 
           email = "daniel.antal@dataobservatory.eu", 
           role = c("aut", "cre"),
           comment = c(ORCID = "0000-0001-7513-6760")
           )
Description: The aim of the 'dataset' package is to make tidy datasets easier to release,
    exchange and reuse. It organizes and formats data frame 'R' objects into well-referenced,
    well-described, interoperable datasets in a release- and reuse-ready form. It applies a
    subjective interpretation of the W3C DataSet recommendation and the datacube model
    <https://www.w3.org/TR/vocab-data-cube/>, which is also used in the global Statistical
    Data and Metadata eXchange standards, together with the related Dublin Core
    <https://www.dublincore.org/specifications/dublin-core/dcmi-terms/>
    and DataCite <https://support.datacite.org/docs/datacite-metadata-schema-44/> standards
    preferred by European open science repositories, to improve the findability, accessibility,
    interoperability and reusability of datasets.
License: GPL (>= 3)
URL: https://github.com/dataobservatory-eu/dataset
BugReports: https://github.com/dataobservatory-eu/dataset/issues
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
Depends: 
    R (>= 2.10)
LazyData: true
Imports: 
    assertthat,
    ISOcodes,
    utils
Suggests: 
    covr,
    declared,
    dplyr,
    eurostat,
    here,
    kableExtra,
    knitr,
    rdflib,
    readxl,
    rmarkdown,
    spelling,
    statcodelists,
    testthat (>= 3.0.0),
    tidyr
VignetteBuilder: knitr
Config/testthat/edition: 3
Language: en-US

You can find the package website at dataset.dataobservatory.eu. The article Motivation: Make Tidy Datasets Easier to Release Exchange and Reuse, which lays out a major development dilemma, will eventually be condensed into a JOSS paper.

Scope

This package is intended to give a common foundation to the rOpenGov reproducible research packages. It mainly serves communities that want to reuse statistical data (using SDMX statistical (meta)data exchange sources, like Eurostat, IMF, World Bank, OECD...) or release new datasets from primary social sciences data that can be integrated into an SDMX-compatible API or placed on a knowledge graph. Our main aim is to provide a clear publication workflow to the European open science repository Zenodo, and clear serialization strategies to RDF applications.

The dataset package aims for a higher level of reproducibility and does not detach the metadata from the R object's attributes (it is intended to be used in other reproducible research packages that will directly record provenance and other transactional metadata into the attributes). We aim to bind together dataspice and dataset by creating export functions to csv files that contain the same metadata that dataspice records. Generally, dataspice seems better suited to raw, observational data, while dataset suits statistically processed data.

The intended use of dataset is to correctly record, from the start, the referential, structural and provenance metadata retrieved by various reproducible science packages that interact with statistical data (such as the rOpenGov packages eurostat and iotables, or the oecd package).

Neither dataset nor dataspice is well suited to documenting social science survey data, which are usually held in datasets of their own. Our aim is to connect dataset, declared and DDIwR to create such datasets with DDI codebook metadata. They will create a stable new foundation for the retroharmonize package to create new, well-documented and harmonized statistical datasets from the observational datasets of social science surveys.

The zen4R package provides reproducible export functionality to the Zenodo open science repository. Interacting with zen4R may be intimidating for the casual R user, as it uses R6 classes. Our aim is to provide an export function that completely wraps the workings of zen4R when releasing the dataset.

In our experience, while the tidy data standards make reuse more efficient by eliminating unnecessary data processing steps before analysis or placement in a relational database, applying the DataSet definition and the datacube model together with the information science metadata standards makes reuse more efficient when exchanging and combining the data with other data in different datasets.

Yes

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

ropensci-review-bot commented 2 years ago

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot commented 2 years ago

:rocket:

The following problem was found in your submission template:

:wave:

adamhsparks commented 2 years ago

Hi, @antaldaniel, could you please fix the repo URL by providing a link to the package’s repository? 🙏

antaldaniel commented 2 years ago

@adamhsparks Apologies for the original issue problem, I hope all is fine now. I added both the GitHub repo and the package website URL.

mpadge commented 2 years ago

@antaldaniel Then you can start the checks yourself by calling @ropensci-review-bot check package

antaldaniel commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for dataset (v0.1.7)

git hash: 2eb439b5

Important: All failing checks above must be addressed prior to proceeding

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

|type |package | ncalls|
|:----------|:-------------|------:|
|internal |base | 159|
|internal |dataset | 79|
|internal |stats | 4|
|imports |utils | 4|
|imports |rlang | 1|
|imports |assertthat | NA|
|imports |ISOcodes | NA|
|suggests |declared | NA|
|suggests |dplyr | NA|
|suggests |eurostat | NA|
|suggests |here | NA|
|suggests |kableExtra | NA|
|suggests |knitr | NA|
|suggests |rdflib | NA|
|suggests |readxl | NA|
|suggests |rmarkdown | NA|
|suggests |spelling | NA|
|suggests |statcodelists | NA|
|suggests |testthat | NA|
|suggests |tidyr | NA|
|linking_to |NA | NA|

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats()', and examining the 'external_calls' table.

base

names (26), data.frame (14), class (12), paste (9), rep (7), sapply (7), unlist (6), which (6), attr (5), lapply (5), length (5), ncol (5), subset (4), as.character (3), attributes (3), c (3), logical (3), seq_along (3), vapply (3), as.data.frame (2), as.numeric (2), cbind (2), file (2), inherits (2), matrix (2), nrow (2), round (2), args (1), date (1), deparse (1), for (1), gsub (1), ifelse (1), is.null (1), paste0 (1), rbind (1), tolower (1), union (1), unique (1), url (1), UseMethod (1)

dataset

dimensions (6), attributes_measures (5), measures (5), all_unique (3), dataset_title (3), related_item (3), creator (2), datacite (2), dataset (2), dataset_source (2), description (2), geolocation (2), identifier (2), language (2), metadata_header (2), publication_year (2), publisher (2), related_item_identifier (2), resource_type (2), add_date (1), add_relitem (1), arg.names (1), attributes_names (1), bibentry_dataset (1), datacite_add (1), dataset_download (1), dataset_download_csv (1), dataset_export (1), dataset_export_csv (1), dataset_local_id (1), dataset_title_create (1), dataset_uri (1), dimensions_names (1), document_package_used (1), dot.names (1), dublincore (1), dublincore_add (1), extract_year (1), is.dataset (1), measures_names (1), print (1), print.dataset (1), resource_type_general (1), rights (1), subject (1), time_var_guess (1), version (1)

stats

df (2), time (2)

utils

citation (1), object.size (1), read.csv (1), sessionInfo (1)

rlang

get_expr (1)

**NOTE:** Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 26 files) and
- 1 authors
- 7 vignettes
- no internal data file
- 4 imported packages
- 56 exported functions (median 10 lines of code)
- 82 non-exported functions in R (median 15 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html). The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 26| 87.0| |
|files_vignettes | 7| 98.5| |
|files_tests | 27| 97.6| |
|loc_R | 1000| 68.2| |
|loc_vignettes | 676| 84.7| |
|loc_tests | 371| 68.8| |
|num_vignettes | 7| 99.2|TRUE |
|n_fns_r | 138| 83.6| |
|n_fns_r_exported | 56| 89.5| |
|n_fns_r_not_exported | 82| 79.7| |
|n_fns_per_file_r | 3| 55.0| |
|num_params_per_fn | 2| 11.9| |
|loc_per_fn_r | 15| 46.1| |
|loc_per_fn_r_exp | 10| 22.2| |
|loc_per_fn_r_not_exp | 15| 49.5| |
|rel_whitespace_R | 27| 78.3| |
|rel_whitespace_vignettes | 36| 88.3| |
|rel_whitespace_tests | 25| 70.7| |
|doclines_per_fn_exp | 39| 48.6| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 103| 79.7| |

---

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following check_fail:

1. no_description_date

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 67.81

The following files are not completely covered by tests:

file | coverage
--- | ---
R/creator.R | 64.29%
R/datacite_attributes.R | 0%
R/datacite.R | 46.88%
R/dataset_uri.R | 0%
R/dataset.R | 48.36%
R/document_package_used.R | 0%
R/dublincore.R | 67.74%
R/publication_year.R | 55.56%
R/related_item.R | 66.67%

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following functions have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
datacite_add | 24
dublincore_add | 23

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 383 potential issues:

message | number of times
--- | ---
Avoid 1:ncol(...) expressions, use seq_len. | 4
Avoid library() and require() calls in packages | 20
Avoid using sapply, consider vapply instead, that's type safe | 4
Lines should not be more than 80 characters. | 352
Use <-, not =, for assignment. | 3


4. Other Checks

Details of other checks (click to open)

:heavy_multiplication_x: The following 10 function names are duplicated in other packages:

- `dataset` from assemblerr, febr, robis
- `description` from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
- `dimensions` from gdalcubes, openeo, sp, tiledb
- `identifier` from Ramble
- `is.dataset` from crunch
- `language` from sylly, wakefield
- `measures` from greybox, mlr3measures, tsibble
- `size` from acrt, BaseSet, container, crmPack, CVXR, datastructures, deal, disto, easyVerification, EFA.MRFA, flifo, gdalcubes, gWidgets2, hrt, iemisc, InDisc, kernlab, matlab2r, multiverse, optimbase, PopED, pracma, ramify, rEMM, rmonad, simplegraph, siren, tcltk2, UComp, unival, vampyr
- `subject` from DGM, emayili, gmailr, sendgridr
- `version` from BiocManager, garma, geoknife, mice, R6DS, rerddap, rsyncrosim, shiny.info, SMFilter


Package Versions

|package |version |
|:--------|:--------|
|pkgstats |0.1.1.20 |
|pkgcheck |0.1.0.3 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

adamhsparks commented 2 years ago

Hi again, @antaldaniel. If you could please address the issues that the bot flagged with the ✖️, then I can proceed with your submission.

antaldaniel commented 2 years ago

Hi @adamhsparks I hope I managed to add these things, with the following exception.

- ✔️ does not have a 'codemeta.json' file → added with codemetar.
- ✔️ does not have a 'contributing' file → added CONTRIBUTING.md.
- ✔️ These functions do not have examples: [attributes_measures]. → added.
- ✖️ Function names are duplicated in other packages.

I tried to avoid duplications while keeping in mind the rOpenSci duplication guidelines, and at this point I do not see which the duplications are, or whether there is any sensible way to resolve them.

Your guidelines state "Avoid function name conflicts with base packages or other popular ones (e.g. ggplot2, dplyr, magrittr, data.table)". The package currently has no name conflict with any package that I was thinking of being used together with it, and I do not know how to test for this. (Apologies if this is somewhere in the 1.3 Package API.)
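For what it is worth, one minimal way to check for masking between attached packages is base R's conflicts(); a sketch, assuming both packages are installed locally:

library(dataset)
library(dplyr)
conflicts(detail = TRUE)  # lists objects masked between the attached packages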

- ✔️ Package has no continuous integration checks → added.
- ✖️ Package coverage is 67.8% (should be at least 75%).

I do not see a sensible way to achieve 75%+ code coverage with a metadata package that is in an early development phase and still has development questions open (see Motivation: Make Tidy Datasets Easier to Release Exchange and Reuse; hence the submission here before the first CRAN release). For comparison, other metadata management packages in this category, like codemetar, have 42% coverage, and EML has 65%, both below the current coverage of dataset before its first release.

mpadge commented 2 years ago

@antaldaniel You may indeed ignore the "Function names are duplicated in other packages." That will soon be changed from a failing check (:heavy_multiplication_x:) to an advisory note only. Sorry for any confusion there. @adamhsparks will comment further on the code coverage.

antaldaniel commented 2 years ago

@mpadge I do not seem to find the output where this information is coming from, but I think it is nevertheless a very useful reminder, and it would be good to see what conflicts your bot has found. Again, apologies if I ask the obvious, but where can I check which duplicates were flagged by your bot?

mpadge commented 2 years ago

It's in the check results. Under "4. Other Checks", you'll see a "Details of other checks (click to open)". You can also generate those yourself by running:

library(pkgcheck)
checks <- pkgcheck("/<path>/<to>/<dataset-pkg>")
checks_md <- checks_to_markdown(checks, render = TRUE)

That will automatically open an HTML-rendered version of the checks, just like the above. You can use that repeatedly as you work through the issues highlighted above.

antaldaniel commented 2 years ago

@mpadge Oh, really, sorry for asking the obvious.

I would like to comment here on the issue in substance, then. The main development question of the package, which aims to turn R objects into standard datasets (as defined by W3C and SDMX) by adding structural and referential metadata, is whether the best way to do this is to create an S3 object or not (see the dilemma here).

In the current stage, it is a pseudo object inherited from data.frame, but it can also be seen as a utility for any data.frame, tibble, or data.table (or similar tabular) R object. The functions that have duplicates in other packages follow a very simple naming convention; I think this is the cleanest API interface I can think of. For example,

subject() gets the metadata attribute Subject and subject<-() sets it. As DataCite, Dublin Core and schema.org have dozens of potential attributes, to me the easiest is to use the name of the attribute, in slightly modified form, to set/get its value.

All these functions are lowercase and manipulate a camelCase standard attribute, except for the SDMX attribute 'attribute', which would create a conflict with the base R attributes() function.
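As an illustration of that convention, a minimal sketch of such a getter/setter pair (simplified for exposition; the package's actual implementations may differ):

# Getter: read the "Subject" metadata attribute of any data.frame-like object
subject <- function(x) {
  attr(x, "Subject")
}

# Setter: a replacement function, so users can write subject(df) <- value
`subject<-` <- function(x, value) {
  attr(x, "Subject") <- value
  x
}

df <- data.frame(geo = c("NL", "BE"), value = c(1, 2))
subject(df) <- "Economic statistics"
subject(df)
#> [1] "Economic statistics"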

adamhsparks commented 2 years ago

Hi @antaldaniel, I can understand the difficulty in writing tests for such a non-standard package. But I've had a look at covr::report() for "dataobservatory-eu/dataset". I think there is still low-hanging fruit here that can be covered to get your code coverage up to the 75% that we ask for.

For instance, Lines 40-43 are covered but Lines 44-45 aren't. These are seemingly the same, except for checking on 2- or 3-letter ISO codes, unless I'm mistaken.

Nor are the message responses within the stop() calls in the same file checked.
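Covering such a branch with testthat is usually a one-liner; the helper below is hypothetical, standing in for the ISO-code check described above:

# Hypothetical validator resembling the check discussed above
check_language <- function(code) {
  if (!nchar(code) %in% c(2, 3)) {
    stop("Language must be a 2- or 3-letter ISO 639 code.")
  }
  code
}

# Asserting on the message text also covers the stop() branch
testthat::expect_error(check_language("abcd"), "2- or 3-letter ISO 639 code")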

Could I ask that you have another look and see if you can't further improve the coverage a bit more?

antaldaniel commented 2 years ago

Hi @adamhsparks I went up to 71.27%, but further changes are not very productive. I did not extensively cover two areas: one is the constructor for dataset() itself, where I expect potentially breaking changes, and the other is the file I/O area, where I would like to come up with a more general solution and also avoid tests being run on CRAN later. As the overwrite function and its messages make up the most branches, this is a bit of playing with percentages, as the very same copied test is tested again and again.

Do you have a good solution to include download and file I/O tests that run fast enough or cause no disruption when later run on CRAN?

antaldaniel commented 2 years ago

@adamhsparks I am now well above your threshold, and apologies for the trivial error. I wanted to omit some issues in the dataset() constructor, but I did not realize that it contained some old code that had since been rewritten; the tests were omitting it, of course, but it sat at the bottom of the file. The package is now 81.2% covered. I know that this has to improve, but I'd prefer to do it when some issues are resolved in a clear direction (see my comment above).

adamhsparks commented 2 years ago

Hi @antaldaniel, that's great to see. Thank you for rechecking everything and updating.

If you have tests that you feel are unconducive to CRAN, I'd just use (and I do liberally use) skip_on_cran(). Reviewers should hopefully be able to guide you further on this.
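For example (a sketch; the test name is illustrative, and skip_if_offline() additionally requires the curl package):

testthat::test_that("dataset_download() retrieves a remote file", {
  testthat::skip_on_cran()     # never run network-dependent tests on CRAN
  testthat::skip_if_offline()  # also skip locally when there is no connection
  # ... download and file I/O expectations go here ...
})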

adamhsparks commented 2 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 2 years ago

Thanks, about to send the query.

ropensci-review-bot commented 2 years ago

:rocket:

Editor check started

:wave:

ropensci-review-bot commented 2 years ago

Checks for dataset (v0.1.7.0002)

git hash: 93c03c54

Important: All failing checks above must be addressed prior to proceeding

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

|type |package | ncalls|
|:----------|:-------------|------:|
|internal |base | 147|
|internal |dataset | 66|
|internal |stats | 2|
|imports |utils | 2|
|imports |assertthat | NA|
|imports |ISOcodes | NA|
|suggests |covr | NA|
|suggests |declared | NA|
|suggests |dplyr | NA|
|suggests |eurostat | NA|
|suggests |here | NA|
|suggests |kableExtra | NA|
|suggests |knitr | NA|
|suggests |rdflib | NA|
|suggests |readxl | NA|
|suggests |rmarkdown | NA|
|suggests |spelling | NA|
|suggests |statcodelists | NA|
|suggests |testthat | NA|
|suggests |tidyr | NA|
|linking_to |NA | NA|

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats()', and examining the 'external_calls' table.

base

names (21), class (12), data.frame (10), paste (9), vapply (9), rep (7), character (6), unlist (6), attr (5), lapply (5), length (5), ncol (5), subset (4), as.character (3), c (3), seq_along (3), as.data.frame (2), as.numeric (2), attributes (2), cbind (2), file (2), inherits (2), logical (2), matrix (2), nrow (2), round (2), which (2), date (1), for (1), ifelse (1), is.null (1), paste0 (1), rbind (1), seq_len (1), tolower (1), union (1), unique (1), url (1), UseMethod (1)

dataset

attributes_measures (5), dimensions (4), all_unique (3), dataset_title (3), measures (3), creator (2), datacite (2), dataset (2), dataset_source (2), description (2), geolocation (2), identifier (2), language (2), metadata_header (2), publication_year (2), publisher (2), related_item_identifier (2), resource_type (2), bibentry_dataset (1), datacite_add (1), dataset_download (1), dataset_download_csv (1), dataset_export (1), dataset_export_csv (1), dataset_local_id (1), dataset_title_create (1), dataset_uri (1), dublincore (1), dublincore_add (1), extract_year (1), is.dataset (1), print (1), print.dataset (1), related_item (1), resource_type_general (1), resource_type_general_allowed (1), rights (1), subject (1), time_var_guess (1), version (1)

stats

df (2)

utils

object.size (1), read.csv (1)

**NOTE:** Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 24 files) and
- 1 authors
- 7 vignettes
- no internal data file
- 3 imported packages
- 56 exported functions (median 10 lines of code)
- 66 non-exported functions in R (median 15 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html). The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R | 24| 85.5| |
|files_vignettes | 7| 98.5| |
|files_tests | 28| 97.7| |
|loc_R | 889| 64.9| |
|loc_vignettes | 676| 84.7| |
|loc_tests | 432| 72.0| |
|num_vignettes | 7| 99.2|TRUE |
|n_fns_r | 122| 81.1| |
|n_fns_r_exported | 56| 89.5| |
|n_fns_r_not_exported | 66| 74.6| |
|n_fns_per_file_r | 3| 54.4| |
|num_params_per_fn | 2| 11.9| |
|loc_per_fn_r | 11| 32.3| |
|loc_per_fn_r_exp | 10| 22.2| |
|loc_per_fn_r_not_exp | 15| 49.5| |
|rel_whitespace_R | 27| 75.4| |
|rel_whitespace_vignettes | 36| 88.3| |
|rel_whitespace_tests | 28| 76.4| |
|doclines_per_fn_exp | 39| 48.6| |
|doclines_per_fn_not_exp | 0| 0.0|TRUE |
|fn_call_network_size | 103| 79.7| |

---

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

#### 3a. Continuous Integration Badges

[![pkgcheck](https://github.com/dataobservatory-eu/dataset/workflows/pkgcheck/badge.svg)](https://github.com/dataobservatory-eu/dataset/actions)

**GitHub Workflow Results**

| id|name |conclusion |sha | run_number|date |
|----------:|:-------------|:----------|:------|----------:|:----------|
| 2891146042|pkgcheck |failure |93c03c | 17|2022-08-19 |
| 2891146050|test-coverage |success |93c03c | 20|2022-08-19 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following check_fail:

1. no_description_date

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 82.12

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following functions have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
datacite_add | 24
dublincore_add | 23

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 370 potential issues:

message | number of times
--- | ---
Avoid library() and require() calls in packages | 20
Lines should not be more than 80 characters. | 350


4. Other Checks

Details of other checks (click to open)

:heavy_multiplication_x: The following 10 function names are duplicated in other packages:

- `dataset` from assemblerr, febr, robis
- `description` from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
- `dimensions` from gdalcubes, openeo, sp, tiledb
- `identifier` from Ramble
- `is.dataset` from crunch
- `language` from sylly, wakefield
- `measures` from greybox, mlr3measures, tsibble
- `size` from acrt, BaseSet, container, crmPack, CVXR, datastructures, deal, disto, easyVerification, EFA.MRFA, flifo, gdalcubes, gWidgets2, hrt, iemisc, InDisc, kernlab, matlab2r, multiverse, optimbase, PopED, pracma, ramify, rEMM, rmonad, simplegraph, siren, tcltk2, UComp, unival, vampyr
- `subject` from DGM, emayili, gmailr, sendgridr
- `version` from BiocManager, garma, geoknife, mice, R6DS, rerddap, rsyncrosim, shiny.info, SMFilter


Package Versions

|package |version |
|:--------|:--------|
|pkgstats |0.1.1.20 |
|pkgcheck |0.1.0.3 |


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

adamhsparks commented 2 years ago

@ropensci-review-bot assign @melvidoni as editor

ropensci-review-bot commented 2 years ago

Assigned! @melvidoni is now the editor

melvidoni commented 2 years ago

@ropensci-review-bot seeking reviewers

ropensci-review-bot commented 2 years ago

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/553_status.svg)](https://github.com/ropensci/software-review/issues/553)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

antaldaniel commented 2 years ago

For clarity: apart from adding the README badge, I made a few URL corrections and added a paragraph to the Motivation vignette.

melvidoni commented 2 years ago

@ropensci-review-bot assign @duttashi as reviewer

ropensci-review-bot commented 2 years ago

@duttashi added to the reviewers list. Review due date is 2022-09-13. Thanks @duttashi for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 2 years ago

@duttashi: If you haven't done so, please fill this form for us to update our reviewers records.

melvidoni commented 2 years ago

@ropensci-review-bot assign @msperlin as reviewer

ropensci-review-bot commented 2 years ago

@msperlin added to the reviewers list. Review due date is 2022-09-19. Thanks @msperlin for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 2 years ago

@msperlin: If you haven't done so, please fill this form for us to update our reviewers records.

melvidoni commented 2 years ago

@ropensci-review-bot remove @duttashi from reviewers

ropensci-review-bot commented 2 years ago

@duttashi removed from the reviewers list!

melvidoni commented 2 years ago

@ropensci-review-bot assign @romanflury as reviewer

ropensci-review-bot commented 2 years ago

@romanflury added to the reviewers list. Review due date is 2022-09-21. Thanks @romanflury for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot commented 2 years ago

@romanflury: If you haven't done so, please fill this form for us to update our reviewers records.

msperlin commented 2 years ago


Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing:


Review Comments

Overall, I find the package interesting and well done. In my assessment, the R code is good and passes all checks and tests. However, as with any work in development, there are some points to improve. Please find my suggestions next.

melvidoni commented 2 years ago

@ropensci-review-bot submit review time 4

ropensci-review-bot commented 2 years ago

Logged review for msperlin (hours: 4)

antaldaniel commented 2 years ago

Thank you very much for the review. You should see this package as a very early concept; it will be the platform for many packages of the rOpenGov collaboration (many of which could also become part of rOpenSci).

For me, the most important questions are what I put into the dilemmas:

The criterion for this is the best integration with the rOpenSci packages that handle metadata or create a binding to RDF, and connection to zen4R, which uses R6 classes.

I do not find the W3C dataset or the datacube names very descriptive, because we will provide conversion not only to W3C but also to SDMX formats, and not only to datacubes but to any dataset.

The smaller remarks on the functions are welcome, but again, these are just dummy functions to decide in which direction to develop the package to connect well with metadata, RDF, and open science deposition tools.

msperlin commented 2 years ago

Daniel,

should the package build a new dataset class, and if yes, should it be inherited from data.frame or tibble? or alternatively just create functions that can improve a data.frame, tibble, or data.table object.

This is a design choice. In my opinion, you'll be fine in any direction, as long as the code/input is flexible towards existing classes.

I do not find the W3C dataset or the datacube names very descriptive, because we will provide conversion not only to W3C but also to SDMX formats, and not only to datacubes but to any dataset.

Ok.

The smaller remarks on the functions are welcome, but again, these are just dummy functions to decide in which direction to develop the package to connect well with metadata, RDF, and open science deposition tools.

Fair enough. But core functions will be mostly similar in any design choice you make, so it seems right to find their best version right now. If you are not willing to change them because they are "dummy" functions, it would have been nice to let us know before the review.

ropensci-review-bot commented 2 years ago

:calendar: @msperlin you have 2 days left before the due date for your review (2022-09-19).

antaldaniel commented 2 years ago

I am a little bit puzzled, because you suggest submitting packages before they are on CRAN and developed to a level of detail where it is hard to make breaking changes. So I did, with far more documentation on future plans than on this skeleton package itself, and I posed explicit development questions, particularly how best to integrate the package with dataspice and rdflib, which are rOpenSci packages; our package will add a lot to their functionality, plus bridge them to several dozen rOpenGov packages.

https://dataset.dataobservatory.eu/articles/index.html Particularly here: https://dataset.dataobservatory.eu/articles/motivation.html

As the package is intended to work well with two rOpenSci and several rOpenGov packages (which, in turn, would be very low-hanging fruit to bring into rOpenSci as well; some of them are very well established packages with many thousands of users in Europe), I really did not want to start working out I/O functions before clarifying what the best structure to import into would be.

  1. In my view, dataset can add a lot to dataspice, because dataspice has a non-reproducible workflow and chose a metadata scheme that is very general but not usually used in science.
  2. We want to create a reproducible tool that converts statistical findings into RDF format and feeds them into rdflib, or imports from the increasing number of RDF resources. As the European open data portal and Eurostat are moving in this direction, and we provide APIs for those, we could bring in a lot of use cases for rdflib, which, I think, is a sleeping giant; even its documentation acknowledges that it is still missing a good use case.

I think it is important that we can create a data frame that works well with web services, and which is at least compatible with data.table. My most fundamental development question is whether to make this another S3 class, or to build on a well-defined object such as data.frame or tibble. If anyone could pass on a view on tibble vs data.frame, that would be particularly useful. I work only in the tidyverse, and I cannot gauge the value of backward compatibility and lean packaging in favour of data.frame.

melvidoni commented 2 years ago

@antaldaniel @msperlin Thank you for the discussion, it's very valuable. However, please keep it framed on this particular package, which is the focus of this review. Remember that reviews are not adversarial, but constructive feedback towards making the package the best it can be.

antaldaniel commented 2 years ago

@msperlin Thank you for your comments on the unit tests and the spelling, which I will implement now. I will also create a new README.

This leads me to the comment of @melvidoni. This package is not supposed to work in isolation; as it is a metadata package, the main development dilemma is how best to connect it to the dataspice, rdflib, and zen4R packages (these are rOpenSci packages), as outlined in the documentation. I see these as the real export routes, not writing a csv file.

If there is any misunderstanding about the development goals of the package, I will try to reflect on the comments, to make them easier to implement.

I believe that the other comments all relate to the basic question that I asked in the documentation of the package, i.e. what the S3 class of the dataset object should be, which as a consequence leads to the question of a suitable print method and a unit test for parameter x. I will just rephrase the dilemma that is implicitly left open, and unless there is any further comment on the issues below, I will leave the current design untouched: dataset() will be an S3 constructor inherited from data.frame, and I will remain as close to base R as possible.

Checking input x: again, this goes back to the dilemma of dependencies and the main class, and I am not really sure what the limitations on the parameter x should be. @msperlin's suggestion seems to side with the argument to make it a data.frame-like object. This is close to the current design, where dataset is an S3 class inherited from data.frame, and the test could in this case accept anything inherited from data.frame.

An alternative is not to make any real restrictions and not to treat datasets as an S3 class. In this case x can be any R object; however, this relates to a very basic design question: should the dataset be an S3 class with methods, including, of course, a nicer print method as you suggested, or should it work with any R object? A dataset can be placed into a list or a JSON object, or a table, too. There could be strong arguments to allow any R object that has at least a two-dimensional structure to be x, i.e. the data container of the dataset, and the current quasi-methods of the dataset class would remain helper functions.
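To make the first of these designs concrete, a minimal sketch of an S3 constructor inherited from data.frame (illustrative only; the package's actual dataset() constructor records far more metadata):

dataset <- function(x, title = NULL) {
  stopifnot(inherits(x, "data.frame"))  # accepts data.frame, tibble, data.table
  attr(x, "Title") <- title
  class(x) <- c("dataset", class(x))
  x
}

is.dataset <- function(x) inherits(x, "dataset")

print.dataset <- function(x, ...) {
  if (!is.null(attr(x, "Title"))) cat("Dataset:", attr(x, "Title"), "\n")
  NextMethod()  # fall through to the underlying data.frame/tibble print
}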

Import comments: as for imports, and related to the README, the first use cases will be reproducible research packages, like the rOpenGov packages, which will hand over a tibble or a data.frame, not a file. Further potential import routes could be the rOpenSci data access packages. The aim would be to make the output of such packages available for machine-to-machine communication on the web, hence improving interoperability, and also findability and reusability. This relates to the remarks in the review about 'selling' the package: this package will be the foundation of many packages that have wide use. As this will be an upstream package, and a dependency of other packages, the README will mainly target package developers.

Re: why don't we see a dataset_import_csv()? Going back to the previous dilemma, this function in the current design would have no added value apart from possibly retaining the path of the original csv file. The dataset() is an S3 class constructor that works not from a file but from an R object (which is most likely a data.frame). I see value in importing when the source file has valuable metadata. A good use case for this is primary data collected by social science surveys. We are developing, in parallel, a package for this, as in that case not only W3C/schema.org/SDMX but also DDI compliance is needed (DDI defines data documentation). This is explained in some detail in The survey Class article. The general question, again, is whether this is practical at all using S3 classes.

There is already a csv-based metadata package in rOpenSci, i.e. dataspice. Connecting the two packages would be valuable indeed, because dataspice currently offers a not-really-reproducible, multiple-csv-file-based solution to the problems that dataset solves with the dataset() constructor. The creators of dataspice use a different dataset definition, and this could be easily reconciled. dataspice could be recommended for users who seek a spreadsheet-based, manual solution, and dataset for those who seek a programmatic approach. There are plenty of other synergies: dataset could rely on the html output of dataspice, while dataset could provide a release route for dataspice in machine-to-machine situations.

As for using readr as a dependency, I would like to keep the core dataset package as lean as possible, and rather reduce than increase its dependencies. I have an alternate version that could work with any I/O function, treating the suggested readr::write_csv and data.table::fwrite as .f function parameters, similarly to purrr. However, I stayed with base R for now, as it is undecided what the expected class of x should be, whether datasets should be an S3 class at all, and, if yes, whether they should be tibbles or data frames. Moving towards the tidyverse would make the use of write_csv logical, of course. But again, I think the exporting should, if possible, be done by dataspice, rdflib, and zen4R. Currently the dataset is, after all, a data.frame that can be written into a file with any package the user likes.
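The alternate version mentioned above would look roughly like this (a sketch; dataset_write() and its signature are hypothetical):

# The caller supplies the writer, so the package needs no hard I/O dependency
dataset_write <- function(ds, file, .f = utils::write.csv, ...) {
  .f(ds, file, ...)
}

# dataset_write(ds, "ds.csv")                           # base R default
# dataset_write(ds, "ds.csv", .f = readr::write_csv)    # tidyverse
# dataset_write(ds, "ds.csv", .f = data.table::fwrite)  # data.table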

romanflury commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing:


Review Comments

In general, I like the idea of the package dataset and how the functions are implemented. In my opinion, it will be a valuable addition to the rOpenSci package network. Especially the manual pages of the functions are written with great detail and care, which eases the use of this package immensely. All major points from the checklist above are fulfilled.

Based on the author’s comments in this issue, it is my understanding that at this development state not a full-fledged review of this package is asked for, but a discussion of design and structure on a meta level. In the following, you can find my personal suggestions and some minor comments for other improvements.

what should be the s3 class of the dataset

I support inheriting the S3 constructor from data.frame and keeping everything as close to base R as possible. In my opinion, this is the most flexible approach and the way to go if one seeks to have a generic package. But at the end of the day, this is probably also a matter of taste.

export routes

You mention this package is to be understood as a metadata package, and you defined as export routes the packages dataspice, rdflib, and zen4R. Maybe you could document how and where the dataset package fits into the entire process provided by these packages and what the best practices are, e.g., in the spirit of https://cran.r-project.org/web/packages/dataspice/vignettes/overview.html section 2.

In the context of processing a dataset, do you think it is possible to streamline this process by concatenating multiple functions? Or do you see the package more as a toolbox of functions? Depending on this answer, one could bring back the discussion about the name of the package. In my opinion, the name could be improved to indicate its specific purpose, as mentioned above by @msperlin.

dublincore_add()

This function allows specifying the Creator attribute and thereby uses the function person(), which allows many different roles. However, for the attribute Creator this role should at least default to “aut”, or even be limited to a few roles (https://stat.ethz.ch/R-manual/R-devel/library/utils/html/person.html).
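A sketch of that constraint (a hypothetical wrapper around utils::person(), not the package's current API):

dataset_creator <- function(given, family, role = "aut", ...) {
  # Restrict Creator to authorship-type roles, defaulting to "aut"
  stopifnot(all(role %in% c("aut", "cre", "ctb")))
  utils::person(given = given, family = family, role = role, ...)
}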

geolocation()

It would be valuable if not only the geolocation as a string but also geoLocation points, as described in https://support.datacite.org/docs/datacite-metadata-schema-v44-recommended-and-optional-properties section 18ff., could be supported.
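For instance, a geoLocation point in the DataCite 4.4 sense could be represented roughly like this (a sketch of the structure, using the schema documentation's own example values, not an implemented API):

geolocation <- list(
  geoLocationPlace = "Atlantic Ocean",
  geoLocationPoint = list(pointLongitude = -67.302, pointLatitude = 31.233)
)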

I hope this helps with deciding on the general strategy and continuing with the development of this package! Best wishes, roman

antaldaniel commented 2 years ago

@romanflury Thank you very much, this is very helpful and I will make the changes.
In general, this is reverse package engineering: we are moving out of well-established packages a few functions that we think could support other similar packages. So it is indeed a very early development phase, but with some mature elements that we would just like to generalize for a wider user base.

About your question on concatenating functions: if I understand correctly, we would like to do three things (a schematic sketch follows the list).

a) Generalize the tidy dataset concept and make sure that the user creates a truly tidy dataset. This may be easier done with dplyr and tidyr, and probably merits the switch from data.frame to tibble. We have a bunch of packages at rOpenGov that already do this, and we want to create a 'container' for them.
b) Make sure that the dataset has enough metadata that it can truly be machine-processed and reused (this very much overlaps with dataspice, but we use the definitions of global libraries and statistical agencies; to me dataspice is more of a 'generic' schema, probably corporate-oriented). This is done, but as @msperlin pointed out, it prints ugly, for example.
c) Then send the data through zen4R [into an open repository] or the rdflib serialization route [to place it on a web 3.0 knowledge graph], or publish via dataspice [which is a good web 1.0 solution for the same]. The csv export is in fact an exception for manual release, such as upload to Zenodo, where zen4R provides a programmatic upload. Another easy export route is dbplyr or RSQLite into relational databases.
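Schematically, the three steps might chain like this (dataset() and dublincore_add() appear in the package's check tallies; the arguments shown and the release step are placeholders):

ds <- dataset(my_tidy_df)                      # a) wrap an already-tidy data frame
ds <- dublincore_add(ds, Creator = "J. Doe")   # b) attach standard descriptive metadata
# c) release: hand the object to zen4R (Zenodo), rdflib (RDF), or dataspice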

These are not very complicated things; for example, I think that zen4R does all we need, but it uses an R6 class that can be intimidating for some users, and basically we will just copy the dataset S3 object into the R6 class of zen4R.

The problem with data.frames, or generally with a 'traditional' approach of tables or databases (multiple tables), is that they are missing the semantic information and various administrative information (reuse rights, authorship, etc.) that would make the data frames reusable by another user or entity. We want to create an object that 'carries' a dataset in a way that it can be machine-read, released, and reused.

The reason why I prefer the more generic 'dataset' name is that we currently apply two important 'dataset' definitions, those of the W3C and SDMX, and we could easily add a third, i.e. the schema.org definition used by dataspice. So, after all, the package makes sure that the packaging of a dataset as stored in a data.frame, a data.table or a tibble actually conforms to the 'dataset' definitions of schema.org (general), SDMX (statistical agencies and applications) and W3C (semantic web, knowledge graphs), and the 'table' definition of relational databases.

The closest mental concept to what we are doing is the tidy dataset, though focusing not so much on tidiness as on interoperability and reusability, helped by standard metadata and semantic description. Of course, any reference to the tidyverse or tidyr would be rather misleading. The datacube is very particular to SDMX and has a bit of a 1990s feeling. 'Metadata' is not descriptive, as there are many forms of metadata, and we do not only work with W3C dataset definitions but with SDMX, and eventually schema.org, too. I had actually considered all the data* variations of @msperlin earlier, but I think they are not descriptive enough. While the consensus is on an S3 class inherited from data.frame, in fact the problem we solve could be a problem of any R object that has a dataset structure, meaning that it is organized in rows and columns but lacks the documentation/metadata/semantic information needed to be cited, confirmed to be lawfully reused, and understood by a machine to import it into a database or find it.

I also have to make a disclaimer: the package is not funded at the moment, but we received a very significant European grant for the next three years to build an ecosystem of packages promoting open social sciences in Europe, and dataset is meant to be a 'carrier' of data.frames for many applications with a relatively wide user base.