antaldaniel opened 1 year ago
:calendar: @romanflury you have 2 days left before the due date for your review (2022-09-21).
@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/553#issuecomment-1250959076 time 8
Logged review for romanflury (hours: 8)
Dear all,
First of all, thanks @msperlin and @romanflury for your thoughtful reviews.
@antaldaniel I took the discussion of this package to the Associate Editors, and we arrived at the following decisions:
Regarding package names, as long as a package name is not offensive, you are allowed to name it whatever you want. Nevertheless, `dataset` is a very broad term that also refers to a type of data. The latter may make it especially difficult for less experienced R users, and thus a more descriptive/creative name could possibly bridge this gap.
I personally have a formal background in software engineering (meaning, I am a software engineer). At first, I thought your idea of 'dummy functions' was a nice workaround to create 'interfaces' (in the sense of object-oriented programming), something I understand R does not provide (yet). Nevertheless, the peer review process is not equipped to develop a consensus on common standards across packages.
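To make the 'dummy functions as interfaces' idea concrete, here is a minimal, hypothetical sketch (not the package's actual API) of how an S3 generic with a failing default method can play the role of an OOP-style interface: any class that wants to honour the contract must supply its own method.

```r
# Hypothetical sketch: an S3 generic acting as an 'interface'.
# The default method only signals that implementers must provide
# their own method, mimicking an abstract interface in OOP.
describe <- function(x, ...) UseMethod("describe")

describe.default <- function(x, ...) {
  stop("classes using this interface must implement a describe() method")
}

# A class 'opting in' to the interface by supplying a method:
describe.dataset <- function(x, ...) {
  paste("A dataset with", nrow(x), "rows and", ncol(x), "columns")
}

df <- structure(data.frame(a = 1:3, b = 4:6),
                class = c("dataset", "data.frame"))
describe(df)
```

The function and class names here are illustrative only; R provides no compile-time enforcement, so the 'interface' is a convention, not a guarantee.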
You established that this is "a metadata package that is in an early development stage, still has development questions open". Moreover, you have also mentioned other packages you used as inspiration multiple times. Do note that it is undesirable to point at packages that were onboarded a long time ago to justify not meeting goals at submission. This is mostly because the onboarding rules have evolved over time, and the standards we have now may not resemble those of "early-stage-rOpenSci".
Therefore, the final decision was to pause the review process to give @antaldaniel time to contact the authors of other packages. Do note that such feedback would be outside of this review. Until then, the review will be marked as on hold in a following comment.
@ropensci-review-bot put on hold
Submission on hold!
Thank you very much for the comments and the useful experience.
I created the following milestones based on this review.
As several functions from heavily used rOpenGov packages are moving to dataset, this will have to be done this year; that will create basic usability, and it will at least lay out a blueprint for integrating with rOpenGov packages, which can be useful for any data retrieval packages in rOpenSci, too.
For the enhanced usability of the packages, I created these rOpenSci relevant milestones, which are meant to conceptually review how the packages can work together.
I expect this to be finished by the end of Q1 2023, when a draft concept paper will close this rather exploratory development phase.
We'd like to exchange data with statistical agencies and knowledge graphs on a large scale in the second half of 2023, I think that is when a first paper draft is due and the package will reach its more mature shape.
Hi @antaldaniel,
Thanks. I appreciate the chance to contribute to making datasets better.
As for the future, it seems you still have some structural decisions to make and I'm sure you'll sort it out with time.
best,
Hello all, and thank you. @antaldaniel, we will keep this on hold until Q1, when you have completed the package. Please let us know by then.
@ropensci-review-bot assign @annakrystalli as editor
Assigned! @annakrystalli is now the editor
Hi @annakrystalli , just wanted to give a short update. The small changes suggested in this thread were implemented, and the early version of the package was released on CRAN. I am devising a 2-year development plan for the package and have a clear overview of planned milestones. When done, I will contact the other mentioned package owners/maintainers with this plan. With the main developers, who are not software engineers, but statisticians with statistical software development expertise, we will have a kick-off meeting in the last week of January.
Ok great! Thanks for the update @antaldaniel
Hello @antaldaniel ! Was wondering whether you had any updates on progress on the package?
Hi @annakrystalli , there has been very little change, only in documentation; I have secured development funding and will publish a more detailed development concept and look for paid and volunteer contributors in the coming weeks. I would like to ask you what would be an excellent way to do so; apart from adding this as a vignette to this early-stage package, would it be possible to raise attention with a blog post or something similar?
Great to hear you have secured development funding! You are always welcome to advertise on the rOpenSci slack, especially in the #jobs channel. Blog posts are always a good idea, but the rOpenSci blog is reserved for promoting packages once they have completed review, so it wouldn't be appropriate at this stage.
After a very long time, here is a conceptual working paper on the development with far more detailed specification than before, and some code ideas:
Making Datasets Truly Interoperable in R is a working paper to accompany the development of the package.
The working paper can be referenced with:
I am also looking for volunteer and potentially paid contributors to the package.
The source file is usually more recent: `dataset-working-paper.qmd`
Thank you for the update @antaldaniel !
Good to hear you are making progress with the plans. Ultimately, I feel the package will still remain on hold until it has been developed enough to be considered, if not ready, pretty close to release. That's when feedback from reviewers will be most useful, and it is also more aligned with what reviewers are expected to contribute their views on.
Let us know when you feel you have reached that stage!
@annakrystalli I think that the review would be useful now, because I am implementing this working paper Making Datasets Truly Interoperable now. I just sent a new version to CRAN, but there is still room to review. Also, if somebody wants to get involved in the development, I do have a public grant for it and could take on a co-developer.
The new version (which is an entire rewrite since the first review) is on the dataset.dataobservatory.eu/ website with the connecting GitHub repo. I see a problem, though, with your CI attached to the package: it throws errors which to me look like configuration errors and not real errors; the package builds just fine on AppVeyor and R-hub.
@ropensci-review-bot check package
Thanks, about to send the query.
:rocket:
The following problem was found in your submission template:
:wave:
git hash: b1dca41e
Important: All failing checks above must be addressed prior to proceeding
(Checks marked with :eyes: may be optionally addressed.)
Package License: GPL (>= 3)
The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.
|type |package | ncalls|
|:----------|:-------------|------:|
|internal |base | 312|
|internal |dataset | 178|
|internal |graphics | 6|
|imports |assertthat | 22|
|imports |utils | 11|
|imports |stats | 10|
|imports |ISOcodes | NA|
|suggests |dataspice | NA|
|suggests |covr | NA|
|suggests |declared | NA|
|suggests |dplyr | NA|
|suggests |eurostat | NA|
|suggests |here | NA|
|suggests |kableExtra | NA|
|suggests |knitr | NA|
|suggests |rdflib | NA|
|suggests |readxl | NA|
|suggests |rmarkdown | NA|
|suggests |spelling | NA|
|suggests |statcodelists | NA|
|suggests |testthat | NA|
|suggests |tidyr | NA|
|suggests |tibble | NA|
|suggests |nycflights13 | NA|
|suggests |tsibble | NA|
|suggests |data.table | NA|
|linking_to |NA | NA|
Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(
as.character (40), ifelse (40), is.null (38), list (30), c (16), data.frame (14), names (10), lapply (8), attr (7), paste0 (7), inherits (6), class (5), col (5), drop (4), invisible (4), seq_along (4), which (4), as.POSIXct (3), character (3), date (3), for (3), format (3), length (3), ncol (3), Sys.time (3), unlist (3), vapply (3), all (2), args (2), as.data.frame (2), as.numeric (2), dim (2), paste (2), rbind (2), round (2), substitute (2), t (2), url (2), with (2), apply (1), as.Date (1), cbind (1), comment (1), do.call (1), environment (1), get (1), if (1), max (1), nchar (1), new.env (1), range (1), rep (1), substr (1), switch (1), Sys.Date (1)
dataset_bibentry (28), dataset_title (10), dataset (8), rights (8), subject (8), creator (7), description (6), publisher (6), identifier (5), language (5), new_Subject (5), provenance (5), xsd_convert (5), DataStructure (4), convert_column (3), publication_year (3), as_bibentry (2), as_dublincore (2), dots_number (2), geolocation (2), get_type (2), getdata (2), idcol_find (2), is_person (2), is.dataset (2), provenance_add (2), related_item_identifier (2), size (2), subject_create (2), version (2), as_datacite (1), as_dataset (1), as_dataset.data.frame (1), datacite (1), dataset_download (1), dataset_download_csv (1), dataset_prov (1), dataset_title_create (1), dataset_to_triples (1), dataset_ttl_write (1), datasource_get (1), datasource_set (1), DataStructure_update (1), describe (1), describe.dataset (1), dublincore (1), get_prefix (1), get_resource_identifier (1), head.dataset (1), id_to_column (1), initialise_dsd (1), is.datacite (1), is.datacite.datacite (1), is.dublincore (1), is.dublincore.dublincore (1), is.subject (1), new_datacite (1), new_dataset (1), new_dublincore (1), old_function (1), print.dataset (1), related_item (1), set_var_labels (1), set_var_labels.dataset (1)
assert_that (22)
bibentry (3), data (2), person (2), citation (1), object.size (1), read.csv (1), tail (1)
df (5), var (3), ar (1), family (1)
title (6)
base
dataset
assertthat
utils
stats
graphics
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:

- code in R (100% in 38 files)
- 1 author
- 12 vignettes
- 3 internal data files
- 4 imported packages
- 81 exported functions (median 7 lines of code)
- 117 non-exported functions in R (median 13 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by [the `checks_to_markdown()` function](https://docs.ropensci.org/pkgcheck/reference/checks_to_markdown.html). The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |    38|       92.7|           |
|files_vignettes          |    12|       99.6|           |
|files_tests              |    37|       98.6|           |
|loc_R                    |  1621|       79.9|           |
|loc_vignettes            |   805|       87.5|           |
|loc_tests                |   567|       77.3|           |
|num_vignettes            |    12|       99.9|TRUE       |
|data_size_total          |  3007|       64.7|           |
|data_size_median         |   578|       61.1|           |
|n_fns_r                  |   198|       89.7|           |
|n_fns_r_exported         |    81|       93.6|           |
|n_fns_r_not_exported     |   117|       86.6|           |
|n_fns_per_file_r         |     3|       55.0|           |
|num_params_per_fn        |     3|       33.6|           |
|loc_per_fn_r             |    11|       32.3|           |
|loc_per_fn_r_exp         |     7|       13.5|           |
|loc_per_fn_r_not_exp     |    13|       42.7|           |
|rel_whitespace_R         |    25|       85.3|           |
|rel_whitespace_vignettes |    36|       91.1|           |
|rel_whitespace_tests     |    28|       81.2|           |
|doclines_per_fn_exp      |    38|       47.0|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |   128|       83.0|           |

---
Click to see the interactive network visualisation of calls between objects in package
#### 3. `goodpractice` and other checks

#### 3a. Continuous Integration Badges

[![pkgcheck](https://github.com/dataobservatory-eu/dataset/workflows/pkgcheck/badge.svg)](https://github.com/dataobservatory-eu/dataset/actions) [![R-CMD-check.yaml](https://github.com/dataobservatory-eu/dataset/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/dataobservatory-eu/dataset/actions)

**GitHub Workflow Results**

|         id|name          |conclusion |sha    | run_number|date       |
|----------:|:-------------|:----------|:------|----------:|:----------|
| 7677839674|pkgcheck      |failure    |b1dca4 |        126|2024-01-27 |
| 7677839676|R-CMD-check   |failure    |b1dca4 |         46|2024-01-27 |
| 7677839673|test-coverage |failure    |b1dca4 |        129|2024-01-27 |

---

#### 3b. `goodpractice` results

#### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following check_fail:

1. no_description_date

#### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 78.97

#### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following function has cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
[[.dataset | 17

#### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 417 potential issues:

message | number of times
--- | ---
Avoid 1:length(...) expressions, use seq_len. | 1
Avoid 1:ncol(...) expressions, use seq_len. | 2
Avoid 1:nrow(...) expressions, use seq_len. | 3
Avoid library() and require() calls in packages | 23
Lines should not be more than 80 characters. | 384
unexpected symbol | 2
Use <-, not =, for assignment. | 2
:heavy_multiplication_x: The following 12 function names are duplicated in other packages:

- `dataset` from assemblerr, febr, robis
- `describe` from AzureVision, Bolstad2, describer, dlookr, explore, Hmisc, iBreakDown, ingredients, lambda.r, MSbox, onewaytests, prettyR, psych, psyntur, questionr, radiant.data, RCPA3, Rlab, scan, scorecard, sylly, tidycomm
- `description` from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
- `identifier` from Ramble
- `is.dataset` from crunch
- `language` from sylly, wakefield
- `provenance` from provenance
- `set_var_labels` from xpose
- `size` from acrt, BaseSet, container, crmPack, CVXR, datastructures, deal, disto, easyVerification, EFA.MRFA, flifo, gdalcubes, gWidgets2, hrt, iemisc, InDisc, kernlab, matlab2r, multiverse, optimbase, PopED, pracma, ramify, rEMM, rmonad, simplegraph, siren, tcltk2, UComp, unival, vampyr
- `subject` from DGM, emayili, gmailr, sendgridr
- `var_labels` from formatters, sjlabelled
- `version` from BiocManager, garma, geoknife, mice, R6DS, rerddap, rsyncrosim, shiny.info, SMFilter
|package  |version  |
|:--------|:--------|
|pkgstats |0.1.3.11 |
|pkgcheck |0.1.2.15 |
Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.
Hi @antaldaniel Since you mentioned "The new version (which is an entire rewrite since the first review)", we're going to treat this as a new submission and get new reviewers. Can you work on the 2 outstanding issues above while I look for a new editor?
Thanks @annakrystalli for the initial work!
@ldecicco-USGS thank you for the heads-up, and indeed, I will fix those issues.
Let me know when you've updated the package (or go ahead and rerun the "bot" command to check the package). Once we've got that taken care of, I'll assign a new editor. Thanks!
Submitting Author Name: Daniel Antal
Submitting Author Github Handle: @antaldaniel
Repository: https://github.com/dataobservatory-eu/dataset/
Version submitted: 0.1.7
Submission type: Standard
Editor: @annakrystalli
Reviewers: @msperlin, @romanflury
Due date for @msperlin: 2022-09-19
Due date for @romanflury: 2022-09-21
Archive: TBD
Version accepted: TBD
Language: en
You can find the package website on dataset.dataobservatory.eu. The article Motivation: Make Tidy Datasets Easier to Release, Exchange and Reuse will eventually be condensed into a JOSS paper. The package has a major development dilemma.
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences): Open science repositories and analyst computers are full of datasets that have no provenance, structural or referential metadata. We believe that metadata should be machine-recorded whenever possible, and should not be detached from an R object.
There are several R packages that have overlapping goals or functionality with `dataset`, but they follow a different philosophy. When exporting to different files, metadata should be written out at export time, but no sooner, and preferably into the file that contains the data.

Who is the target audience and what are scientific applications of this package?
This package is intended to give a common foundation to the rOpenGov reproducible research packages. It mainly serves communities that want to reuse statistical data (using the SDMX statistical (meta)data exchange sources, like Eurostat, IMF, World Bank, OECD...) or release new datasets from primary social sciences data that can be integrated into an SDMX-compatible API or placed on a knowledge graph. Our main aim is to provide a clear publication workflow to the European open science repository Zenodo, and clear serialization strategies for RDF applications.
The `dataset` package aims for a higher level of reproducibility, and does not detach the metadata from the R object's attributes (it is aimed to be used in other reproducible research packages that will directly record provenance and other transactional metadata into the attributes). We aim to bind together `dataspice` and `dataset` by creating export functions to csv files that contain the same metadata that dataspice records. Generally, dataspice seems to be better suited to raw, observational data, while dataset to statistically processed data.

The intended use of `dataset` is to correctly record referential, structural and provenance metadata retrieved by various reproducible science packages that interact with statistical data (such as the rOpenGov packages eurostat and iotables, or the oecd package).

Neither `dataset` nor `dataspice` is very suitable for documenting social sciences survey data, which are usually held in datasets. Our aim is to connect `dataset`, declared and DDIwR to create such datasets with DDI codebook metadata. They will create a stable new foundation for the retroharmonize package to create new, well-documented and harmonized statistical datasets from the observational datasets of social sciences surveys.

The zen4R package provides reproducible export functionality to the zenodo open science repository. Interacting with `zen4R` may be intimidating for the casual R user, as it uses R6 classes. Our aim is to provide an export function that completely wraps the workings of `zen4R` when releasing the dataset.

In our experience, while the tidy data standards make reuse more efficient by eliminating unnecessary data processing steps before analysis or placement in a relational database, the application of the DataSet definition and the datacube model with information science metadata standards makes reuse more efficient when exchanging and combining the data with other data in different datasets.
Yes
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Explain reasons for any `pkgcheck` items which your package is unable to pass.

Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
[x] Do you intend for this package to go on CRAN? -> Yes, I started the CRAN publication process, but opted to stop and get feedback from rOpenSci first.
[ ] Do you intend for this package to go on Bioconductor? -> Don't know.
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct