Closed · larnsce closed this issue 3 weeks ago
@mbannert: That's the workflow I am currently working off. This will be continuously revised with every iteration of a new R data package. I hope to automate some of these items, at least by preparing issue templates (#14).
- Create the package with `usethis::create_package()`
- `library(usethis)`, then `use_git()`
- Connect the repository to GitHub:

  ```sh
  git remote add origin URL
  git branch -M main
  git push -u origin main
  ```
- `library(devtools)`, then `use_data_raw()` to create the `data-raw/` subdirectory
- Write a `data_processing.R` file in the `data-raw/` folder
- Add a `dictionary.csv` to `data-raw/`
- Fill in `dictionary.csv` for each dataset and variable
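For illustration, a `dictionary.csv` might look like this (the column set and values below are a guess for the sketch, not a convention fixed anywhere in this thread):

```csv
directory,file_name,variable_name,variable_type,description
data,survey.rda,id,integer,Unique respondent identifier
data,survey.rda,water_source,character,Primary drinking water source
```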
- Use `usethis::use_r()` to document data in the `R/` folder using `{roxygen}` comments
- Once `dictionary.csv` is complete, use the `generate_roxygen_docs()` function of the `{openwashdata}` R package to document variables
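For reference, a dataset documented with roxygen comments in `R/` might look like this (the dataset name and variables are made up for the example):

```r
# R/survey.R -- hypothetical example of roxygen data documentation
#' Household survey data (illustrative example)
#'
#' @format A data frame with 2 variables:
#' \describe{
#'   \item{id}{Unique respondent identifier}
#'   \item{water_source}{Primary drinking water source}
#' }
"survey"
```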
- `usethis::use_package_doc()`
- Add dependencies to the `DESCRIPTION` file with `use_package("dplyr")` and `use_package("ggplot2", "Suggests")`
- `usethis::use_cc_by()` to add the license to the `DESCRIPTION` file
- Add `Language: en-GB` to `DESCRIPTION`
- Add a `CITATION.cff`
- Use `devtools` to load, document, check, and install:
  - `devtools::load_all()` ("Cmd + Shift + L")
  - `devtools::document()` ("Cmd + Shift + D")
  - `devtools::check()` ("Cmd + Shift + E")
  - `devtools::install()` ("Cmd + Shift + B")
- `usethis::use_readme_rmd()`, then `devtools::build_readme()`
- `usethis::use_article("examples")`, then `devtools::build_rmd("vignettes/articles/article.Rmd")`
- `usethis::use_github_action_check_standard()`?
- Set up `pkgdown` with `usethis::use_pkgdown()`
- Edit `_pkgdown.yml`
- `pkgdown::build_site()`
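A minimal `_pkgdown.yml` can be as small as this (the URL is a placeholder):

```yaml
url: https://openwashdata.github.io/mypackage/
template:
  bootstrap: 5
```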
@mbannert: I have updated the workflow to its current form.
The steps are now a numbered list, with roman numerals for the substeps, which are TODO boxes to tick in my note-taking tool (I use Roam Research).
Here is a screenshot of what that actually looks like:
I realise that there is a huge number of steps in there. I have highlighted the steps that I would like to outsource to openwashdata/book#8 with `{openwashdata}`. If you need a starting point, then look at those tasks.
I've worked on getting automated repo generation going and looked into GitHub. This led me to the following thoughts:
Here's a (less detailed) suggestion: I am not exactly sure how to integrate the pkgdown stage, but I am rather convinced of the rest.
```r
# create data package skeleton (stage 1)
# - basic usethis R package
# - support naming and conventions check
# - create README
# - add license
# - add .cff with author info
# - add basic data processing template file
# - add data_raw folder
# - echo next steps and end first stage, maybe save this status ?
# - add citation Info
# - maybe work with a README template /w placeholder, replaced later throughout
#   the process

# support data processing (stage 3)
# - helper functions to pivot data the right way
# - helper functions to extract meta data
# - generate overview of datasets in table, automatically add to README
# - set status publication ready
# statr

# fill pkgdown documentation (stage 4)

# git pushery (stage 4)
# - check status (saved to a .status.RData /.json)
# - check whether package can be built (devtools)
# - move R package to the openwashdata org
#   require RSA stuff
#   PAT

# how to organize repos and orgs
# openwashdata org contains
# - R package (pinned)
# - github.io page
# - pkgdown docs of the openwashdata package
# possibly have another organization 'openwashdata-submission'
# for incoming data
```
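The stage-1 items above could be sketched as one helper built on existing usethis functions. This is a hypothetical sketch: the wrapper name is made up, and the checks and status handling do not exist yet.

```r
# Hypothetical stage-1 wrapper -- a sketch, not an implemented function
create_data_package_skeleton <- function(path) {
  usethis::create_package(path)  # basic usethis R package
  usethis::use_data_raw()        # data-raw folder + processing template file
  usethis::use_readme_rmd()      # README (later: template with placeholders)
  usethis::use_cc_by()           # add license
  # TODO: naming/convention checks, .cff and citation info with author details,
  # echo next steps and save the stage status (e.g. to .status.json)
}
```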
Thanks, @mbannert. I like the structure along Stages. My feedback:
> create data package skeleton (stage 1)
> - support naming and conventions check

great idea!

> - add basic data processing template file

yes, that would be very helpful.

> - echo next steps and end first stage, maybe save this status ?

what does status mean? what can it do?

> - add .cff with author info
> - add citation Info

can this be done using a GH action workflow in combination with this gist to keep things updated?

> - maybe work with a README template /w placeholder, replaced later throughout the process

yes, that would be useful. Two recent examples:

What is Stage 2? The data submission process?

> support data processing (stage 3)
> - helper functions to extract meta data

I would like that. Just in the style that you have it for swissdata.

> - generate overview of datasets in table, automatically add to README

Great! Where reasonable, I would like to see all variables and descriptions listed. If it gets too large (let's say 3 dataframes, each with 10 or more variables, or a dataframe with 100 variables), then this should be moved to an article vignette (`usethis::use_article()`).
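One way such an overview could be rendered in the README (a sketch; it assumes the dictionary lives at `data-raw/dictionary.csv`):

```r
# In a README.Rmd chunk: show all variables and descriptions as a table
dictionary <- readr::read_csv("data-raw/dictionary.csv")
knitr::kable(dictionary)
```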
> - set status publication ready

Is that the same status option you list above?

> statr

The statR package? https://github.com/statistikZH/statR

> fill pkgdown documentation (stage 4)
> git pushery (stage 4)
> - check status (saved to a .status.RData /.json)

okay, I see now that the status has something to do with building the package?

> - check whether package can be built (devtools)

great!

> - move R package to the openwashdata org

that's assuming we have a separate organisation for data submission

> openwashdata org contains
> - R package (pinned)
> - github.io page
> - pkgdown docs of the openwashdata package
> possibly have another organization 'openwashdata-submission' for incoming data

I would prefer keeping data submission within the current openwashdata GH org and the established issue tracker.
Let's move the development part (the items you list above + book) out of this organisation. Suggested names for a new org:
Reasons:
I am fine with the org naming argument, but there is also no need to rush anything now. Let's rather move on for a bit before adding a new org. The main point here was to re-think our mid-term focus a bit and split the functionality that facilitates the submission process from those functions that help us put up curated packages. In other words, I want to first build functions that help us create data packages, clean up data, add citation, README, pkgdown etc., and push to GitHub -- no matter how the data come to us. Then, in a second step, when our new colleagues have started working and got used to publishing packages with our framework, I would address external data owners and their itches when submitting.
@mbannert
An update regarding the cffr package workflow. I haven't tested it yet, but this would reduce the need for:
"Write CITATION.cff and inst/CITATION with DOI entry using the gist or the respective `{openwashdata}` R package"
https://github.com/ropensci/cffr/issues/51#issuecomment-1523322283
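If I read the cffr docs correctly, that boils down to something like this (untested):

```r
library(cffr)
cff_obj <- cff_create()       # build CFF metadata from DESCRIPTION
cff_write(cff_obj)            # write CITATION.cff
cff_validate("CITATION.cff")  # validate against the CFF schema
```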
@mbannert: I taught a workshop on a reduced version of this workflow yesterday. @mianzg, @sebastian-loos, and two other GHE team members joined. I have written up the workflow as a tutorial for our GHE GitHub pages website. I am planning to do the same for the openwashdata community, alongside the additional functions that we will develop in our openwashdata package.
https://global-health-engineering.github.io/website/2-tutorials/data-package.html
@mbannert @larnsce Maybe you could point me to where I should put this script. I have started to write an R script to reduce the initialisation workflow of creating an R data package. It's currently not written in an R function style; I just run it with `Rscript init-pkg.R` on a terminal.
Current functionality:

- Set the `DESCRIPTION` title and description
- Add a CC-BY license by default
- Create the `data-raw` directory with a `data-processing.R` file
- Create `README.rmd` and remove bad lines (line 41 - end)

Consider to add:
I guess this fits in the `openwashdata` package for internal use. This should reduce many mouse clicks from our original tutorial.
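For the R folder, the script could be wrapped roughly like this (the function name and arguments are placeholders, not the actual script):

```r
#' Initialise an R data package (sketch)
init_pkg <- function(path, title, description) {
  usethis::create_package(
    path,
    fields = list(Title = title, Description = description)
  )
  usethis::use_cc_by()                      # CC-BY license by default
  usethis::use_data_raw("data-processing")  # data-raw/ with a processing script
  usethis::use_readme_rmd()                 # README.Rmd (then strip bad lines)
}
```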
Fantastic, thank you for the initiative.
Yes, add it to the `R` folder in the openwashdata R package. You can use perplexity.ai to write the documentation for you. It works pretty well, and then you can actually build and check the R package.
@mianzg: Let us continue discussing the best workflow for the dictionary.csv here.
One alternative idea I had is to actually use Google Sheets for the dictionary. You could provide a public link to the person that needs to add their variables and grant public editing rights. Share that link with them, and then read in the variable names and descriptions via the googlesheets4 package. Once the person is done adding their descriptions, or you are done collaborating with them on it, you could change permissions back to editing rights only for us.
People won't edit a CSV. I also find it difficult to edit a plain CSV opened with MS Excel.
I don't think we should provide a public Google Sheet link in a GitHub issue, as that potentially allows anyone to edit. But sharing the link privately converts this approach back to email communication again. Otherwise, I think Google Sheet editing has a lot of potential for getting collaborators to work on it.
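For what it's worth, reading the dictionary from a shared sheet with googlesheets4 would look roughly like this (the sheet URL is a placeholder):

```r
library(googlesheets4)
gs4_deauth()  # link-readable public sheet, no login required
dictionary <- read_sheet("https://docs.google.com/spreadsheets/d/SHEET_ID")
```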
@mbannert @mianzg @sebastian-loos I had previously used the dataspice package to prepare data publications. They were not published as R data packages, but had a lot of great elements that we could re-use.
I am thinking particularly of the `write_spice()` function, which writes metadata from a set of CSVs into JSON-LD:
https://docs.ropensci.org/dataspice/reference/write_spice.html
Package: https://docs.ropensci.org/dataspice/
We should review this workflow and adapt some of it to our own needs.
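From the dataspice docs, the basic flow appears to be (an untested sketch):

```r
library(dataspice)
create_spice()  # scaffold the four metadata CSVs under data/metadata/
# fill in attributes.csv, access.csv, biblio.csv, creators.csv
# (interactively via edit_attributes(), edit_access(), etc.)
write_spice()   # combine them into a JSON-LD file, dataspice.json
```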
rOpenSci is always a good outlet for packages. I'll check it, thanks. If it does not include too many dependencies, I am convinced...
@margauxgo @mianzg: I have documented the workflow I have in my notebook as a separate repository:
https://github.com/openwashdata-dev/workflow/blob/main/docs/index.md
Deleted this content because it's outdated. See here for the updated workflow: https://github.com/openwashdata/book/issues/13#issuecomment-1516270108