Deleted this content because it's outdated. See here for updated worklow: https://github.com/openwashdata/book/issues/13#issuecomment-1516270108

@mbannert: That's the workflow I am currently working off. This will be continously revised with every iteration of a new R data package. I hope to automate some of these items, at least by preparing issue templates (#14).

Email (probably)
- To get started, please
  - get an account on GitHub: https://github.com/
  - open an issue for your data, so that we can communicate on using GitHub: https://github.com/openwashdata/data/issues
  - think about a name for your data package. The name should:
    - have all small letters
    - have no spaces or dashes
    - be a combination of two to three words
    - identify location and/or theme/topic
Open GitHub
1. If data donator has not opened an issue on openwashdata/data issue tracker, do it yourself
2. Decide on a name for the repository and corresponding R data package
3. Create a new repository with the following settings
  - Public
  - Do not add a README
  - Do not add a .gitignore
  - Do not add a LICENSE
4. Invite the contributor as a collaborator on this repository
5. Inform contributor that they need to accept the invitation to contribute (probably by Email)
Open RStudio IDE
1. Check if R Packages [[{devtools}]] and [[{usethis}]] are installed. Install, if they are not.
2. Create a new project using the R Package [[{devtools}]] template and the same name you used on GitHub
  - If project folder already exists, open it and use
    - usethis::create_package()
  - If project folder does not yet exist, use
    - File -> New Project -> New Directory -> R Package using devtools -> Choose directory name and location of sub-directory
3. Add git version control to local directory
  - In Console, execute
    - library(usethis)
    - use_git()
      - yes, commit
      - yes, restart
4. Connect remote repository on GitHub with local repository
  - git remote add origin URL
  - git branch -M main
  - git push -u origin main
5. Add directory for raw-data to project
  - In Console, execute
    - library(devtools)
    - use_data_raw()
      - This will create a data-raw/ subdirectory.
        
        contains a DATASET.R file
        
        rename to data_processing.R
6. Add, commit and push all changes to GitHub
7. On GitHub, open issue 1 for adding data to data-raw/ folder
  - use an issue template (example: https://github.com/openwashdata/cbssuitabilityhaiti/issues/1#issue-1657507593)
8. Prepare import of data and export of tidy raw data in data_processing.R file
  - At the end of file add export for CSV and XLSX
    - usethis::use_data(DATASET, overwrite = TRUE)
    - fs::dir_create(here::here("inst", "extdata"))
    - write_csv(DATASET, here::here("inst", "extdata", "DATASET.csv"))
    - openxlsx::write.xlsx(DATASET, here::here("inst", "extdata", "DATASET.xlsx"))
9. Add dictionary.csv to data-raw/ with columns:
  - directory, file_name, variable_name, variable_type, description
10. Once data reaches tidy state, fill dictionary.csv for each dataset and variable
11. Add, commit and push all changes to GitHub
12. On GitHub, open issue 3 for cross-checking in with data donator for correct understanding of variables in dictionary.csv
13. Initiate documentation folder for writing up metadata and documentation for objects
  - Create new folder R
    - usethis::use_r
14. Write documentation in \R folder using #[[{roxygen}]] comments
  - If dictionary.csv is complete use function generate_roxygen_docs of [[{openwashdata}]] R package to document variables
    - https://github.com/openwashdata/openwashdata/blob/main/R/generate_roxygen_docs.R
  - If data documentation name is the same as package name, then add the following to the *-package.R file
    - ' @aliases testduplicate-package
  - Example
    - https://github.com/Global-Health-Engineering/durbanplasticwaste22/blob/main/R/litterboom_counts.R
    - Object has to be quoted at the end of roxygen comments
  - Resource
    - https://r-pkgs.org/data.html#sec-documenting-data
15. Add an additional package documentation to Package
  - usethis::use_package_doc()
  - Resource
    - https://roxygen2.r-lib.org/articles/rd-other.html#datasets
16. Add, commit and push all changes to GitHub
  - On GitHub, setup issue 4 with details to write up DESCRIPTION file
    - Template
    - List
      - Title
        
        make this title short, not the title of the thesis
      - Description
        
        Brief and to the point describing what's in the data
      - Contributors (name, email, role, ORCID)
        
        Include everyone here
        
        Roles
        
        cre = maintainer
        
        aut = significant contributions
        
        ctb = contributor with smaller contributios
    - Resource
      - https://r-pkgs.org/description.html
17. Add dependencies (not optional if vignettes are used)
  - use_package("dplyr")
  - use_package("ggplot2", "Suggests")
18. Add license
  - usethis::use_cc_by()
19. Complete DESCRIPTION file
  - Add
    - Language: en-GB
20. Add CITATION.cff
  - use the [[{cffr}]] Package once the Description file is complete
  - current #gist to be turned into function for [[{openwashdata}]] R package
    - https://gist.github.com/larnsce/dccdb26762837618c6dda82a5614b584
    - needs doi paramater, which will only be used after first release to Zenodo
21. Use devtools to load, document, check, and install
  - Use keyboard shortcuts
    - devtools::load_all() "Cmd + Shift + L"
    - devtools::document() "Cmd + Shift + D"
    - devtools::check() "Cmd + Shift + E"
    - devtools::install() "Cmd + Shift + B"
22. Create a rmd README for package
  - usethis::use_readme_rmd()
    - Outline template
      - https://github.com/Global-Health-Engineering/durbanplasticwaste22/blob/main/README.Rmd
    - Write [[{openwashdata}]] R function to generate download table from dictionary.csv
      - read_csv("data-raw/dictionary.csv") |>
        
        distinct(file_name) |>
        
        mutate(file_name = str_remove(file_name, ".rda")) |>
        
        rename(dataset = file_name) |>
        
        mutate(
        
        CSV = paste0("[Download CSV](", extdata_path, dataset, ".csv)"),
        
        XLSX = paste0("[Download XLSX](", extdata_path, dataset, ".xlsx)")
        
        ) |>
        
        knitr::kable()
  - devtools::build_readme()
23. Add, commit and push all changes to GitHub
  - On GitHub, open issue 5 to define who writes up which parts of the README
24. Create an examples article for the package
  - usethis::use_article("examples")
  - devtools::build_rmd("vignettes/articles/article.Rmd")
  - Resources
    - https://r-pkgs.org/vignettes.html#sec-vignettes-workflow-writing
  - Prepare one or two data visualisation or table examples
25. Add formal dependencies from Vignette (not necessary for article vignette?)
  - Any package used in a vignette must be a formal dependency, i.e. it must be listed in Imports or Suggests in DESCRIPTION
    - https://r-pkgs.org/vignettes.html#sec-vignettes-eval-option
26. Use devtools to load, document, check, and install
  - Use keyboard shortcuts
    - devtools::load_all() "Cmd + Shift + L"
    - devtools::document() "Cmd + Shift + D"
    - devtools::check() "Cmd + Shift + E"
    - devtools::install() "Cmd + Shift + B"
27. Add automated CMD BUILD check
  - usethis::use_github_action_check_standard()?
    - checks build for Mac, Windows, Linux
28. Create new branch
  - pkgdown
29. Setup pkgdown configuration and github actions
  - usethis::use_pkgdown
  - open _pkgdown.yml
    - add github pages URL
    - add plausible script (plausible for openwashdata to be setup)
      - template:
        
        bootstrap: 5
        
        includes:
        
        in_header: |
30. Build pkgdown website
  - pkgdown::build_site()
31. Add, commit and push all changes to GitHub
32. Edit Home Index
  - use this as template: https://github.com/Global-Health-Engineering/durbanplasticwaste/blob/main/_pkgdown.yml
    - consider writing pkgdown template for this: https://github.com/openwashdata/openwashdata/issues/8
      - https://github.com/tidyverse/tidytemplate
  - https://pkgdown.r-lib.org/articles/pkgdown.html#home-page
  - explore: https://pkgdown.r-lib.org/articles/customise.html
Open Zenodo
- login with GitHub account
- click on dropdown next to email address in top right
  - select GitHub
  - find the repository in the list
  - and flip switch to "ON"
  - click on repo link
- create release v0.0.1 on GitHub
  - initial package releas
- Get the DOI Badge
- Edit the main page and remove text under Additional notes
Open RStudio IDE
- Add DOI badge to repository README.md
- Edit DESCRIPTION
  - edit version number
- Write CITATION.cff and inst/CITATION with DOI entry using #gist or respective [[{openwashdata}]] R Package
  - https://gist.github.com/larnsce/dccdb26762837618c6dda82a5614b584
- pkgdown::build_site()
Open ETH Research Collection (not for openwashdata, but GHE workflow)
- research data -> dataset
- organisational unit
  - tilley
- license
  - creative commons attribution 4.0 International
Common items to be fixed
- R CMD CHECK
  - Notes
    - Non-standard files/directories found at top level:
      - Fix: https://stackoverflow.com/questions/48955103/non-standard-file-directory-found-at-top-level-readme-rmd-persists-even-after

@mbannert: I have updated the workflow to it's current form.

The steps are now a numbered list with roman numerals for the substeps that are TODO boxes to tick in my note-taking tool. I use Roam Research

Here is a screenshot of what that actually looks like:

Screenshot 2023-04-20 at 14 51 21

I realise that there is a huge number of steps in there. I have highlighted the steps that I would like outsource to openwashdata/book#8 with [[{openwashdata}]]. If you need a starting point, then look those tasks.

I've worked to get automated repo generation going and looked into the GitHub. This lead me to the following thoughts:

the current workflow is very monolithic.
Given that you said we do the data work for now we have two audiences
- data submitters
- openwashdata moderators
let's focus on the latter to make their life easier for now
seperate the submission and curation process.

Here's a (less detailed) suggestion: Not exactly sure how to integrate the pkgdown stage, but rather convinced of the rest.

# create data package skeleton (stage 1)
# - basic usethis R package
# - support naming and conventions check
# - create README
# - add license
# - add .cff with author info
# - add basic data processing template file
# - add data_raw folder
# - echo next steps and end first stage, maybe save this status ?
# - add citation Info
# - maybe work with a README template /w placeholder, replaced later throughout
# the process

# support data processing (stage 3)
# - helper functions to pivot data the right way
# - helper functions to extract meta data
# - generate overview of datasets in table, automatically add to README
# - set status publication ready
# statr

# fill pkgdown documentation (stage 4)

# git pushery (stage 4)
# - check status (saved to a .status.RData /.json)
# - check whether package can be built (devtools)
# - move R package to the openwashdata org
# require RSA stuff
# PAT

# how to organize repos and orgs
# openwashdata org contains
# - R package (pinned)
# - github.io page
# - pkgdown docs of the openwashdata package
# possibly have another organization 'openwashdata-submission'
# for incoming data

Thanks, @mbannert. I like the structure along Stages. My feedback:

Stage 1

create data package skeleton (stage 1)

support naming and conventions check

great idea!

add basic data processing template file

yes, that would be very helpful.

echo next steps and end first stage, maybe save this status ?

what does status mean? what can it do?

add .cff with author info

add citation Info

can this be done using a GH action workflow in combination with this gist to keep things updated?

maybe work with a README template /w placeholder, replaced later throughout the process

yes, that would be useful. Two recent examples:

Stage 2

what is Stage 2? The data submission process?

Stage 3

support data processing (stage 3)

helper functions to extract meta data

I would that. Just in the style that you have it for swissdata.

generate overview of datasets in table, automatically add to README

Great! Where reasonable, I would like to see all variables and descriptions listed. If it gets too large (let's say 3 dataframes, each with 10 or more variables, or a dataframe with 100 variables), then this should be moved to a article vignette (usethis::use_article()).

set status publication ready

Is that the same status option you list above?

statr

statR Package? https://github.com/statistikZH/statR

ggplot2 themes and colour palettes along a common openwashdata.org theme (WIP) would be awesome

Stage 4

fill pkgdown documentation (stage 4)

See: openwashdata/openwashdata#8

Stage 5 (?)

git pushery (stage 4)

check status (saved to a .status.RData /.json)

okay, I see now that the status has something to do with building the package?

check whether package can be built (devtools)

great!

move R package to the openwashdata org

that's assuming we have a separate organisation for data submission

how to organize repos and orgs

openwashdata org contains

R package (pinned)

github.io page

pkgdown docs of the openwashdata package

possibly have another organization 'openwashdata-submission for incoming data

I would prefer keeping data submission within the current openwashdata GH org and established issue tracker.

Let's move the development part (the items you list above + book) out of this organisations. Suggested names for new org:

openwashdata-dev
dev-openwashdata

Reasons:

I like the "clean" name for the "community" and hope this GH org will develop into an active space for data submission.
I have already refered to it and linked to it in many places
It feels simpler to move our dev repos rather then moving all the data repos
Data "never" moves and issue tracker can be used centrally in one place to communicate with "data submitters"

I am fine with the org naming argument, but also no need to rush anything now. Let's rather move on for a bit before adding a new org. The main point here was to re-think our mid term focus a bit and split the submission process facilitating functionality a bit from those functions that help us putting up curated packages. In other words, I want to build functions first that help us create data packages, clean up data, add citation and README, pkgdown etc. and push to GitHub -- no matter how the data come to us. Then, in a second step when our new colleagues started working and getting used to publishing packages with our framework, I would address external data owners and their itches when submitting.

@mbannert

An update regarding the cffr package workflow. Haven't tested it yet, but this would reduce the need for:

"Write CITATION.cff and inst/CITATION with DOI entry using #gist or respective [[{openwashdata}]] R Package"

https://github.com/ropensci/cffr/issues/51#issuecomment-1523322283

@mbannert: I have taught a workshop for a reduced version of this workflow yesterday. @mianzg, @sebastian-loos, and two other GHE team members joined. I have written up the workflow as a tutorial for our GHE GitHub pages website. I am planning to do the same for the openwashdata community alongside with the additional functions that we will develop in our openwashdata package.

https://global-health-engineering.github.io/website/2-tutorials/data-package.html

@mbannert @larnsce Maybe you could point where I should put this script. I have started to write an R script to reduce the initialisation workflow of creating R data package. It's currently not in an R function style.

I just run it with command line Rscript init-pkg.R on a terminal

Current functionality:

Interact in console to edit DESCRIPTION title and description. Add CC-BY license by default
Create data-raw directory with a data-processing.R file
Create a README.rmd and remove bad lines (line 41 - end)

Consider to add:

Git initialization
Author information
Citation file

I guess this fits in the openwashdata package for internal use. This should reduce many mouse clicks from our original tutorial.

Fantastic, thank you for the initiative.

Yes, add it to the R folder in openwashdata R package. You can use perplexity.ai to write the documentation for you. Works pretty well and then you can actually build and check the R package.

@mianzg: Let us continue the best workflow for the dictionary.csv here.

One alternative idea I had is to actually use Google Sheets for the dictionary. You could provide a public link to the person that needs to add their variables and actually provide public editing rights. Share that link with them, and then read in the variable_names and description via googlsheets4 package. Once the person is done adding their descriptions or you are done collaborating with them on it, you could change permissions back to editing rights only for us.

People won't edit a CSV. I also find it difficult to edit a plain CSV and open it with MS Excel.

@mianzg: Let us continue the best workflow for the dictionary.csv here.

One alternative idea I had is to actually use Google Sheets for the dictionary. You could provide a public link to the person that needs to add their variables and actually provide public editing rights. Share that link with them, and then read in the variable_names and description via googlsheets4 package. Once the person is done adding their descriptions or you are done collaborating with them on it, you could change permissions back to editing rights only for us.

People won't edit a CSV. I also find it difficult to edit a plain CSV and open it with MS Excel.

I don't think we should provide a public googlesheet link in a github issue. This potentially allows anyone to edit. Then this converts this approach back to email communication again. Otherwise, I think googlesheet editing is very potential to get collaborators work on it.

@mbannert @mianzg @sebastian-loos I had previously used the dataspice package to prepare data publications. They were not published as R data packages, but had a lot of great elements that we could re-use.

I am thinking particularly of the write_spice() function, which writes metadata from a set of CSVs into the JSON-LD:

https://docs.ropensci.org/dataspice/reference/write_spice.html

Package:

https://docs.ropensci.org/dataspice/

We should review this workflow and adapt some of it to our own needs.

Opensci is always a good outlet for packages. I'll check it thanks. If it does not include to many dependencies I am convinced...

@margauxgo @mianzg: I have documented the workflow I have in my notebook as a separate repository:

https://github.com/openwashdata-dev/workflow/blob/main/docs/index.md

openwashdata-dev / book

Write up complete R data package workflow #13