openwashdata-dev / book


Write up complete R data package workflow #13

Closed · larnsce closed this 3 weeks ago

larnsce commented 1 year ago

Deleted this content because it's outdated. See here for the updated workflow: https://github.com/openwashdata/book/issues/13#issuecomment-1516270108

larnsce commented 1 year ago

@mbannert: That's the workflow I am currently working off. It will be continuously revised with every iteration of a new R data package. I hope to automate some of these items, at least by preparing issue templates (#14).

larnsce commented 1 year ago
  1. Email (probably)
    • To get started, please
      • get an account on GitHub: https://github.com/
      • open an issue for your data, so that we can communicate on using GitHub: https://github.com/openwashdata/data/issues
      • think about a name for your data package. The name should:
        • have all small letters
        • have no spaces or dashes
        • be a combination of two to three words
        • identify location and/or theme/topic
  2. Open GitHub
    1. If the data donor has not opened an issue on the openwashdata/data issue tracker, do it yourself
    2. Decide on a name for the repository and corresponding R data package
    3. Create a new repository with the following settings
      • Public
      • Do not add a README
      • Do not add a .gitignore
      • Do not add a LICENSE
    4. Invite the contributor as a collaborator on this repository
    5. Inform the contributor that they need to accept the invitation to contribute (probably by email)
  3. Open RStudio IDE
    1. Check if the R packages [[{devtools}]] and [[{usethis}]] are installed. Install them if they are not.
    2. Create a new project using the R Package [[{devtools}]] template and the same name you used on GitHub
      • If project folder already exists, open it and use
        • usethis::create_package()
      • If project folder does not yet exist, use
        • File -> New Project -> New Directory -> R Package using devtools -> Choose directory name and location of sub-directory
    3. Add git version control to local directory
      • In Console, execute
        • library(usethis)
        • use_git()
          • yes, commit
          • yes, restart
    4. Connect remote repository on GitHub with local repository
      • git remote add origin URL
      • git branch -M main
      • git push -u origin main
    5. Add directory for raw-data to project
      • In Console, execute
        • library(devtools)
        • use_data_raw()
          • This will create a data-raw/ subdirectory.
            • contains a DATASET.R file
              • rename to data_processing.R
    6. Add, commit and push all changes to GitHub
    7. On GitHub, open issue 1 for adding data to data-raw/ folder
    8. Prepare the import of data and the export of tidy data in the data_processing.R file (a consolidated sketch follows at the end of this list)
      • At the end of the file, add the export for CSV and XLSX
        • usethis::use_data(DATASET, overwrite = TRUE)
        • fs::dir_create(here::here("inst", "extdata"))
        • write_csv(DATASET, here::here("inst", "extdata", "DATASET.csv"))
        • openxlsx::write.xlsx(DATASET, here::here("inst", "extdata", "DATASET.xlsx"))
    9. Add dictionary.csv to data-raw/ with columns:
      • directory, file_name, variable_name, variable_type, description
    10. Once the data reaches a tidy state, fill in dictionary.csv for each dataset and variable (a bootstrap sketch follows at the end of this list)
    11. Add, commit and push all changes to GitHub
    12. On GitHub, open issue 3 for cross-checking with the data donor that the variables in dictionary.csv are correctly understood
    13. Initiate the documentation folder for writing up metadata and documentation for objects
      • Create the new folder R/
        • usethis::use_r()
    14. Write the documentation in the R/ folder using #[[{roxygen}]] comments (a roxygen2 sketch follows at the end of this list)
    15. Add package-level documentation to the package (e.g., with usethis::use_package_doc())
    16. Add, commit and push all changes to GitHub
      • On GitHub, set up issue 4 with details to write up the DESCRIPTION file (an Authors@R sketch follows at the end of this list)
        • Template
        • List
          • Title
            • make this title short, not the title of the thesis
          • Description
            • Brief and to the point describing what's in the data
          • Contributors (name, email, role, ORCID)
            • Include everyone here
            • Roles
              • cre = maintainer
              • aut = significant contributions
              • ctb = contributor with smaller contributions
        • Resource
    17. Add dependencies (required if vignettes are used)
      • use_package("dplyr")
      • use_package("ggplot2", "Suggests")
    18. Add license
      • usethis::use_cc_by()
    19. Complete DESCRIPTION file
      • Add
        • Language: en-GB
    20. Add CITATION.cff
    21. Use devtools to load, document, check, and install
      • Use keyboard shortcuts
        • devtools::load_all() "Cmd + Shift + L"
        • devtools::document() "Cmd + Shift + D"
        • devtools::check() "Cmd + Shift + E"
        • devtools::install() "Cmd + Shift + B"
    22. Create an Rmd README for the package
      • usethis::use_readme_rmd()
        • Outline template
        • Write an [[{openwashdata}]] R function to generate the download table from dictionary.csv
          • # extdata_path: URL prefix to inst/extdata on GitHub, defined earlier in the README
          • read_csv("data-raw/dictionary.csv") |>
          •   distinct(file_name) |>
          •   mutate(file_name = str_remove(file_name, "\\.rda$")) |>
          •   rename(dataset = file_name) |>
          •   mutate(
          •     CSV = paste0("[Download CSV](", extdata_path, dataset, ".csv)"),
          •     XLSX = paste0("[Download XLSX](", extdata_path, dataset, ".xlsx)")
          •   ) |>
          •   knitr::kable()
      • devtools::build_readme()
    23. Add, commit and push all changes to GitHub
      • On GitHub, open issue 5 to define who writes up which parts of the README
    24. Create an examples article for the package (usethis::use_article())
    25. Add formal dependencies from the vignette (not necessary for an article vignette?)
    26. Use devtools to load, document, check, and install
      • Use keyboard shortcuts
        • devtools::load_all() "Cmd + Shift + L"
        • devtools::document() "Cmd + Shift + D"
        • devtools::check() "Cmd + Shift + E"
        • devtools::install() "Cmd + Shift + B"
    27. Add an automated R CMD check via GitHub Actions
      • usethis::use_github_action_check_standard()?
        • checks build for Mac, Windows, Linux
    28. Create new branch
      • pkgdown
    29. Set up the pkgdown configuration and GitHub Actions
      • usethis::use_pkgdown()
      • open _pkgdown.yml
        • add the GitHub Pages URL
        • add the plausible script (plausible for openwashdata still to be set up)
          • template:
            • bootstrap: 5
            • includes:
              • in_header: |
    30. Build pkgdown website
      • pkgdown::build_site()
    31. Add, commit and push all changes to GitHub
    32. Edit the home index page of the pkgdown site
  4. Open Zenodo
    • login with GitHub account
    • click on dropdown next to email address in top right
      • select GitHub
      • find the repository in the list
      • and flip switch to "ON"
      • click on repo link
    • create release v0.0.1 on GitHub
      • initial package release
    • Get the DOI Badge
    • Edit the main page and remove text under Additional notes
  5. Open RStudio IDE
  6. Open ETH Research Collection (not for openwashdata, but GHE workflow)
    • research data -> dataset
    • organisational unit
      • tilley
    • license
      • Creative Commons Attribution 4.0 International
  7. Common items to be fixed
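
For reference, here are minimal sketches for some of the steps above. First, the data_processing.R file from step 8; the file and object names (DATASET, DATASET_raw.csv) are placeholders:

```r
# data-raw/data_processing.R -- sketch only
library(readr)

# import the raw data added to data-raw/ (issue 1)
DATASET <- read_csv("data-raw/DATASET_raw.csv")

# ... tidy the data here ...

# export the tidy data as .rda for the package and as CSV/XLSX for download
usethis::use_data(DATASET, overwrite = TRUE)
fs::dir_create(here::here("inst", "extdata"))
write_csv(DATASET, here::here("inst", "extdata", "DATASET.csv"))
openxlsx::write.xlsx(DATASET, here::here("inst", "extdata", "DATASET.xlsx"))
```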
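Next, a bootstrap sketch for the dictionary.csv of steps 9-10, assuming the DATASET object from data_processing.R is still in memory:

```r
# sketch: write a dictionary.csv skeleton for one dataset
library(readr)

dictionary <- tibble::tibble(
  directory = "data/",
  file_name = "DATASET.rda",
  variable_name = names(DATASET),
  variable_type = vapply(DATASET, function(x) class(x)[1], character(1)),
  description = ""  # filled in by hand, together with the data donor (issue 3)
)

write_csv(dictionary, "data-raw/dictionary.csv")
```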
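Then, a roxygen2 sketch for step 14; the title, description, and item entries are placeholders:

```r
# R/DATASET.R -- sketch of dataset documentation
#' Placeholder title for the DATASET dataset
#'
#' One or two sentences describing what the data contain and where they
#' come from.
#'
#' @format A data frame with X rows and Y variables:
#' \describe{
#'   \item{id}{Placeholder: unique identifier of the observation}
#'   \item{value}{Placeholder: measured value}
#' }
"DATASET"
```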
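Finally, an Authors@R sketch for the DESCRIPTION file of step 16; the names, email, and ORCID iD are placeholders:

```r
# DESCRIPTION (excerpt) -- Authors@R uses R's person() syntax
Authors@R: c(
    person("Jane", "Doe", email = "jane.doe@example.org",
           role = c("aut", "cre"),
           comment = c(ORCID = "0000-0000-0000-0000")),
    person("John", "Smith", role = "ctb")
  )
```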
larnsce commented 1 year ago

@mbannert: I have updated the workflow to its current form.

The steps are now a numbered list, with roman numerals for the substeps, which are TODO boxes to tick in my note-taking tool. I use Roam Research.

Here is a screenshot of what that actually looks like:

[Screenshot: Roam Research outline of the workflow, 2023-04-20 14:51]

I realise that there is a huge number of steps in there. I have highlighted the steps that I would like to outsource to openwashdata/book#8 with [[{openwashdata}]]. If you need a starting point, then look at those tasks.

mbannert commented 1 year ago

I've worked to get automated repo generation going and looked into GitHub. This led me to the following thoughts:

Here's a (less detailed) suggestion: I'm not exactly sure how to integrate the pkgdown stage, but I'm rather convinced of the rest.

# create data package skeleton (stage 1)
# - basic usethis R package
# - support naming and conventions check
# - create README
# - add license
# - add .cff with author info
# - add basic data processing template file
# - add data_raw folder
# - echo next steps and end first stage, maybe save this status ?
# - add citation Info
# - maybe work with a README template /w placeholder, replaced later throughout
# the process

# support data processing (stage 3)
# - helper functions to pivot data the right way
# - helper functions to extract meta data
# - generate overview of datasets in table, automatically add to README
# - set status publication ready
# statr

# fill pkgdown documentation (stage 4)

# git pushery (stage 4)
# - check status (saved to a .status.RData /.json)
# - check whether package can be built (devtools)
# - move R package to the openwashdata org
# require RSA stuff
# PAT

# how to organize repos and orgs
# openwashdata org contains
# - R package (pinned)
# - github.io page
# - pkgdown docs of the openwashdata package
# possibly have another organization 'openwashdata-submission'
# for incoming data
larnsce commented 1 year ago

Thanks, @mbannert. I like the structure along Stages. My feedback:

Stage 1

create data package skeleton (stage 1)

  • support naming and conventions check

great idea!

  • add basic data processing template file

yes, that would be very helpful.

  • echo next steps and end first stage, maybe save this status ?

what does status mean? what can it do?

  • add .cff with author info
  • add citation Info

can this be done using a GH action workflow in combination with this gist to keep things updated?

  • maybe work with a README template /w placeholder, replaced later throughout the process

yes, that would be useful. Two recent examples:

Stage 2

what is Stage 2? The data submission process?

Stage 3

support data processing (stage 3)

  • helper functions to extract meta data

I would like that. Just in the style that you have it for swissdata.

  • generate overview of datasets in table, automatically add to README

Great! Where reasonable, I would like to see all variables and descriptions listed. If it gets too large (let's say 3 dataframes, each with 10 or more variables, or a dataframe with 100 variables), then this should be moved to an article vignette (usethis::use_article()).

  • set status publication ready

Is that the same status option you list above?

statr

statR Package? https://github.com/statistikZH/statR

Stage 4

fill pkgdown documentation (stage 4)

Stage 5 (?)

git pushery (stage 4)

  • check status (saved to a .status.RData /.json)

okay, I see now that the status has something to do with building the package?

  • check whether package can be built (devtools)

great!

  • move R package to the openwashdata org

that's assuming we have a separate organisation for data submission

how to organize repos and orgs

openwashdata org contains

  • R package (pinned)
  • github.io page
  • pkgdown docs of the openwashdata package

possibly have another organization 'openwashdata-submission' for incoming data

I would prefer keeping data submission within the current openwashdata GH org and established issue tracker.

Let's move the development part (the items you list above + book) out of this organisation. Suggested names for new org:

Reasons:

mbannert commented 1 year ago

I am fine with the org naming argument, but there is also no need to rush anything now. Let's rather move on for a bit before adding a new org. The main point here was to re-think our mid-term focus a bit and split the functionality that facilitates the submission process from those functions that help us put up curated packages. In other words, I want to first build functions that help us create data packages, clean up data, add citation and README, pkgdown etc., and push to GitHub -- no matter how the data come to us. Then, in a second step, when our new colleagues have started working and got used to publishing packages with our framework, I would address external data owners and their itches when submitting.

larnsce commented 1 year ago

@mbannert

An update regarding the cffr package workflow. I haven't tested it yet, but this would reduce the need for:

"Write CITATION.cff and inst/CITATION with DOI entry using #gist or respective [[{openwashdata}]] R Package"

https://github.com/ropensci/cffr/issues/51#issuecomment-1523322283
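
If I read the cffr docs right, the workflow would essentially be a single call from the package root:

```r
# sketch: generate CITATION.cff from the package's DESCRIPTION with cffr
# install.packages("cffr")
library(cffr)
cff_write()  # writes CITATION.cff in the package root
```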

larnsce commented 1 year ago

@mbannert: Yesterday I taught a workshop on a reduced version of this workflow. @mianzg, @sebastian-loos, and two other GHE team members joined. I have written up the workflow as a tutorial for our GHE GitHub Pages website. I am planning to do the same for the openwashdata community, alongside the additional functions that we will develop in our openwashdata package.

https://global-health-engineering.github.io/website/2-tutorials/data-package.html

mianzg commented 1 year ago

@mbannert @larnsce Maybe you could point to where I should put this script. I have started to write an R script to reduce the initialisation workflow of creating an R data package. It is currently not written in an R function style.

I just run it from the command line with Rscript init-pkg.R in a terminal.

Current functionality:

Consider adding:

I guess this fits in the openwashdata package for internal use. This should save many mouse clicks compared with our original tutorial. For illustration, a minimal sketch of what such a script could contain is below.
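
A sketch only; the package name and path are placeholders, and the real init-pkg.R may differ:

```r
# init-pkg.R -- sketch of an initialisation script, run as: Rscript init-pkg.R
library(usethis)

pkgname <- "examplepkg"  # placeholder package name
path <- file.path("~", "projects", pkgname)  # placeholder location

create_package(path, open = FALSE)
proj_set(path)

use_git()
use_data_raw(name = "data_processing", open = FALSE)
use_cc_by()
use_readme_rmd(open = FALSE)
```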

larnsce commented 1 year ago

Fantastic, thank you for the initiative.

Yes, add it to the R folder in the openwashdata R package. You can use perplexity.ai to write the documentation for you. It works pretty well, and then you can actually build and check the R package.

larnsce commented 1 year ago

@mianzg: Let us continue discussing the best workflow for the dictionary.csv here.

One alternative idea I had is to actually use Google Sheets for the dictionary. You could provide a public link to the person that needs to add their variables, with public editing rights. Share that link with them, and then read in the variable_name and description columns via the googlesheets4 package. Once the person is done adding their descriptions, or you are done collaborating with them on it, you could change the permissions back to editing rights only for us.

People won't edit a CSV. I also find it difficult to edit a plain CSV and to open it with MS Excel.
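
For reference, reading the sheet back would only take a couple of lines with googlesheets4 (the sheet URL is a placeholder):

```r
# sketch: read the shared dictionary sheet and store it as dictionary.csv
library(googlesheets4)
library(readr)

gs4_deauth()  # no login needed for a publicly readable sheet
dict <- read_sheet("https://docs.google.com/spreadsheets/d/<SHEET_ID>")
write_csv(dict, "data-raw/dictionary.csv")
```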

mianzg commented 1 year ago

> @mianzg: Let us continue discussing the best workflow for the dictionary.csv here.
>
> One alternative idea I had is to actually use Google Sheets for the dictionary. You could provide a public link to the person that needs to add their variables, with public editing rights. Share that link with them, and then read in the variable_name and description columns via the googlesheets4 package. Once the person is done adding their descriptions, or you are done collaborating with them on it, you could change the permissions back to editing rights only for us.
>
> People won't edit a CSV. I also find it difficult to edit a plain CSV and to open it with MS Excel.

I don't think we should provide a public Google Sheets link in a GitHub issue, because that potentially allows anyone to edit. The link would then have to be shared privately, which converts this approach back to email communication again. Otherwise, I think Google Sheets editing has a lot of potential for getting collaborators to work on it.

larnsce commented 1 year ago

@mbannert @mianzg @sebastian-loos I had previously used the dataspice package to prepare data publications. They were not published as R data packages, but had a lot of great elements that we could re-use.

I am thinking particularly of the write_spice() function, which writes metadata from a set of CSVs into a JSON-LD file:

https://docs.ropensci.org/dataspice/reference/write_spice.html

Package:

https://docs.ropensci.org/dataspice/

We should review this workflow and adapt some of it to our own needs.
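
Roughly, the dataspice workflow looks like this (template file names follow the package docs; the metadata CSVs are filled in manually or with the package's helper functions):

```r
# sketch: dataspice metadata workflow
library(dataspice)

create_spice()  # creates data/metadata/ with four template CSVs:
                # access.csv, attributes.csv, biblio.csv, creators.csv

# ... fill in the template CSVs ...

write_spice()   # collates them into a JSON-LD file (dataspice.json)
```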

mbannert commented 1 year ago

rOpenSci is always a good outlet for packages. I'll check it, thanks. If it does not include too many dependencies, I am convinced...

larnsce commented 7 months ago

@margauxgo @mianzg: I have documented the workflow I have in my notebook as a separate repository:

https://github.com/openwashdata-dev/workflow/blob/main/docs/index.md