
Teaching reproducibility and responsible workflow

JSM 2023 Journal of Statistics and Data Science Education Invited Paper Session

Modern statistics and data science use an iterative data analysis process to solve problems and extract meaning from data in a reproducible manner. The importance of the data analysis cycle has also been described in many places, including the ASA's guidelines for statistics majors and the Park City report.

The National Academies of Sciences, Engineering, and Medicine's (NASEM) 2018 "Data Science for Undergraduates" consensus study identified the importance of workflow and reproducibility as a component of the data acumen needed in our graduates. The NASEM report stated that "documenting, incrementally improving, sharing, and generalizing such workflows are an important part of data science practice owing to the team nature of data science and broader significance of scientific reproducibility and replicability."

In parallel, the NASEM 2019 report "Reproducibility and Replicability in Science" provided guidance on how to foster transparency and rigor in research. An issue of the Journal of Statistics and Data Science Education, published in November 2022, featured 11 papers plus an editorial on approaches to motivate and teach reproducibility and responsible workflow.

In this session, authors of papers in the issue and other experts on open and reproducible science will discuss some of the challenges and opportunities in helping students develop these important skills.

Session chair: Monica Alexander (University of Toronto)

Collaborative writing workflows: building blocks towards reproducibility

Sara Stoudt (Bucknell University)

Working with data necessitates collaboration. Although students often learn technical workflows to wrangle and analyze data, these workflows may break down or require adjustment to accommodate the different stages of the writing process when it comes time for the communication phase of the project. In this talk, I describe two writing workflows for use by students in a final-project setting. One workflow involves version control and aims to minimize the chance of a merge conflict throughout the writing process, and the other aims to add some level of reproducibility to a Google-Doc-heavy writing workflow (i.e., to avoid manual copying and pasting). Both rely on a division of labor, require a plan (and structure) to be created and followed by members of a team, and involve communication outside of the final report document itself.

Opinionated practices for teaching reproducibility: motivation, guided instruction, and practice

Tiffany Timbers (University of British Columbia)

Reproducibility is a critical component of creating trustworthy data analyses; however, most students enter the field of data science with other topics in mind, such as the current hot topic of machine learning. This, along with the highly technical nature of current reproducibility tools, presents out-of-the-gate challenges in teaching reproducibility. What can a data science educator do? Over several iterations of teaching courses focused on reproducible data science tools and workflows at the University of British Columbia, we have found that providing extra motivation, guided instruction, and lots of practice is key to effectively teaching this challenging yet important subject. In this talk, we present examples of how we motivate, guide, and provide ample practice opportunities to students to effectively engage them in learning how to perform reproducible data analyses.

From teaching to practice: Insights from the Toronto Reproducibility Conferences

Rohan Alexander (University of Toronto)

Theoretical statistics has well-established norms that govern what is required for a claim to be deemed credible: a proof that has been verified by others. In contrast, applied statistics and data science rely heavily on computation rather than formal proofs. In recent years, however, there has been considerable innovation in bringing to claims based on code and data a level of rigour comparable to that of claims made in statistical theory. One particular challenge is teaching these innovations, both in terms of content and pedagogical methods. The Toronto Reproducibility Conference is a multi-day, hybrid conference hosted by CANSSI Ontario and the University of Toronto's Data Sciences Institute that has been held in 2021, 2022, and 2023. This talk will summarize learnings from the "Teaching reproducibility" track and, in doing so, discuss the emerging consensus around teaching reproducible applied statistics and data science, and what the future may look like.

Teaching reproducibility and responsible workflow: an editor's perspective

Nicholas Horton (Journal of Statistics and Data Science Education and Amherst College)

In 2021, Project TIER and the Sheffield Methods Institute organized a ten-week symposium focused on computational reproducibility. This gathering was the impetus for a special issue of the Journal of Statistics and Data Science Education (JSDSE) on teaching reproducibility and responsible workflows. In this talk, I summarize key lessons from the symposium, the benefits that computational skills provide students for future education and employment, how this work fosters their intellectual development more broadly, and, from my perspective as the editor of JSDSE, how this topic relates to scholarship in statistics and data science education. I close with comments on the new JSDSE requirements that code and data supporting articles in the journal be shared publicly.

Discussant: Chris Paciorek (University of California, Berkeley)

November 2022 issue of the Journal of Statistics and Data Science Education