gulfofmaine / Tidal_Exchanges

Repository for discussions around research team best practices following the Openscapes Champions Program

Data Storage Practices #3

Open · adamkemberling opened this issue 3 years ago

adamkemberling commented 3 years ago

It was noted during our discussions on 10/8/2021 that we (GMRI) have different data and project organization approaches tailored to different project types. Part of the "homework" that we discussed was to describe/draw and share the types of approaches that different team members are using, and to think critically about their strengths/weaknesses. Ultimately we'd like to settle on some best practices for different project types.

An example workflow strategy was shared using the tryeraser app.

Feel free to use that workspace for sketching and commenting.

Jamie-Behan commented 3 years ago

I don’t think I’ve been in the Kerr lab long enough to fully grasp all the ins and outs of our lab's organization approaches, so I will let another lab member speak to that. But I can shed a little light on the "practices" of the Chen lab, if that's helpful.

Typical workflow of the Chen lab:

- Download any pre-existing code/data that might be available from another source (ask around for who to ask/where to find it).
- Create a specific folder on your personal computer (not shared) and place all related materials into that folder.
- Edit the code and process the data to fit your needs.
- Once your project is done or the code is in a sharable state, upload the code to the lab GitHub and/or a personal GitHub (if it hasn't been already).
- Make sure shared code is annotated well for future users to understand/navigate (see the sketch below).
- Collaborative writing was done using Google Docs.

Pros of this method:

- Since everything stays on a personal computer (or private GitHub) until finalized, you don't have to worry about messing up or overwriting original code or raw data, because you are not working in a shared drive where materials from multiple users exist.
- You can set up and organize folders exactly how you want.
- Simple, with no juggling of multiple sharing/cloud platforms.

Cons of this method:

- Not well suited to collaboration.
- Cumbersome to "ask around" for data/code if it wasn't yet published to the lab GitHub.
- No formal layout was established, which led to inconsistencies between files and inefficient sharing.
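
Since the one redeeming practice here is the annotation step, here is a minimal sketch of what a well-annotated script header might look like in R. Everything in it (script name, paths, authorship details) is hypothetical, just to illustrate the convention:

```r
# ------------------------------------------------------------------
# survey_prep.R (hypothetical script name)
# Purpose: Clean raw survey data before modeling
# Author:  J. Behan          Last edited: 2021-10-08
# Inputs:  data/survey_raw.csv (source: lab GitHub; hypothetical)
# Outputs: output/survey_clean.csv
# ------------------------------------------------------------------

library(dplyr)  # data manipulation

# Read the raw survey data (path assumes the project folder layout above)
survey_raw <- read.csv("data/survey_raw.csv")

# Drop tows with missing coordinates before any spatial summaries
survey_clean <- survey_raw %>%
  filter(!is.na(lat), !is.na(lon))

write.csv(survey_clean, "output/survey_clean.csv", row.names = FALSE)
```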

We didn't really have different management practices for different project types, so I don't think any of these "techniques" should be used for our purposes, but they may be helpful as an example of what NOT to do!

adamkemberling commented 3 years ago

Hey @Jamie-Behan, thanks for taking the time to share that. I'm curious what kind of follow-through you saw with people actually cleaning code up and getting it onto GitHub once a project was done. This strategy offers an opportunity to avoid carrying over all the clutter of the different scripts used while "figuring it out," but I can also see the extra step of making code available becoming an afterthought when it already works fine in its current form. Thanks again for sharing!

Jamie-Behan commented 3 years ago

@adamkemberling In my experience, some people were better about cleaning code up than others. From my perspective, though, the incentive to publish clean code was that once it was shared, well-annotated and easy-to-follow code could be used right away, and users were less likely to ask follow-up questions, saving me (and future users) time in the long run. Obviously there are other factors to consider too, like the complexity of the project/code, but in general I think well-annotated code helps everyone, especially when looking back at old code that you wrote yourself but may no longer remember in detail.

ahart1 commented 3 years ago

I am in the same boat as @Jamie-Behan, so I will talk about the Fay Lab's workflow and my own personal take on it.

Workflows within the Fay Lab are pretty variable because we rarely work on projects that share code or data, so it is up to individuals to structure projects. I think most people have a private GitHub repository for projects that we can share with collaborators, and we typically maintain data on personal computers. We all handle manuscripts differently based on the text editor we prefer, but we have moved to Google Docs for materials developed by multiple lab members because it removes the need to save multiple versions locally and enables fast collaboration.

The lab has a Google Drive to store these collaborative documents, shared meeting notes, and other shared resources. The lab also has a GitHub organization that hosts repositories for projects involving multiple lab members, as well as workshop materials with multiple contributors. The organization owns the lab manual, code of conduct, and meeting schedule, so there is a common place to look for information. Finally, we have access to a lab storage drive that serves as a long-term archive and is probably underutilized.

[Diagram: Fay Lab workflow]

For my own projects I use a mix of GitHub and Google Drive to organize my workflow. I typically have a private repository on GitHub with collaborators granted access. I maintain a folder with the same name and file structure on Google Drive to back up data and other files for the project (on my personal computer these materials live in the same folder as the code). I have been using Overleaf to write manuscripts because it interfaces with GitHub easily and has nice reference management (cons: it requires a subscription and uses LaTeX to write/structure manuscripts, so there can be a learning curve). Data management is probably the weakest link in my workflow, since my data often needs to be stored separately from my code.

[Diagram: My workflow]
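
On the data-stored-separately-from-code problem, one pattern that can help is letting each collaborator point the code at their own copy of the data. A minimal sketch, assuming the Google Drive mirror described above (the `DATA_ROOT` variable and file name are hypothetical):

```r
library(here)  # builds paths relative to the project (.Rproj) root

# Each collaborator sets DATA_ROOT once in their ~/.Renviron, e.g.
#   DATA_ROOT=~/Google Drive/my-project
# Falls back to a data/ folder inside the repository if it isn't set.
data_root <- Sys.getenv("DATA_ROOT", unset = here("data"))

catch <- read.csv(file.path(data_root, "catch_by_year.csv"))
```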

mdmazur commented 3 years ago

I don't know the details of all workflows in the Kerr lab, but I can describe the workflow for the MSE work.

This workflow involves the high-performance computing cluster (HPCC) at UMass Dartmouth. My other workflows that don't use the HPCC usually just have connections between GitHub and a personal computer. If we are collaborating on a manuscript/report, we do that over Google Docs, but once the manuscript or report is almost finalized, it's moved to Box. It can be confusing which version is current when there are versions on both Box and Google Drive.

The HPCC allows thousands of iterations to be run much faster than on a personal computer. The HPCC pulls code from a specific branch on GitHub, so work on my personal computer can differ from what's on the HPCC. Downloading data from the HPCC to Box takes a long time, though (sometimes longer than the simulations themselves), which is a downside of this approach. The Box token also needs to be updated every so often.
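
For what it's worth, the Box transfers can also be scripted from R with the boxr package, which at least makes the token refresh and the slow downloads repeatable rather than manual. A rough sketch (the folder and file IDs are hypothetical):

```r
library(boxr)

# Interactive OAuth login; the cached token is what periodically
# expires and needs refreshing
box_auth()

box_setwd(dir_id = 12345678)               # hypothetical Box folder ID
box_ul(file = "results/mse_run_001.rds")   # upload local results to that folder
box_dl(file_id = 87654321)                 # download a file by its Box ID
```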

One thing to note is that GitHub does not handle .exe files, and we use an .exe file in the MSE work. So that must be downloaded separately, and it does not work on Macs.

[Diagram: MSE workflow]

jerellejesse commented 3 years ago

I think my workflow is pretty similar to everyone else's. When I use the HPCC, I follow the same workflow as Mackenzie, although I do feel like we had some problems with Box not being the best for working collaboratively. For example, it didn't always work well when we had a spreadsheet to assign tasks, and we have moved more to GitHub Projects for this. Otherwise, I use GitHub for code/data and Box for any reports, manuscripts, etc. I usually keep a copy of raw data and final manuscripts/reports on an external hard drive too, just to back things up and because I've switched jobs (with different cloud services/computers) frequently in recent years.

When I started, I received some pretty unorganized Box folders with data, code, literature, reports, etc., and no README files. Or maybe they were well organized in a way that didn't make immediate sense to me! I think we made a lot of progress when Sarah left by using R projects, the `here` function, and GitHub. I could run her code in minutes on my computer with guidance from her README files, but I think Box gave Lisa problems passing off Sarah's reports and such.
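
For anyone who hasn't tried the R project + `here` setup Sarah used, the appeal is that paths resolve from the .Rproj file at the project root, so a cloned repository runs unchanged on anyone's machine. A quick sketch (file names hypothetical):

```r
library(here)

# here() builds the path from the project root, not the working
# directory, so this works for any collaborator after cloning
landings <- read.csv(here("data", "landings_2020.csv"))

write.csv(landings, here("output", "landings_clean.csv"), row.names = FALSE)
```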

Also, maybe we can come back to this later since it isn't directly workflow-related, but I was wondering if anyone has ideas about sharing literature. I usually use Mendeley, make groups, and add people, but I was curious whether anyone has a better idea or thoughts about best practices.

aallyn commented 3 years ago

Nothing groundbreaking from me that differs from what others do; I largely follow Adam's workflow -- code on GitHub, then data/results/reports on Box (either in RES_Data or in a project folder in Mills Lab). It definitely isn't the best solution, and looking forward, as Mackenzie pointed out, having data on Box (or writing results to Box) from any cloud computing source can be incredibly slow. I've done a bit with Docker/Docker images/Digital Ocean, and it has been relatively painless except for the data/results transfer. I think I'd lean toward prioritizing best practices for a cloud-computing-style workflow as the goal. I've only got limited experience, but it seems that if you work toward that goal, you can still easily run things locally without leveraging the cloud computing power. On the flip side, if we focus on "local" computing workflows, there's no guarantee they will work if you end up needing cloud computing services.

LGCarlson commented 3 years ago

I think my workflow has some additional complexity due to working with confidential data (maybe more frequently than others) and across a large number of projects, but here's my current data storage strategy... I do not recommend said strategy.

[Diagram: current data storage strategy]

lkerr commented 3 years ago

- GitHub
- Box (formerly OneDrive, formerly ShareDrive)
- Google Drive (in the past, Dropbox)

*I don’t save anything significant on my personal computer.

kemills commented 3 years ago

I have most of my files on my personal computer (although when I get my new one, I will start routinely working from Box). We use the server for confidential data, RES_Data in Box for shared non-confidential data, and project folders in the MillsLab folder for data sets that one person is primarily using.

I use the Mills Lab folder on Box to see what some people are working on and track progress/provide input to their work. Usually this involves just 1:1 collaboration. Other lab members update me through RMarkdown files or other summaries (PPTs, emails) that may or may not be stored in Box--often, I just don't know where they live. Our shared workflows are managed through GitHub, and I need to become a better GitHub user to stay up to date on things that are housed there.

For working on large collaborations and collaborative writing (internal and external), I mostly rely on Google Drive. I still have a few projects in Dropbox and have to use Teams for one large collaborative effort.

mglbrjs commented 3 years ago

[Screenshot: example project folder structure]

I have all my files on Box in the Mills Lab folder. For any project I work on, I typically use a file structure similar to the screenshot above, shown here for the Pew simulation project. I have Code, Data, Temp Results, Draft Docs, and Papers folders. I usually make RMarkdown HTML files for each script and place those and any outputs in the Temp Results folder. In the Papers folder, I try to name files consistently and keep a spreadsheet with details on each paper (title, journal, topic, etc.). Draft Docs holds any presentations and manuscript drafts.
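
For a new project, that layout is quick to scaffold from R; a small sketch (the project name is hypothetical):

```r
project <- "Pew_Simulation"  # hypothetical project folder name
folders <- c("Code", "Data", "Temp Results", "Draft Docs", "Papers")

# Create the standard set of subfolders, skipping any that already exist
for (f in folders) {
  dir.create(file.path(project, f), recursive = TRUE, showWarnings = FALSE)
}
```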