workshop 6: Structuring your projects for current and future you

ngs-docs / 2021-august-remote-computing

Remote computing workshops in August 2021

https://ngs-docs.github.io/2021-august-remote-computing


workshop 6: Structuring your projects for current and future you #6

Open ctb opened 3 years ago

ctb commented 3 years ago

Tuesday, August 17, from 9:00 am to 11:30 am PDT

Instructors:

Moderator: Saranya Canchi

Helpers: Nick

Zoom link:

Description:

In this two-hour workshop, we will discuss folder structures for organizing your projects so that you can track inputs, outputs, and processing scripts over time and stay organized as your projects evolve.
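A folder structure like the one described above could be sketched as follows. This is only an illustration: the folder names (`data/raw`, `data/processed`, `scripts`, `results`) and the helper `make_project` are assumptions, not the layout the workshop prescribes.

```python
from pathlib import Path

# Hypothetical layout: these folder names are assumptions for illustration,
# not taken from the workshop materials.
LAYOUT = {
    "data/raw": "Original inputs; treat as read-only.",
    "data/processed": "Intermediate files that scripts can regenerate.",
    "scripts": "Processing code, ideally under version control.",
    "results": "Figures and tables produced by the scripts.",
}


def make_project(root):
    """Create the folder skeleton under root and return the created folders."""
    root = Path(root)
    created = []
    for rel, purpose in LAYOUT.items():
        folder = root / rel
        folder.mkdir(parents=True, exist_ok=True)
        # A short README in each folder records what belongs there.
        (folder / "README.txt").write_text(purpose + "\n")
        created.append(folder)
    return created
```

Calling `make_project("my_project")` would create the four folders, each with a one-line README. Keeping raw inputs apart from regenerated files is one way to make it clear which files can be deleted and rebuilt.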

draft lesson: ggg 298 shell, second half https://github.com/ngs-docs/2021-GGG298/tree/latest/Week2-UNIX_for_file_manipulation

owner: ???

ctb commented 3 years ago

Pamela has to leave by 10am.

nick-ulle commented 3 years ago

I can be a helper for this.

ctb commented 3 years ago

assessment questions, draft --

I can structure my projects to account for a typical processing pipeline that takes raw data to processed results.

I know how to choose where to store small, medium and large data files for both archival and data analysis purposes.

I can retrieve large files directly to remote computers using wget or curl.
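For the last question, the workshop tools are `wget` and `curl`. The same idea, fetching a file directly on the remote machine instead of copying it through a laptop, can be sketched in Python with the standard library. This swaps in `urllib.request` for those tools; the function name `fetch` and the `.part` suffix are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url, dest, chunk_size=1024 * 1024):
    """Stream url to dest in chunks so the whole file never sits in memory."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write under a temporary name first (an assumed convention) so an
    # interrupted transfer never leaves a partial file under the final name.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    partial.replace(dest)
    return dest
```

Run on the remote computer, a call such as `fetch(some_url, "data/raw/reads.fastq.gz")` would place the file straight into the project's raw-data folder; both the URL and the destination path here are placeholders.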

jeremywalter commented 3 years ago

Pre: https://forms.gle/xH5M4oX3pbGZVTNfA Post: https://forms.gle/3T9JQ7qwPMPVcPPS8

s-canchi commented 3 years ago

Add `quit` as the command to exit out of sftp

s-canchi commented 3 years ago

From Pamela:

Getting back to the data sharing and corruption issue, @Nick and I recommend that as a first pass you always run what we call a “data forensics” review of your datasets before you start intensive compute processes. Look at the data (or a subset if it’s too big) and run a few simple exploratory diagnostics. Is the outcome what you’d expect? For example, with text files, corruption typically becomes apparent as soon as you start working with them, so checking md5 is less important. For binary files (such as many genetics data formats), checks like md5 become more important. Non-binary quantitative datasets can be corrupted too, but it’s unlikely that only the numbers (and not the commas, delimiters, etc.) are affected, so again the forensics would tell you immediately that you have issues.
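The two kinds of checks mentioned above might look like this in Python. It is a minimal sketch, not the procedure Pamela and Nick use: `md5sum` and `field_counts` are hypothetical helper names, and counting fields per row is just one possible quick diagnostic for delimited text.

```python
import csv
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def field_counts(path, delimiter=",", max_lines=1000):
    """Tally the number of fields per row over the first max_lines rows."""
    counts = {}
    with open(path, newline="") as handle:
        for i, row in enumerate(csv.reader(handle, delimiter=delimiter)):
            if i >= max_lines:
                break
            counts[len(row)] = counts.get(len(row), 0) + 1
    return counts
```

For a binary file, compare the result of `md5sum` with the checksum published by the data provider. For a delimited text file, `field_counts` returning more than one distinct field count is a hint that some rows may be truncated or damaged and deserve a closer look.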

s-canchi commented 3 years ago

From @nick-ulle

Useful library to compare tabular outputs: https://github.com/paulfitz/daff
Another useful resource about naming files: https://speakerdeck.com/jennybc/how-to-name-files

PLNReynolds commented 3 years ago

Other DataLab resources that may be useful (both have readers associated, I'll try to get those links added to the event pages soon):