ctb opened this issue 3 years ago
Pamela has to leave by 10am.
I can be a helper for this.
Assessment questions (draft):

- I can structure my projects to account for a typical processing pipeline that takes data from raw form to processed results.
- I know how to choose where to store small, medium, and large data files for both archival and data-analysis purposes.
- I can retrieve large files directly to remote computers using wget or curl (see the sketch after this list).
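As a minimal sketch of how these two pieces fit together (the directory names and the URL below are placeholder assumptions, not part of the lesson):

```bash
# A simple project layout that keeps raw inputs separate from
# processed outputs:
mkdir -p project/raw project/scripts project/results
cd project/raw

# Fetch a large data file directly onto the remote machine with wget,
# avoiding a slow round trip through your laptop
# (placeholder URL; substitute your real data source):
wget https://example.com/data/reads.fastq.gz

# The same download with curl (-L follows redirects, -O keeps the
# remote filename):
curl -L -O https://example.com/data/reads.fastq.gz
```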
From Pamela:
Getting back to the data sharing and corruption issue, @Nick and I recommend that, as a first pass, you always run what we call a "data forensics" review of your datasets before you start intensive compute processes. Look at the data (or a subset if it's too big) and run a few simple diagnostic exploratory processes. Is the outcome what you'd expect? For example, with text files corruption issues typically become readily apparent as soon as you start working with them, so checking md5 is less important. For binary files (such as many formats for genetics data), things like md5 become more important. Non-binary quantitative datasets can be corrupted, but it's unlikely that only the numbers (and not the commas, delimiters, etc.) would be affected, so again the forensics would immediately tell you that you have issues.
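A hedged illustration of that first-pass review; the filenames (samples.csv, reads.bam, checksums.md5) are placeholders, not files from the lesson:

```bash
# Peek at a delimited text file before launching a long pipeline:
head -n 5 samples.csv                     # do columns and delimiters look right?
wc -l samples.csv                         # is the row count what you expect?
cut -d, -f1 samples.csv | sort -u | head  # spot-check the values in column 1

# For binary files, compare an MD5 checksum against the one the data
# provider publishes:
md5sum reads.bam
md5sum -c checksums.md5                   # if a checksum file is provided
```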
From @nick-ulle:
Other DataLab resources that may be useful (both have associated readers; I'll try to get those links added to the event pages soon):
Tuesday, August 17, 9:00-11:30 am PDT
Instructors:
Moderator: Saranya Canchi
Helpers: Nick
Zoom link:
Description:
Draft lesson: GGG 298 shell, second half: https://github.com/ngs-docs/2021-GGG298/tree/latest/Week2-UNIX_for_file_manipulation
owner: ???