minisciencegirl / studyGroup

http://minisciencegirl.github.io/studyGroup/

File Organization and Naming #20

Closed jhchiang closed 7 years ago

jhchiang commented 9 years ago

Hello all,

This is a request in the realm of the super basic - could we have a session on strategies for organization of files?

For example, I'm dipping my toes into using Cluster3.0 and Java Treeview. I try a bunch of different types of clustering with different filters set in Cluster, and each try generates a slightly different file from the same dataset. I want to record what the settings were and what dataset I started out with. Rather than saving a file with a name a mile long, like "FlightHomLog2_av1_Euc_ArrayEucl_AvgLink", there must be better ways to organize my file space.

Thanks!

Jen

ahippman commented 9 years ago

GREAT IDEA!!!! This is something I'm struggling with as well....

Anna Hippmann, PhD Student

Department of Earth, Ocean and Atmospheric Sciences The University of British Columbia Room 2041, Earth Sciences Building 2207 Main Mall Vancouver, British Columbia Canada, V6T 1Z4

ahippman@eos.ubc.ca office +1-604-827-5459 cell +1-604-771-8346


jennybc commented 9 years ago

Ha! I just spent a late night / early morning contributing some stuff to the Reproducible Science Workshop that's going on right now at Duke! I contributed slides on file naming and organization, among other things:

Maybe we could convene a session some time soon based on this curriculum and whatever improvements come out of this inaugural workshop?

@jhchiang To answer your question, I think the filenames you are considering may very well be EXACTLY the right idea. Cumbersome, perhaps, but this is actually a good strategy in many settings.

minisciencegirl commented 9 years ago

That would be fantastic @jennybc! Please open an issue for a time/ date when you get a chance. The Study Group has been a great platform for testing out teaching topics/ practice talks.

bkatiemills commented 9 years ago

I heard great things about the inaugural run of this workshop at Duke; shall we cherry-pick from there to answer @jhchiang's original question? Lots of room in July still!

minisciencegirl commented 8 years ago

@jhchiang: Saw these two papers that may help with file organization as well as reproducible research.
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

@BillMills @jhchiang and @ahippman: are you guys interested in a mini-journal club in September? How about 9/16, 3 pm?

bkatiemills commented 8 years ago

I'm in! A number of groups are trying out journal-club-like activities, but no one has tried an actual journal club - let's do it!

rozdakin commented 8 years ago

I would be up for this!

And not just a way to procrastinate getting organized. We don't have to do it until after the journal club, right?

minisciencegirl commented 8 years ago

Awesome! The event is scheduled.


minisciencegirl commented 8 years ago

Hi guys, would it be possible to move this to 4pm, on the same day? I can book a room for us in the Pharm Sci building, Room 3340.

rozdakin commented 8 years ago

Works for me

dpshelio commented 8 years ago

Thank you all for this material. This is something I've been fighting with for a while, and I have not found a perfect solution. My method looks quite similar to the slides @jennybc showed... but because my raw data may be used in multiple projects, I'm now starting to use a database where I can tag the files and avoid having the same file copied multiple times (besides, the database can hold additional metadata that doesn't fit in a filename). I do solar physics, so the "raw" data step may be different from yours... but the rest is probably equally applicable.

One other thing: have you seen the Open Science Framework? It may be useful for you... especially if you are working with more than one person.

Thanks again!! :)

minisciencegirl commented 8 years ago

Hi @dpshelio: Thanks for your suggestion! Hopefully we will see you at today's journal club. The Open Science Framework sounds great - have you used it in your work? Care to give a demo? Cc: @BillMills

dpshelio commented 8 years ago

@minisciencegirl I'm based in the UK... I don't think I'll be able to get there in time ;) I've not used OSF enough to know it very well. I've just tried it with one project, and I don't really understand some of the terms they use or how to apply them to my workflow, e.g. registrations or collections; maybe they're common terms in other disciplines. They have a few videos showing how to get started.

bkatiemills commented 8 years ago

Thanks, @dpshelio! I worked kind of tangentially with / around some of the OSF folks while I was at Mozilla - they're a great team, but I've never had the chance to dig into their actual product before. Would be interesting to dig in and be a bit more fluent there.

ivanhanigan commented 8 years ago

@minisciencegirl thanks for pointing me here!

There is a bunch of good advice already here, and I recommend the slides https://github.com/Reproducible-Science-Curriculum/rr-organization1/tree/master/slides/naming-slides and the two PLOS articles, but I wanted to pull out the two things I think it is important to get right: how the name is split into substring chunks, and the order those chunks come in.

I think that the substring chunks are explained well in the slides link above (summary, use '_' or '-' to split the string), but I think that the ordering problem can be discussed more.

Ordering things into clusters based on how variable they are across a project

I've been thinking about Hadley Wickham's points about tidy data, and the order that columns should be arranged in tabular data. The principles are similar I think.

A good ordering makes it easier to scan the raw values. One way of
organizing variables is by their role in the analysis: are values
fixed by the design of the data collection, or are they measured
during the course of the experiment? Fixed variables describe the
experimental design and are known in advance. Computer scientists
often call fixed variables dimensions, and statisticians usually
denote them with subscripts on random variables. Measured variables
are what we actually measure in the study. Fixed variables should come
first, followed by measured variables, each ordered so that related
variables are contiguous. Rows can then be ordered by the first
variable, breaking ties with the second and subsequent (fixed)
variables. 

Wickham, H. (2014). Tidy Data. JSS Journal of Statistical Software, 59(10). Retrieved from http://www.jstatsoft.org/

One way that we did this:

Colleagues and I came up with the following protocol for an ecology and biodiversity database

  1. Project name (optional sub-project name)
  2. Data type (such as experimental unit, observational unit, and/or measurement methods)
  3. Geographic location (State, Country)
  4. Temporal frequency and coverage (annual or seasonal tranches)

Tidy data generalisable concepts are dimensions and variables

The concept of dimensions and variables can be useful here, especially for deciding on filenames. Dimensions are fixed or change slowly, while variables change more quickly. For example, the project name is 'fixed', that is, it does not change across the files, but the sub-project name does change, just more slowly (say there may be 2-3 different sub-projects within a project). Then there may be a set of data types, and these 'change' more quickly than the sub-project name (by change I mean there are more of them). Then the geographic and temporal variables might change quickest of all.

So a general rule for the order of things can be stated: The more fixed variables should come first (those things that don't change, or don't change much), followed by the more fluid variables (or things that change more across the project). List elements can then be ordered so that the groups of things that are similar will always be contiguous, and vary sequentially within clusters.
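A minimal sketch of that ordering rule in Python (a language not otherwise used in this thread). The function name `build_filename` and its parameters are hypothetical, chosen only for illustration; the component values are taken from the filenames listed in this comment, and the lower-casing and space handling are assumptions of the sketch rather than part of the protocol described here.

```python
def build_filename(project, subproject, data_type, location, period, ext="csv"):
    """Join name chunks from most fixed to most fluid.

    Hypothetical helper: the order (project, sub-project, data type,
    location, time period) follows the rule stated above.
    """
    parts = [project, subproject, data_type, location, period]
    # Assumption of this sketch: lower-case every chunk and replace
    # spaces inside a chunk with '-', then join the chunks with '_'.
    cleaned = [str(p).strip().lower().replace(" ", "-") for p in parts if p]
    return "_".join(cleaned) + "." + ext


names = [
    build_filename("asn", "fnqr", "veg_seedling", "robson", "2010-2012"),
    build_filename("asn", "fnqr", "soil_pit", "robson", 2012),
    build_filename("asn", "fnqr", "soil_charact", "robson", 2011),
]

# A plain alphabetical sort keeps related files next to each other,
# because the slow-changing chunks come first in every name.
for name in sorted(names):
    print(name)
```

Because the fixed chunks lead, sorting the names alphabetically puts the two soil files side by side and the vegetation file after them, without any extra bookkeeping.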

Perhaps an example would be easier to understand. Here is a set of file names that we constructed for one of our ecological field sites (project) and plots (sub-project or measurement location):

Notice we also had a controlled vocabulary of data types and their acronyms before starting this

| Filename                                                            | Title                                                                                                                                 |
|---------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------|
| asn_fnqr_soil_charact_robson_2011.csv                               | Soil Data, Far North Queensland Rainforest SuperSite, Robson Creek, 2011                                                              |
| asn_fnqr_soil_pit_robson_2012.csv                                   | Soil Pit Data, Water Content and Temperature, Far North Queensland Rainforest SuperSite, Robson Creek, 2012                           |
| asn_fnqr_veg_seedling_robson_2010-2012.csv                          | Seedling Survey,  Far North Queensland Rainforest SuperSite, Robson Creek, 2010-2012                                                  |
| asn_fnqr_veg_seedling_transect_coord_robson_2010-2012.csv           | Seedling Survey,  Far North Queensland Rainforest SuperSite, Robson Creek, 2010-2012                                                  |
| asn_fnqr_core_1ha_robson_2014.csv                                   | Soil Pit Data, Soil Characterisation, Far North Queensland Rainforest SuperSite, Robson Creek, Core 1 ha plot, 2014                   |
| asn_fnqr_fauna_biodiversity_ctbcc_2012.csv                          | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, 2012                                      |
| asn_fnqr_fauna_biodiversity_ctbcc_2013.csv                          | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, 2013                                      |
| asn_fnqr_fauna_biodiversity_ctbcc_capetrib_2014.csv                 | Avifauna Monitoring, Far North Queensland Rainforest SuperSite, Cape Tribulation, 2014                                                |
| asn_fnqr_fauna_biodiversity_ctbcc-lu11a_2014.csv                    | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU11A, 2014                               |
| asn_fnqr_fauna_biodiversity_ctbcc-lu7a_2014.csv                     | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU7A, 2014                                |
| asn_fnqr_fauna_biodiversity_ctbcc-lu7b_2014.csv                     | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU7B, 2014                                |
| asn_fnqr_fauna_biodiversity_ctbcc-lu9a_2014.csv                     | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU9A, 2014                                |
| asn_fnqr_fauna_biodiversity_ctbcc-lu11a_2009-2011.csv               | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU11A, 2009-2011                          |
| asn_fnqr_fauna_biodiversity_ctbcc-lu7a_2009-2011.csv                | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU7A, 2009-2011                           |
| asn_fnqr_fauna_biodiversity_ctbcc-lu9a_2009-2011.csv                | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU9A, 2009-2011                           |
| asn_fnqr_fauna_biodiversity_habitat_codes_ctbcc-lu11a_2009-2011.pdf | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU11A, 2009-2011                          |
| asn_fnqr_fauna_biodiversity_habitat_codes_ctbcc-lu9a_2009-2011.pdf  | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU9A, 2009-2011                           |
| asn_fnqr_fauna_biodiversity_habitat_codes_ctbcc-lu7a_2009-2011.pdf  | Vertebrate Fauna Biodiversity Monitoring, Far North Queensland Rainforest SuperSite, CTBCC, LU7A, 2009-2011                           |
| asn_fnqr_fauna_birds_capture_robson_2011-2014.csv                   | Bird Capture Data, Far North Queensland Rainforest SuperSite, Robson Creek, 2011-2014                                                 |
| asn_fnqr_fauna_birds_robson_2010-2014.csv                           | Bird Survey Data, Far North Queensland Rainforest SuperSite, Robson Creek, 2010-2014                                                  |
| asn_fnqr_fauna_invert_moth_robson_2009.csv                          | Moth Inventory at Canopy and Ground Level, Far North Queensland Rainforest SuperSite, Robson Creek, 2009                              |
| asn_fnqr_fauna_invert_moth_robson_2010.csv                          | Moth Inventory at Canopy and Ground Level, Far North Queensland Rainforest SuperSite, Robson Creek, 2010                              |
| asn_fnqr_fauna_invert_moth_robson_2011.csv                          | Moth Inventory at Canopy and Ground Level, Far North Queensland Rainforest SuperSite, Robson Creek, 2011                              |
| asn_fnqr_fauna_invert_robson_25ha_2013                              | Invertebrate Fauna Survey, Far North Queensland Rainforest SuperSite, Robson Creek, 25 Ha Plot, 2013                                  |
| asn_fnqr_geo_tracks_100m_grid_robson_2010-2013.kml                  | Base Geographical Data,  Far North Queensland Rainforest SuperSite, Robson Creek, 2010-2013                                           |
| asn_fnqr_geo_tracks_robson_2010-2013.kml                            | Base Geographical Data,  Far North Queensland Rainforest SuperSite, Robson Creek, 2010-2013                                           |
| asn_fnqr_geo_tracks_robson_2010-2013.mdb                            | Base Geographical Data,  Far North Queensland Rainforest SuperSite, Robson Creek, 2010-2013                                           |
| asn_fnqr_geo_tracks_trees_robson_2010-2013.kml                      | Base Geographical Data,  Far North Queensland Rainforest SuperSite, Robson Creek, 2010-2013                                           |
| asn_fnqr_soil_properties_ddc_2013.csv                               | Soil Data, Far North Queensland Rainforest SuperSite, Daintree Discovery Centre, 2013                                                 |
| asn_fnqr_soil_properties_ddc_2014.csv                               | Soil Data, Far North Queensland Rainforest SuperSite, Daintree Discovery Centre, 2014                                                 |
| asn_fnqr_soil_properties_robson_2014.csv                            | Soil Data, Far North Queensland Rainforest SuperSite, Robson Creek, 2014                                                              |
| asn_fnqr_stream_chem_robson_201310.csv                              | Water Chemistry Data, Far North Queensland Rainforest SuperSite, Robson Creek, 201310-201311                                          |
| asn_fnqr_stream_chem_robson_201310-201405.csv                       | Water Chemistry Data, Far North Queensland Rainforest SuperSite, Robson Creek, 201310-201405                                          |
| asn_fnqr_stream_chem_robson_201311.csv                              | Water Chemistry Data, Far North Queensland Rainforest SuperSite, Robson Creek, 201310-201311                                          |
| asn_fnqr_stream_chem_std_methods_robson_2013.pdf                    | Water Chemistry Data, Far North Queensland Rainforest SuperSite, Robson Creek, 201310-201311                                          |
| asn_fnqr_stream_phys-chem_diagram_robson_2013.pdf                   | Stream Physico-Chemical Data,  Far North Queensland Rainforest SuperSite, Robson Creek, 201304-201305                                 |
| asn_fnqr_stream-phys-chem_robson_201304-201305.csv                  | Stream Physico-Chemical Data,  Far North Queensland Rainforest SuperSite, Robson Creek, 201304-201305                                 |
| asn_fnqr_veg_cwd_robson_core_1ha_2012.csv                           | Vascular Plant Data, Far North Queensland Rainforest SuperSite, Robson Creek, Core 1 ha, 2012                                         |
| asn_fnqr_veg_dbh-h_capetrib_crane_plot_2001.csv                     | Vascular Plant Data, Far North Queensland Rainforest SuperSite, Cape Tribulation, 1 ha Crane Plot, 2001                               |
| asn_fnqr_veg_dbh-h_capetrib_crane_plot_2005.csv                     | Vascular Plant Data, Far North Queensland Rainforest SuperSite, Cape Tribulation, 1 ha Crane Plot, 2005                               |
| asn_fnqr_veg_dbh-h_capetrib_crane_plot_2010.csv                     | Vascular Plant Data, Far North Queensland Rainforest SuperSite, Cape Tribulation, 1 ha Crane Plot, 2010                               |

bkatiemills commented 8 years ago

Great high-level strategy + examples - these are the sorts of considerations that really serve a discussion on data standardization well. The more logically ordered / less arbitrary a data standard is, the more people will think it makes enough sense to adopt.

One thing I'm curious about: how do you deal with subsetting this data? As in, if you want all the 2013 data - what's the plan? ls *_2013* I guess. Actually, that's sort of slick - at Journal Club when we read the papers linked above, we were chewing on the idea of organization through folders, which becomes annoying if you want to collect all the files for a given year if you've put them in folders by site (or vice versa, or whatever). If you're comfortable having a giant directory of a zillion data files named this way, some basic shell usage can provide more flexible subsetting. Interesting idea!
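The same wildcard subsetting can be sketched in Python, here against a handful of the filenames from the table above rather than a real directory. The helper name `select` is hypothetical; `fnmatchcase` is the standard-library function for shell-style patterns.

```python
from fnmatch import fnmatchcase

# A few filenames copied from the table earlier in this thread.
filenames = [
    "asn_fnqr_fauna_biodiversity_ctbcc_2012.csv",
    "asn_fnqr_fauna_biodiversity_ctbcc_2013.csv",
    "asn_fnqr_soil_properties_ddc_2013.csv",
    "asn_fnqr_soil_properties_ddc_2014.csv",
    "asn_fnqr_stream_chem_robson_201310.csv",
]


def select(names, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]


# Same pattern as `ls *_2013*`. Note that it also picks up the monthly
# file ending in 201310, since that chunk starts with "_2013" too.
by_year = select(filenames, "*_2013*")

# The same list can be cut a different way without moving any files.
by_type = select(filenames, "*_soil_*")

print(by_year)
print(by_type)
```

On an actual directory, `glob.glob("*_2013*")` applies the same kind of pattern to the files on disk.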

ivanhanigan commented 8 years ago

@BillMills We avoided having zillions of files in a folder, but used folders too. The main repo we used is a data portal called 'Metacat', an open source portal designed to host all kinds of ecological data https://knb.ecoinformatics.org/#tools. This indexes files and the user can browse or search a catalogue.

I think the consideration about people being uncomfortable with a zillion files in a folder is important (plus issues of disk I/O when backing up or copying such a beast). We were safeguarding against users downloading a stack of files, and then not knowing where they had come from... a potential issue I think you'll agree? This way it seems not too hard to figure out what these are regardless of where they end up on the downstream user's computer.

In my own work, I do like the ability to use substrings and regular expressions on lists of zillions of files, but I also keep these in sensible folders. For example:

setwd("~/data/AWAP_GRIDS/")
# list every file below the working directory whose name contains "totals_2013"
filelist <- dir(recursive = TRUE, pattern = "totals_2013")
filelist
      [,1]                              
 [1,] "data/totals_2013010120130131.tif"
 [2,] "data/totals_2013020120130228.tif"
 [3,] "data/totals_2013030120130331.tif"
 [4,] "data/totals_2013040120130430.tif"
 [5,] "data/totals_2013050120130531.tif"
 [6,] "data/totals_2013060120130630.tif"
 [7,] "data/totals_2013070120130731.tif"
 [8,] "data/totals_2013080120130831.tif"
 [9,] "data/totals_2013090120130930.tif"
[10,] "data/totals_2013100120131031.tif"
[11,] "data/totals_2013110120131130.tif"
[12,] "data/totals_2013120120131231.tif"

bkatiemills commented 8 years ago

(plus issues of disk I/O when backing up or copying such a beast).

Another question of the right tool (or strategy) for the job. We're at a point now where hardware just blows this problem away at any human scale of data, i.e., no researcher is ever in their lifetime going to be able to manually collect enough datasets to make this a problem for data collected by hand; the situation is different with automated data collection systems, but those aren't viable everywhere.

We were safeguarding against users downloading a stack of files, and then not knowing where they had come from... a potential issue I think you'll agree?

Absolutely; that's why I think a bunch about how to make data self-describing, through effective use of metadata. Encoding metadata in your filenames is essentially what you're doing, and that's as good a solution as any. It's a really tough problem that I don't have an answer to yet - see for example, this ongoing discussion.
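To make the "metadata in the filename" idea concrete, here is a rough Python sketch that reads fields back out of names like the ones in the table above. The function `parse_filename` and the field names are hypothetical, and the sketch assumes only what the listed names have in common: the first two chunks are project and sub-project, and the last chunk is the time period.

```python
from pathlib import PurePath


def parse_filename(filename):
    """Split a name of the form project_subproject_..._period.ext into fields."""
    path = PurePath(filename)
    chunks = path.stem.split("_")
    if len(chunks) < 4:
        raise ValueError(f"too few chunks in {filename!r}")
    return {
        "project": chunks[0],
        "subproject": chunks[1],
        # Data type and location take a varying number of chunks in the
        # listed names, so this sketch keeps them together instead of
        # guessing where one ends and the other begins.
        "type_and_location": chunks[2:-1],
        "period": chunks[-1],
        "format": path.suffix.lstrip("."),
    }


info = parse_filename("asn_fnqr_soil_properties_ddc_2013.csv")
print(info)
```

The middle chunks are the weak spot: without a controlled vocabulary to check them against, a filename alone cannot say where the data type stops and the location starts, which is one reason metadata stored outside the name is still worth having.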