ropensci / unconf16

rOpenSci's San Francisco hackathon/unconf 2016
http://unconf16.ropensci.org

Standard(s) for inferring metadata from directory and file names #19

Open HenrikBengtsson opened 8 years ago

HenrikBengtsson commented 8 years ago

Outcome: A new package

UPDATE: This suggestion was worked on during rOpenSci Unconf 2016 and resulted in the dirdf package.

Call

@vsbuffalo, @karthik, @gaborcsardi, @richfitz, @jennybc et al., don't you think the Unconf15 topic "A package for higher-level R metadata extraction" deserves a bit more love?

Summary of packages / software

R Packages

EDIT 2016-03-04: Added a summary of the packages/software mentioned in last year's thread. Updated with those mentioned in this year's thread.

richfitz commented 8 years ago

Yes! @nicolewhite might possibly be interested too with her port of some string handling things: https://github.com/nicolewhite/pystr

richfitz commented 8 years ago

Though in terms of @vsbuffalo's original topic, I now endorse meaningless filenames backed by a lookup to a key-value store for the metadata.

gaborcsardi commented 8 years ago

Yeah, encoding metadata into file names is a neat trick, but it has its limits....

cboettig commented 8 years ago

> I now endorse meaningless filenames backed by a lookup to a key-value store for the metadata.

does that include version information in data identifiers? :-)

cboettig commented 8 years ago

(sorry if that sounded wrong, meant to say that I was curious to understand a bit better when this does or doesn't work; never felt like I had a good idea one way or the other when we were discussing this in terms of data versioning. a good topic for more exploration).

jennybc commented 8 years ago

"Meaningless filenames" fills me with existential dread.

@HenrikBengtsson don't you have something on your R wish list about a class for file paths? Yes here: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/9. That feels possibly related to this.

HenrikBengtsson commented 8 years ago

@jennybc, I'd say it's only somewhat related, but yes, I can imagine that we implement a file metadata API on top of classes like what's proposed in https://github.com/HenrikBengtsson/Wishlist-for-R/issues/9. Though, I don't think we need to figure out the latter in order to make progress on this one.

FYI, I've updated the top comment with a summary of packages/software mentioned here and in last year's thread.

maelle commented 8 years ago

@HenrikBengtsson I currently use my own package for the project I'm working on, where the filenames have a particular structure (villageID_participantID_date_etc + extension). It's not an optimal way of storing data but it's the current state of the project data.

I check that filenames have the right structure depending on the extension and then I check different things based on the logsheet (which files do I expect for each participantID, etc. -- the fact I use the logsheet is the reason I cannot make it public yet) once I've parsed all filenames. I now wonder how you would make this general? Or were you thinking of writing guidelines?
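For illustration, here is a minimal base R sketch of that kind of check. It is not the private package mentioned above: the file names, the exact pattern, and the column names are all assumptions made up for the example (three underscore-separated fields, an ISO date, then an extension).

```r
# Hypothetical file names; the last one does not follow the convention
files <- c("V01_P007_2016-03-04.csv", "V02_P013_2016-03-05.txt", "notes.txt")

# Assumed structure: villageID_participantID_date.extension
pattern <- "^([^_]+)_([^_]+)_([0-9]{4}-[0-9]{2}-[0-9]{2})\\.([A-Za-z0-9]+)$"

# Step 1: flag file names that do not have the expected structure
ok <- grepl(pattern, files)
bad_files <- files[!ok]

# Step 2: split the well-formed names into their fields.
# regexec() + regmatches() return the full match followed by each group.
parts <- regmatches(files[ok], regexec(pattern, files[ok]))
meta <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(meta) <- c("file", "village", "participant", "date", "ext")
meta$date <- as.Date(meta$date)
```

The resulting `meta` data frame could then be compared against a logsheet, for example with `merge()`, to see which expected files are missing for each participant.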

HenrikBengtsson commented 8 years ago

https://usecanvas.com/anonymous/pathmetadata/2VXxpEVm8W2UMIb86RbmoO

jennybc commented 8 years ago

https://github.com/ropenscilabs/dirdf

jcheng5 commented 8 years ago

EFFFFF ME

R's regex DOES support named capture groups! The syntax is just different than what I have used in the past, and perl = TRUE is required. I'll try to submit a pull request--this should let users pass regexes without requiring a separate colnames column.
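As a rough sketch of what that looks like in base R (this is not the dirdf API; the file names and group names below are made up for the example): with `perl = TRUE`, `regexpr()` attaches `capture.start` and `capture.length` attributes whose column names are the group names.

```r
# Hypothetical file names
x <- c("V01_P007.csv", "V02_P013.csv")

# Named capture groups use the (?<name>...) syntax and need perl = TRUE
pattern <- "^(?<village>[^_]+)_(?<participant>[^.]+)\\.csv$"
m <- regexpr(pattern, x, perl = TRUE)

# Start positions and lengths of each group, one column per named group
starts <- attr(m, "capture.start")
lens <- attr(m, "capture.length")
groups <- colnames(starts)

# Cut each named group out of the original strings
fields <- sapply(groups, function(g) {
  substring(x, starts[, g], starts[, g] + lens[, g] - 1)
})
meta <- as.data.frame(fields, stringsAsFactors = FALSE)
```

Because the column names come from the pattern itself, the caller would not have to supply them separately.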