ropensci / dataspice

:hot_pepper: Create lightweight schema.org descriptions of your datasets
https://docs.ropensci.org/dataspice

Multiple webpages for multiple datasets in the same repo? #51

Open tiernanmartin opened 6 years ago

tiernanmartin commented 6 years ago

Exciting new package 🎉

If I have multiple datasets in the same repository is it possible to build a webpage for each dataset? Or is this package only set up for the 1:1 dataset-to-repo use case?

amoeba commented 6 years ago

The sky's the limit, I suppose. Right now, dataspice is one repo <-> one page, where your repo can have one or more CSVs in it (more formats in the future). Figuring out what a "dataset" is is very hard, so dataspice just ignores that and generates metadata for the files (much easier to determine what a file is).

Would you be willing to describe what your use case looks like a bit? If you have an example repo with a few datasets that'd be super helpful. It's certainly common enough to expect that others have this use case, so thanks!

tiernanmartin commented 6 years ago

Sure!

I was inspired by fivethirtyeight/data to create a repo with the datasets for many unrelated projects. The motivation for doing this is my desire to make it easier for one dataset to be used in multiple unrelated projects. dataspice seems like a great way to generate metadata for these data files, but I would need the ability to create a separate page for each file.

Here's an illustration of the repo architecture that I'd like to create (using two made up datasets):

    .
    ├── my-twitter-mentions/
    │   ├── data/
    │   │   ├── metadata/
    │   │   └── my-twitter-mentions-2017.csv 
    │   ├── docs/
    │   │   └── index.html
    │   └── README.md
    ├── product-user-survey-2018/
    │   ├── data/
    │   │   ├── metadata/
    │   │   └── product-user-survey-2018-results.csv 
    │   ├── docs/
    │   │   └── index.html
    │   └── README.md
    └── README.md

amoeba commented 6 years ago

That's nice and clean. Thanks.

We're currently using a convention of "your data are in ./data" and your example is only a minor tweak to that so I think it'd fit well.

We were discussing how to do the HTML pages over in #46 and Blogdown/Hugo came up as an option for supporting more complex repos (like your example) with indexes of datasets, search, tags, etc.

I'm wondering what a good API would be for getting the metadata from the user for each of the folders of data. Our current API is nice and simple because we're only supporting one folder. If we smushed all of them into a single set of metadata template CSVs, I guess that'd be workable both for the Shiny apps and for editing the CSVs manually. Not quite sure what the R function API would be, but it's definitely possible.
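To make the "single set of metadata template CSVs" idea concrete, here is a rough sketch. It is Python rather than R and is not part of dataspice. The function name `merge_attribute_tables`, the added `dataset` column, and the assumption that every dataset folder keeps an `attributes.csv` under `data/metadata/` are all hypothetical choices made for illustration.

```python
import csv
from pathlib import Path


def merge_attribute_tables(repo_root, out_path):
    """Combine every <dataset>/data/metadata/attributes.csv into one CSV.

    A `dataset` column (the top-level folder name) is added so each row
    can be traced back to the folder it came from. Hypothetical helper,
    not a dataspice function.
    """
    repo_root = Path(repo_root)
    rows = []
    fieldnames = ["dataset"]
    # sorted() keeps the output order stable across runs.
    for table in sorted(repo_root.glob("*/data/metadata/attributes.csv")):
        dataset = table.relative_to(repo_root).parts[0]
        with table.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Collect the union of columns, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                rows.append({**row, "dataset": dataset})
    with Path(out_path).open("w", newline="", encoding="utf-8") as handle:
        # restval fills columns that a given folder's table does not have.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Keeping the folder name as an ordinary column means the merged table can still be edited by hand or in a spreadsheet, and can be split back into per-folder tables later.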

I'll keep chewing on this. Thanks!

cboettig commented 6 years ago

Exploring the look and strategy for this a bit with Hugo. My current sketch supports the following strategies:

{{% cards2.html %}}

to create a rich-card style layout. (I'll add some more layouts as alternate shortcodes).

This will create a landing page for each dataset, and the homepage will create a list view with preview cards for all the datasets on the website. See: https://cboettig.github.io/dataspice-web/ex2/ (source at: http://github.com/cboettig/dataspice-web/).

Alternately, for a more one-off approach, hugo can also read JSON from the data/ dir, so one could drop the dataspice.json there and use some different shortcodes to embed the metadata cards onto any page.
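As a language-neutral illustration of the "read the JSON, embed a card" idea, here is a small Python sketch (not Hugo template code, and not taken from the dataspice-web repo). It assumes the JSON is a schema.org Dataset with `name`, `description`, and `distribution` fields. The function name `render_card`, the HTML markup, and the example values are made up.

```python
import html
import json


def render_card(dataset):
    """Return a small HTML card for one schema.org Dataset description."""
    name = html.escape(dataset.get("name", "Untitled dataset"))
    description = html.escape(dataset.get("description", ""))
    files = dataset.get("distribution", [])
    # schema.org allows a single object where a list is expected.
    if isinstance(files, dict):
        files = [files]
    items = "".join(
        "<li>{}</li>".format(html.escape(f.get("name", "unnamed file")))
        for f in files
    )
    return (
        '<div class="card"><h2>{}</h2><p>{}</p><ul>{}</ul></div>'
    ).format(name, description, items)


# Hypothetical metadata, loosely modelled on the made-up datasets above.
example = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "My Twitter mentions",
  "description": "Mentions collected in 2017.",
  "distribution": [
    {"@type": "DataDownload", "name": "my-twitter-mentions-2017.csv"}
  ]
}
""")

print(render_card(example))
```

A Hugo shortcode would do the same thing in template syntax. The point is only that the card needs nothing beyond the fields already present in the JSON.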

I think I can turn this into a Hugo theme pretty easily; I mostly need to figure out how much CSS style gets hardwired in and how much I can manage to make easily customizable. Thoughts, ideas? Suggestions for other layout shortcodes? (I'll mock one up that looks more like the dataspice build_site() template, where there's no submenu navigation on the cards.)

rubenarslan commented 6 years ago

I think the term "dataset" is a bit ambiguous. You mean multiple tables, right? But any given study may have, e.g., one table to describe locations, one to describe animals, and one to describe time series of behaviour of those animals. Therefore, I thought it would be good to give people the freedom to put several codebooks on the same page and intersperse descriptions of how they relate to one another. Using the rmd partial approach, they can also easily choose to put them on separate pages where that makes more sense.

cboettig commented 6 years ago

@rubenarslan Thanks for the comments! Yes, I think dataset is somewhat intentionally ambiguous.
Google's page laying out the Schema.org/Dataset vocabulary on which dataspice is based puts it like this:

Here are some examples of what can qualify as a dataset:

  • A table or a CSV file with some data
  • An organized collection of tables
  • A file in a proprietary format that contains data
  • A collection of files that together constitute some meaningful dataset
  • A structured object with data in some other format that you might want to load into a special tool for processing
  • Images capturing data
  • Files relating to machine learning, such as trained parameters or neural network structure definitions
  • Anything that looks like a dataset to you

So yes, it could be multiple tables, and also things that are not tables. Note that the goal of dataspice is primarily to provide a convenient way of creating and managing a basic schema.org/Dataset description of data, rather than to create websites. Once you have these metadata descriptions in this convenient and standard JSON format, it's relatively easy to create websites with various layouts. dataspice includes some layouts out of the box, but I don't think our goal is a one-size-fits-all-needs-equally-well layout; the goal is rather a framework that can be easily adapted to "whatever looks like a dataset to you".
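For a sense of how little is needed on the website side, here is a minimal Python sketch that wraps a schema.org/Dataset description in a JSON-LD `<script>` element, which is the usual way such metadata is embedded in a page. The function name `jsonld_script_tag` and the example values are hypothetical, and this is not how dataspice itself builds its pages.

```python
import json


def jsonld_script_tag(metadata):
    """Wrap a schema.org description in a JSON-LD <script> element."""
    payload = json.dumps(metadata, indent=2, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early;
    # "\/" is a valid JSON escape for "/", so the data is unchanged.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n{}\n</script>'.format(payload)


# Hypothetical description, using one of the made-up datasets from this thread.
description = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Product user survey 2018",
    "description": "Survey results (made-up example).",
}

print(jsonld_script_tag(description))
```

Any layout (cards, a single README-style page, several codebooks on one page) can carry the same element, so the choice of layout stays independent of the metadata.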

rubenarslan commented 6 years ago

Do you describe the dataspice high-level idea somewhere where I can read it?

cboettig commented 6 years ago

@rubenarslan Great question! I think we're still consolidating our ideas to some extent; a blog post in the works now should give a bit more soon.

Meanwhile, there's a little blurb at the top of the README that has the basic gist I think:

The goal of dataspice is to make it easier for researchers to create basic, lightweight and concise metadata files for their datasets. These basic files can then be used to:

  • make useful information available during analysis.
  • create a helpful dataset README webpage.
  • produce more complex metadata formats to aid dataset discovery.

Metadata fields are based on schema.org and other metadata standards.

Not sure if that helps. I think we're still figuring out how best to communicate these ideas to different audiences, so feedback is definitely most helpful!