Request: Database Schema Associated with the Big Dataset

metmuseum / openaccess

The Metropolitan Museum of Art's Open Access Initiative

Creative Commons Zero v1.0 Universal


Request: Database Schema Associated with the Big Dataset #6

Open ajgabz opened 7 years ago

ajgabz commented 7 years ago

According to the README file, this big dataset comes from the Met's own internal database. What's the (presumably) relational schema that's being used?

I strongly feel that making the schema public will allow easier and more efficient navigation of this massive dataset (as opposed to dealing with one big 43-column heterogeneous table).

To help other users with this massive table, I've attached a plain-text list of the field names, one per line: field_names.txt

abetusk commented 7 years ago

@ajgabz, thanks for providing that file.

In case anyone wants it, here's a *nix one-liner that will get that header:

head -n1 MetObjects.csv | csvtool col 1- -u TAB - | sed 's/\t/\n/g'
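For anyone without csvtool installed, a rough standard-library Python equivalent might look like the sketch below. The helper name `read_field_names` is made up for illustration, the sample row is invented, and the `utf-8-sig` encoding in the commented usage is an assumption about the file.

```python
import csv
import io


def read_field_names(stream):
    """Return the column names from the first row of a CSV stream."""
    # csv.reader copes with quoted names that contain commas.
    return next(csv.reader(stream), [])


# Tiny in-memory stand-in for the real file (made-up sample values).
sample = io.StringIO('Object Number,Is Highlight,"Title"\n1979.486.1,False,Coin\n')
field_names = read_field_names(sample)
print("\n".join(field_names))

# For the real file, something like:
# with open("MetObjects.csv", newline="", encoding="utf-8-sig") as f:
#     print("\n".join(read_field_names(f)))
```

Like the one-liner, this reads only the first row, so it stays fast even on the full file.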

danfowler commented 7 years ago

I'd recommend publishing in Table Schema which can be then embedded in a Data Package.
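To make that suggestion concrete, here is a minimal sketch of generating a Table Schema from a list of field names and embedding it in a Data Package descriptor. The package and resource names are hypothetical, and every field is typed as `string` only as a placeholder, since the thread does not state the real column types.

```python
import json


def build_data_package(field_names, csv_path="MetObjects.csv"):
    """Wrap a bare-bones Table Schema in a Data Package descriptor."""
    schema = {
        # Every field is typed "string" as a placeholder; real types
        # (integer, boolean, date, ...) would need to be filled in by hand.
        "fields": [{"name": name, "type": "string"} for name in field_names]
    }
    return {
        "name": "met-open-access",  # hypothetical package name
        "resources": [
            {"name": "met-objects", "path": csv_path, "schema": schema}
        ],
    }


# Three illustrative field names; the real header has 43 columns.
package = build_data_package(["Object ID", "Title", "Department"])
print(json.dumps(package, indent=2))
```

Saving the printed JSON as `datapackage.json` next to the CSV is the usual convention for a Data Package.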

avitalp commented 7 years ago

@danfowler @abetusk @ajgabz I took a first run at this: https://github.com/avitalp/metmuseum-oa-explore

ewg118 commented 7 years ago

Does the Met's internal database store identifiers in controlled vocabulary systems, like the Getty ULAN, TGN, and AAT? If so, it would be beneficial to include these in the CSV output in order to normalize the data into RDF more efficiently and accurately.

sotojuan commented 7 years ago

How much use would people get out of putting a SQL file like @avitalp's behind a RESTful API, so you can make GET requests to it and the like?

I mean, they do seem to have an API ([example](http://www.metmuseum.org/api/Collection/additionalImages?crdId=437853)) but it doesn't have any useful information like a title, artist, etc.

It'd be a fun project for me; I just want to know if people would find it useful.
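As a rough idea of what such an API could look like, here is a standard-library sketch that answers `GET /objects/<id>` with JSON from a SQLite database. Everything in it is an assumption for illustration: the `objects` table, its three columns, the URL layout, and the sample row do not come from @avitalp's SQL file or from the Met.

```python
import json
import sqlite3

COLUMNS = ("object_id", "title", "artist")  # hypothetical column names


def make_app(conn):
    """Build a WSGI app that serves GET /objects/<id> as JSON."""

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        object_id = path.rsplit("/", 1)[-1]
        row = None
        is_get = environ.get("REQUEST_METHOD", "GET") == "GET"
        if is_get and path.startswith("/objects/") and object_id.isdigit():
            # Parameterized query, so the id is never spliced into the SQL.
            row = conn.execute(
                "SELECT object_id, title, artist FROM objects"
                " WHERE object_id = ?",
                (int(object_id),),
            ).fetchone()
        if row is None:
            status, payload = "404 Not Found", {"error": "not found"}
        else:
            status, payload = "200 OK", dict(zip(COLUMNS, row))
        body = json.dumps(payload).encode("utf-8")
        start_response(status, [("Content-Type", "application/json"),
                                ("Content-Length", str(len(body)))])
        return [body]

    return app


# In-memory stand-in for a database loaded from a SQL dump.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (object_id INTEGER PRIMARY KEY, title TEXT, artist TEXT)"
)
conn.execute("INSERT INTO objects VALUES (1, 'Sample title', 'Sample artist')")
app = make_app(conn)

# To serve it locally, something like:
# from wsgiref.simple_server import make_server
# make_server("localhost", 8000, app).serve_forever()
```

A real service would presumably want search and paging as well; this only covers lookup by ID.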

VladimirAlexiev commented 7 years ago

@danfowler instead of Table Schema, isn't it better to use the CSVW standard?

danfowler commented 7 years ago

@VladimirAlexiev Either or both would be cool! Also, I think it would be pretty easy to translate to one from the other. I am coming from the project that developed the Table Schema specification, so I would like to try out the dataset with the tools we have that support it.
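A rough sketch of that translation, going from a Table Schema descriptor to CSVW table metadata, might look like this. The type mapping covers only a handful of basic types and falls back to `string`, the name sanitizing rule is a simplification, and the two-field schema is illustrative rather than the Met's actual schema.

```python
import re

# Table Schema type -> CSVW datatype; anything unlisted falls back to "string".
TYPE_MAP = {
    "string": "string",
    "integer": "integer",
    "number": "number",
    "boolean": "boolean",
    "date": "date",
    "datetime": "datetime",
}


def table_schema_to_csvw(schema, csv_url):
    """Translate a Table Schema descriptor into CSVW table metadata."""
    columns = []
    for field in schema["fields"]:
        columns.append({
            # CSVW column names cannot contain spaces, so the original
            # header goes in "titles" and a sanitized form in "name".
            "name": re.sub(r"[^A-Za-z0-9]+", "_", field["name"]).strip("_").lower(),
            "titles": field["name"],
            "datatype": TYPE_MAP.get(field.get("type", "string"), "string"),
        })
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {"columns": columns},
    }


# Illustrative two-field schema, not the real one.
schema = {"fields": [{"name": "Object ID", "type": "integer"},
                     {"name": "Title", "type": "string"}]}
metadata = table_schema_to_csvw(schema, "MetObjects.csv")
```

Constraints, formats, and keys would need extra handling in both directions, so this only shows the straightforward part of the mapping.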

metasj commented 2 years ago

I'd also like to see this schema maintained and updated, along with a SQL format (like avitalp's).

Related useful elements

- Internal vocabularies used (e.g. for categorization, collection, era, region)
- Mappings between schemas + vocabularies and standard [less-specialized] open schemas + vocabularies used by other museums
- A better solution for hosting/accessing the large files (CSV/SQL), noting avitalp's comments on them quickly running out of monthly download quota

stephenhmarsh commented 2 years ago

Thanks everyone for the great discussion here. The data in the CSV (and everywhere else) is mostly from our TMS database. I personally would love to make a replica of this database public at some point, if we could get approval.