nyphilarchive / PerformanceHistory

New York Philharmonic Performance History Metadata
Creative Commons Zero v1.0 Universal

Create JSON converter and publish converted JSON #11

Closed freethejazz closed 7 years ago

freethejazz commented 8 years ago

I started working on this today after hearing about the project from the FiveThirtyEight podcast. I've got the JSON converter and resulting JSON, but will need to do some further testing and cleanup before being ready to offer it as a contribution. A few decisions I made which I'd like to get your feedback on:

  1. In the Programs directory, I created two subdirectories xml and json. Does that seem reasonable for organizing and separating the formats?
  2. There are a few places where the XML structure is extra verbose after a direct translation to JSON. Does the sample below seem like a reasonable format? If given free rein, I'd rename worksInfo to works, which would match the programs/program and soloists/soloist structure, but I'm not super opinionated on it. Just a thought.
{
  "programs": [
    {
      "id": "38e072a7-8fc9-4f9a-8eac-3957905c0002",
      "programID": "3853",
      "orchestra": "New York Philharmonic",
      "season": "1842-43",
      "concertInfo": {
        "eventType": "Subscription Season",
        "Location": "Manhattan, NY",
        "Venue": "Apollo Rooms",
        "Date": "1842-12-07T05:00:00Z",
        "Time": "8:00PM"
      },
      "worksInfo": [
        {
          "ID": "8834*4",
          "composerName": "Weber,  Carl  Maria Von",
          "workTitle": "OBERON",
          "movement": "\"Ozean, du Ungeheuer\" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II",
          "conductorName": "Timm, Henry C.",
          "soloists": [
            {
              "soloistName": "Otto, Antoinette",
              "soloistInstrument": "Soprano",
              "soloistRoles": "S"
            },
            /* more soloists */
          ]
        },
        /* more works */
      ]
    },
    /* more programs */
  ]
}
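For a sense of how this shape reads from the consumer side, here is a minimal Python sketch. It is not part of the converter: `list_works` is a hypothetical helper, and the sample is a trimmed copy of the one above with the placeholder comments removed so that it parses as valid JSON.

```python
import json

# Trimmed copy of the sample above; the /* ... */ placeholders are removed
# because standard JSON does not allow comments.
SAMPLE = """
{
  "programs": [
    {
      "programID": "3853",
      "season": "1842-43",
      "concertInfo": {"Venue": "Apollo Rooms", "Date": "1842-12-07T05:00:00Z"},
      "worksInfo": [
        {
          "ID": "8834*4",
          "workTitle": "OBERON",
          "conductorName": "Timm, Henry C.",
          "soloists": [
            {"soloistName": "Otto, Antoinette", "soloistInstrument": "Soprano"}
          ]
        }
      ]
    }
  ]
}
"""


def list_works(data):
    """Yield (season, work title, conductor) for every work in every program."""
    for program in data["programs"]:
        for work in program["worksInfo"]:
            yield program["season"], work["workTitle"], work["conductorName"]


works = list(list_works(json.loads(SAMPLE)))
```

Two nested loops reach every work, which is the main thing the flatter layout buys over a direct XML translation.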

This is where I've got things now: https://github.com/freethejazz/PerformanceHistory/tree/feature/add-json-converter

mjbrodsky commented 8 years ago

This is awesome...thanks for pitching in! Regarding your questions,

  1. Yes, this is exactly how I would arrange it.
  2. In terms of structure this makes sense. I see what you mean regarding concerts, works, and soloists rather than concertInfo, worksInfo, and soloists. Let's change to that for the JSON. I would like to leave the XML alone though so we don't create a hassle for anyone who pulls in the future to update an existing app.
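One way the JSON-only rename could be applied without touching the XML is as a post-processing step on the parsed data. The Python sketch below is only an illustration under that assumption: `RENAMES` and `rename_keys` are hypothetical names, not part of any existing script.

```python
# Hypothetical mapping from the XML-derived key names to the shorter JSON names.
RENAMES = {"concertInfo": "concerts", "worksInfo": "works"}


def rename_keys(node):
    """Return a copy of a parsed JSON value with keys renamed per RENAMES."""
    if isinstance(node, dict):
        return {RENAMES.get(key, key): rename_keys(value) for key, value in node.items()}
    if isinstance(node, list):
        return [rename_keys(item) for item in node]
    return node  # strings, numbers, booleans, and None pass through unchanged


program = {
    "programID": "3853",
    "concertInfo": {"Venue": "Apollo Rooms"},
    "worksInfo": [{"workTitle": "OBERON"}],
}
renamed = rename_keys(program)
```

Because the function builds a new structure instead of mutating its input, the XML-shaped data stays available for anyone who still depends on the original names.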

I look forward to the contribution! After you make the pull request, I will accept it, test on my side, and integrate the script into our local workflow so the XML and JSON are generated simultaneously in the future.

hrecht commented 8 years ago

This would be awesome. One question - the previous commit had separate work IDs and movement IDs. I think that was a little more intuitive and easier to work with than the current setup. Is the current workid*movementid format final or still being modified?

mjbrodsky commented 8 years ago

We decided to keep them concatenated because a movement never stands alone - it is always a part of a work. Also, if multiple movements are played from the same work, we repeat the entire work for each movement rather than nesting multiple movements within the work.

It also serves our own purposes...we have an internal reporting tool that is very inflexible, and concatenating these IDs made it a bit easier to match up the Github data to other data I was trying to report from our backend system.
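For anyone who needs the two parts separately in the meantime, the concatenated value can be split on the asterisk. A minimal Python sketch follows; `split_work_id` is a hypothetical helper, and treating a missing or empty movement part as `None` is an assumption, not documented behavior of the data.

```python
def split_work_id(combined):
    """Split a 'workID*movementID' string into (work_id, movement_id).

    Assumption: if there is no asterisk, or nothing follows it, the movement
    part is reported as None.
    """
    work_id, _, movement_id = combined.partition("*")
    return work_id, (movement_id or None)


work_id, movement_id = split_work_id("8834*4")
```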

I'm open to making modifications, though. What if we did this:

<work ID="1234" workMov="1234*2">
     <composerName />
     <workTitle />
     <movement ID="2" />
     <conductorName />
     <soloists />
</work>

A little redundant maybe but I'm ok doing it if you think it would help people out. Or if you have a different idea let me know.

freethejazz commented 8 years ago

Maybe this is an opportunity to split movement and work out in the JSON representation? It could be another case of adjusting fields in a newer representation that doesn't get back-ported. Something like:

"worksInfo": [
        {
          "workID": "8834",
          "movementID": "4",
          "composerName": "Weber,  Carl  Maria Von",
          "workTitle": "OBERON",
          "movement": "\"Ozean, du Ungeheuer\" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II",
          "conductorName": "Timm, Henry C.",
          "soloists": [
            {
              "soloistName": "Otto, Antoinette",
              "soloistInstrument": "Soprano",
              "soloistRoles": "S"
            },
            /* more soloists */
          ]
        },

reshamas commented 8 years ago

@mjbrodsky What's the status on the data conversion?

freethejazz commented 8 years ago

@reshama Whoops! I got sidetracked at some point after realizing I was likely handling concerts incorrectly. I was assuming there would be only one concert for each program, but this is explicitly not true. The main issue is that right now, the concerts key will sometimes be an object (if there's a single concert) and sometimes be an array (multiple concerts).
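Until that is fixed, anyone reading the current files can defensively normalize the value on their side. This is a minimal Python sketch of that workaround, not the converter fix itself; `as_list` is a hypothetical helper and the two program dicts are made-up examples of the two shapes.

```python
def as_list(value):
    """Return value as a list: [] for None, the list itself, or [value]."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Made-up examples of the two shapes the concerts key can currently take.
single = {"concerts": {"Venue": "Apollo Rooms"}}
multiple = {"concerts": [{"Venue": "Apollo Rooms"}, {"Venue": "Carnegie Hall"}]}

venues_single = [c["Venue"] for c in as_list(single.get("concerts"))]
venues_multiple = [c["Venue"] for c in as_list(multiple.get("concerts"))]
```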

I should be able to update that and submit a PR by Wednesday evening. If you want a head start with the JSON data as is: https://github.com/freethejazz/PerformanceHistory/tree/feature/add-json-converter/Programs/json

freethejazz commented 8 years ago

Went ahead and did it tonight, @reshama. If you want to get your hands on the fully working JSON files before they make it into this repo, the previous link I posted will direct you to the updated version.

nyphil commented 8 years ago

Thank you, @freethejazz and @reshama, for your contributions. Currently we are searching for a new Digital Archives manager. Is there anything specific you need from us to move this along? What are some potential implementations of a JSON data set?

freethejazz commented 8 years ago

Hey @nyphil! I've submitted a pull request including both the converted JSON and a script that can be run to update the JSON files when the XML is updated. The one thing I forgot to add was a note in the top-level README stating that the JSON files exist. I'll update that tonight.

Once that's done, all you'd need to do is merge the pull request and then the greater community would have access to the JSON data set. If you're not super familiar with GitHub, I'd be more than happy to walk you through the process.

The reason you'd want a JSON version of the performance data is that it's generally more straightforward to consume from many applications. Many developers versed in web technologies are familiar with JSON, as it's the de facto standard for passing information between servers and browsers. This would greatly reduce the amount of friction involved in starting up a new project that leverages the program data.

nyphil commented 8 years ago

Thanks Jonathan! We'll test it out and do our own commit to the master once everything is working.