project-open-data / catalog-generator

A multi-format tool to generate and maintain agency.gov/data catalog files.
http://project-open-data.github.com/catalog-generator/

The JSON generated by the tool does not conform to the standard #20

Open cew821 opened 10 years ago

cew821 commented 10 years ago

The generator outputs every field as a plain JSON string, e.g.

{ "keywords":"this, that, the other" }

This is not compliant with the standard, which has more specific requirements for how to represent the objects. For example:

{ "keywords": ["this", "that", "the other"] }
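The difference between the two forms can be sketched in a few lines of Python. This is only an illustration of the conversion, not the generator's actual code; `split_list` is a hypothetical helper name, and it assumes items never contain a comma themselves.

```python
import json

def split_list(value):
    """Split a comma-separated string into a list of trimmed, non-empty items."""
    # Assumption: individual items never contain a comma.
    return [item.strip() for item in value.split(",") if item.strip()]

# What the generator currently emits.
generated = json.loads('{ "keywords": "this, that, the other" }')

# What the standard asks for: an array of strings.
fixed = {"keywords": split_list(generated["keywords"])}
print(json.dumps(fixed))  # {"keywords": ["this", "that", "the other"]}
```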

See @dwcaraway's helpful schema: https://github.com/dwcaraway/podschema/blob/master/schema/schema.json

Because the JSON generated by this tool isn't in the right format, I'm not sure how useful it will be, though I guess it's better than nothing.

I wonder if the generator could be made to produce better output? Specifically:

I can try to help with this, but I'm having a hard time figuring out where in the library this is done. I'm a little familiar with Backbone, but not enough to quickly identify where "the work" of processing the input into JSON is happening. Can you point me in the right direction?

dwcaraway commented 10 years ago

@cew821 Glad the JSON schema is useful. I just issued a pull request (https://github.com/project-open-data/project-open-data.github.io/pull/172) to Project Open Data to get the JSON schema adopted as the common format in which we'll express the Common Core Metadata requirements.

Just an FYI, in addition to automatically validating JSON (see http://dwcaraway.github.io/podschema/validate.html) the schema can be used to generate a form automatically (see http://dwcaraway.github.io/podschema/form.html) which can easily be hooked to a database and can pull in the latest JSON schema from project-open-data so it's always up-to-date.
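To give a rough idea of what automated validation checks, here is a minimal standard-library sketch. It does not load or implement the real JSON schema; `EXPECTED_TYPES` is a hypothetical, hand-written subset of rules based on the fields discussed in this thread, and `find_type_errors` is an illustrative name.

```python
# Hypothetical subset of type rules; the real schema covers many more fields
# and constraints than a simple type check.
EXPECTED_TYPES = {
    "title": str,
    "keyword": list,
    "dataQuality": bool,
}

def find_type_errors(record):
    """Return a list of human-readable type mismatches for one dataset record."""
    errors = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            continue  # presence checks are out of scope for this sketch
        value = record[field]
        if not isinstance(value, expected):
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
        elif expected is list and not all(isinstance(v, str) for v in value):
            errors.append(f"{field}: every item must be a string")
    return errors

record = {"title": "data 1", "keyword": "key1, key2", "dataQuality": "true"}
for problem in find_type_errors(record):
    print(problem)
```

For real use, a full JSON Schema validator run against the published schema would be the better choice; the sketch only shows the kind of mismatch such a validator reports.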

gbinal commented 10 years ago

Thanks. I'm also seeing this. I definitely think that this is a significant resource but I'm not sure if the best use of time is to fix each of these elements or focus on alternate paths like building off of Dave's schema.

@benbalter - any thoughts on this?

gbinal commented 10 years ago

As an update, below is a sample of the output. It looks like the issue of parsing into arrays comes into play with 'keyword', 'theme', and 'references', but there's also a related issue of how 'distribution' should work with this. I'm not sure if the best move is to address them together or if that mixes up too much logic.

[
    {
        "title": "data 1 ",
        "description": "what it is",
        "keyword": "key1, key2",
        "modified": "2012-01-15",
        "publisher": "GSA",
        "contactPoint": "John Smith",
        "mbox": "john.smith@gsa.gov",
        "identifier": "gsa-1123",
        "accessLevel": "public",
        "accessLevelComment": "In order to access this dataset, visit 123 washington st.  ",
        "bureauCode": "011:22",
        "programCode": "011:111",
        "accessURL": "http://www.agency.gov/data.xml",
        "webService": "http://www.agency.gov/data.json",
        "format": "application/xml",
        "license": "CC-0",
        "spatial": "United States",
        "temporal": "2011",
        "theme": "energy, education",
        "dataDictionary": "http://www.agency.gov/data/data.html",
        "dataQuality": "true",
        "accrualPeriodicity": "monthly",
        "distribution": "notsurewhattoput?",
        "landingPage": "http://www.agency.gov/data_this",
        "language": "en-US",
        "PrimaryITInvestmentUII": "12-121234121",
        "references": "http://www.agency.gov/data.pdf, http://www.agency.gov/otherhub/data.doc",
        "issued": "2012-01-22",
        "systemOfRecords": "http://www.agency.gov/oira/data-record.html"
    }
]

gbinal commented 10 years ago

Charles, it seems to me that the only pressing problem with the file generation is the issue of comma-separated strings vs. arrays of strings for keywords, themes, and references. I have split that off as a specific issue, #21. Do you think I'm missing anything else crucial? [E.g., I think that converting the date an end user enters into a proper date format would be good but is not essential.]
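A post-processing pass over the generated catalog could look roughly like the sketch below. It is an assumption about one way to fix the output after the fact, not a description of the generator's internals; `LIST_FIELDS` and `normalize_record` are illustrative names, and the sample record is trimmed to a few fields from the output above.

```python
import json

# The three fields identified above as needing arrays of strings.
LIST_FIELDS = ("keyword", "theme", "references")

def normalize_record(record):
    """Return a copy of the record with comma-separated fields split into lists."""
    fixed = dict(record)
    for field in LIST_FIELDS:
        value = fixed.get(field)
        if isinstance(value, str):
            # Caveat: a naive split would break any URL that contains a comma.
            fixed[field] = [part.strip() for part in value.split(",") if part.strip()]
    return fixed

sample = {
    "title": "data 1 ",
    "keyword": "key1, key2",
    "theme": "energy, education",
    "references": "http://www.agency.gov/data.pdf, http://www.agency.gov/otherhub/data.doc",
}

print(json.dumps(normalize_record(sample), indent=4))
```

Fields that are already lists, or that are absent, are left alone, so the pass is safe to run more than once on the same catalog.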

cew821 commented 10 years ago

There are a few additional fields that need to be in arrays of strings, not strings, regardless of how many items are in the array. These include:

Also, dataQuality needs to be a boolean, not a string, i.e. true not "true".
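The string-to-boolean fix can be sketched as follows. `to_boolean` is a hypothetical helper; which spellings to accept is an assumption, and this version deliberately rejects anything other than "true" or "false" rather than guessing.

```python
import json

def to_boolean(value):
    """Convert the strings "true"/"false" (any case) to a real boolean."""
    if isinstance(value, bool):
        return value  # already correct, leave untouched
    text = str(value).strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"cannot interpret {value!r} as a boolean")

record = {"dataQuality": "true"}
record["dataQuality"] = to_boolean(record["dataQuality"])
print(json.dumps(record))  # {"dataQuality": true}
```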