openstates / enhancement-proposals

Open States Enhancement Proposals

Standardize bulk data file access #17

Open mattheiler opened 3 years ago

mattheiler commented 3 years ago

Right now, bulk data file naming and structure are inconsistent across states, and there is no manifest of what's available. I'd propose, should it prove viable, the following file structure, accessible via an FTP server:

/<state>
  /<legislature>
    /<session>
      bill_abstracts.csv
      bill_actions.csv
      bill_sources.csv
      bill_sponsorships.csv
      bill_version_links.csv
      bill_versions.csv
      vote_counts.csv
      vote_people.csv
      vote_sources.csv
      votes.csv

FTP provides a standard protocol for easy manual and programmatic access, plus authentication/authorization. Consistent file naming and structure promote data organization and let users/systems consume the data predictably. Other data formats could be placed in the same directory using the same naming conventions.
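
To make the "predictable consumption" point concrete, here's a rough sketch of what a client could look like against that layout - the host, legislature, and session values are placeholders, and anonymous FTP is assumed:

from ftplib import FTP
from pathlib import Path

HOST = "ftp.example.org"                            # hypothetical bulk-data host
STATE, LEGISLATURE, SESSION = "MN", "90", "2017s1"  # placeholder values

def download_session(state, legislature, session):
    """Mirror every file for one session into a matching local directory."""
    remote_dir = f"/{state}/{legislature}/{session}"
    local_dir = Path(remote_dir.lstrip("/"))
    local_dir.mkdir(parents=True, exist_ok=True)
    with FTP(HOST) as ftp:
        ftp.login()                                  # anonymous access assumed
        ftp.cwd(remote_dir)
        for name in ftp.nlst():                      # ls-style listing, no manifest needed
            with open(local_dir / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)

download_session(STATE, LEGISLATURE, SESSION)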

jamesturk commented 3 years ago

Thanks for sharing this idea here.

Let's take one step back and discuss the problem we're trying to solve. There are a few issues I see with this technical implementation, but understanding the reasons for these decisions and the desired outcome would help us plan something that could meet these needs and others.

A couple of questions to hopefully address:

I can see a few options for improvements coming out of this such as providing a JSON manifest or getting us back to more predictable file names for the zip files. Glad to work to try to find a solution here.

mattheiler commented 3 years ago

I’ve been using the v3 API, but quickly hit a 429 status code. That’s a separate issue, and one that could easily be fixed, but I'd also imagine the API serves a different use case entirely: smaller, targeted data. My primary use case is parsing the older data. It’s much more efficient to pull down bulk data at rest and then run an expensive ETL process; that same data can be tagged and reused, transactionally. I could build retry policies, construct semaphores to rate limit client access, paginate the data, and then process it, but that seems expensive and prone to concurrency issues. Secondarily, consistent and predictable access to the bulk data seems necessary to break it down into consumable chunks - easier to read/understand and program for. I see the API as a great tool for pulling smaller updates when caches are invalidated or the like, outside of the bulk ETL use case.
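
For illustration, this is roughly the machinery I'd be writing around the API instead - the endpoint, parameters, and key header below are my assumptions, not a spec:

import time
import requests

API_URL = "https://v3.openstates.org/bills"  # illustrative; check the v3 docs for the real surface
API_KEY = "..."                              # your key

def fetch_page(params, max_retries=5):
    """GET one page of results, backing off whenever the API answers 429."""
    for attempt in range(max_retries):
        resp = requests.get(API_URL, params=params, headers={"X-API-KEY": API_KEY})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # honor Retry-After if the server sends it, otherwise back off exponentially
        time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
    raise RuntimeError("rate limited too many times")

# e.g. fetch_page({"jurisdiction": "Minnesota", "session": "2017s1", "page": 1})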

mattheiler commented 3 years ago

An alternative to FTP could be HTTPS, but a client would need to be built in order to extract the data. I have the US Census TIGER FTP server in mind: download all the data you need, legacy included, and then process it. I imagine state legislative data rarely changes after the session expires.

mattheiler commented 3 years ago

A listing of what zips are available, and consistent naming conventions, are both missing, from what I can tell - unless I overlooked something.

jamesturk commented 3 years ago

openstates.org/data has the zip files I'm referencing - have you tried using those?

mattheiler commented 3 years ago

Yes, that's what I was looking at.

The "Bill & Vote CSV Data" (e.g., https://data.openstates.org/csv/latest/MN_2017s1_csv_1WmTzoKh4tCRQRBBIR6DDI.zip) is the data I want, but the links are inconsistent and CSV isn't the best format for my use case.

"PostgreSQL Data Dump" has the bulk data I want, but requires a bit of setup to access... I could setup Docker and access the database through the container.

"Legacy Data Dumps" is the data and format I'd like (at a quick glance), but it looks like it's stale and won't be updated.

jamesturk commented 3 years ago

Can you elaborate a bit on these two things:

If the links didn't have the random hashes on the end, would the data.openstates.org/csv/ stuff be close to what you want?

mattheiler commented 3 years ago

Sure!

A.) For HTTPS, I would need to write a library/application that downloads the files from each resource path. That's fine, but what's missing is something like a manifest/index that tells me what's in the directory/path/route. If I'm trying to walk through all the available data, I need to know that the resource at "~/MT/116/2/bills.json" is available, or I have to keep traversing until I hit a 404. With FTP, you could download just the chunk/slice you need using an FTP client (e.g., FileZilla) - I can request all files for Montana. You could still run into use cases for manifests/indexes with FTP, but at least you can download the entire folder recursively and then process it. HTTPS works, if that's easier to manage.
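
To show what I mean by traversing until a 404 - without a manifest, discovery over HTTPS is just guessing paths (everything below is hypothetical):

import requests

BASE = "https://data.example.org"     # hypothetical bulk-data host
available = []
for legislature in range(60, 120):    # guessed range of legislature numbers
    for number in (1, 2):
        url = f"{BASE}/MT/{legislature}/{number}/bills.json"
        if requests.head(url).status_code == 200:
            available.append(url)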

B.) JSON would be preferable to CSV; that would be great. Right now, I'd have to scrape the DOM to get all the links, ignore the checksum hashes, and handle the following conventions in order to extract metadata and determine what's available in bulk, programmatically (a parsing sketch follows these examples):

https://data.openstates.org/json/latest/ia_2017-2018_json_6DPvA516TSfYZac9ZMiGgL.zip
https://data.openstates.org/json/latest/in_2018ss1_json_4hvwnIwvxrPipjyLnUEK4w.zip
https://data.openstates.org/json/latest/ky_2017RS_json_16dEpSSGPNrp8rRgJESd0s.zip
https://data.openstates.org/json/latest/ky_2019SS_json_1Kz72iU6XG1BxDWraUu8WF.zip
https://data.openstates.org/json/latest/la_2018%201st%20Extraordinary%20Session_json_Ne5BCiGPXvgHhTbE2OjRq.zip
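
Something like this is what I'd have to maintain just to read those names - the pattern is my guess at the current convention, not anything documented:

import re
from urllib.parse import unquote

# <state>_<session label>_<format>_<hash>.zip -- session labels vary wildly
# (year ranges, "2018ss1", "2017RS", URL-encoded free text, ...)
PATTERN = re.compile(
    r"^(?P<state>[a-z]{2})_(?P<session>.+)_(?P<fmt>json|csv)_(?P<hash>[A-Za-z0-9]+)\.zip$",
    re.IGNORECASE,
)

def parse(url):
    """Pull state, session, format, and hash out of a bulk zip URL."""
    name = unquote(url.rsplit("/", 1)[-1])
    m = PATTERN.match(name)
    if not m:
        raise ValueError(f"unrecognized bulk file name: {name}")
    return m.groupdict()

# parse(".../la_2018%201st%20Extraordinary%20Session_json_Ne5BCiGPXvgHhTbE2OjRq.zip")
# -> {'state': 'la', 'session': '2018 1st Extraordinary Session', 'fmt': 'json', 'hash': 'Ne5BCiGPXvgHhTbE2OjRq'}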

So, I guess given A and B, maybe we can adjust the request to: Bulk Data API w/ JSON resources? My thought was that an FTP file dump would be an easier/more direct route to the bulk data, but either does the job.

Edit: the FTP protocol supports directory listings (the ls command in most clients), which would negate the need for a file directory manifest/index.

jamesturk commented 3 years ago

I agree & think a manifest that contains links to all of the files would be doable here, and would be the main addition to what we already have (the manifest could list both JSON & CSV versions).

mattheiler commented 3 years ago

Would something like this work?

swagger: "2.0"
info:
  version: "1.0.0"
  title: "Open States Bulk Data API"
paths:
  /sessions:
    get:
      produces:
      - "application/json"
      parameters:
      - name: "state"
        in: "query"
        type: "string"
        enum:
        - "AL"
        - "AK"
        - "AS"
        - ".."
        - "WY"
      - name: "legislature"
        in: "query"
        type: "integer"
      - name: "number"
        in: "query"
        type: "integer"
        enum:
        - 1
        - 2
      - name: "classification"
        in: "query"
        type: "string"
        enum:
        - "regular"
        - "special_or_extraordinary"
        - "interim"
      responses:
        "200":
          description: "success"
          schema:
            type: "array"
            items:
              $ref: "#/definitions/Session"
definitions:
  Session:
    type: "object"
    properties:
      state:
        type: "string"
      legislature:
        type: "integer"
      number:
        type: "integer"
      classification:
        type: "string"
        enum:
        - "regular"
        - "special_or_extraordinary"
        - "interim"
      files:
        type: "array"
        items:
          $ref: "#/definitions/SessionFile"
  SessionFile:
    type: "object"
    properties:
      url:
        type: "string"
      name:
        type: "string"
      type:
        type: "string"
        enum:
        - "bill_abstracts"
        - "bill_actions"
        - "bill_sources"
        - "bill_sponsorships"
        - "bill_version_links"
        - "bill_versions"
        - "bills"
        - "vote_counts"
        - "vote_people"
        - "vote_sources"
        - "votes"

mattheiler commented 3 years ago

Or if you wanted to return the raw JSON... I suppose that's what you had in mind, reading back.

jamesturk commented 3 years ago

Yes, I think writing out a static file would be the preferred way to handle this since it doesn't change frequently enough IMO to warrant an API.

Would you need much more than this:

{
  "files": [
    {
      "jurisdiction": "ocd-jurisdiction/country:us/state:nc",
      "session": "2019",
      "type": "JSON Bulk Download",
      "updated_at": "2020-01-01T12:34:56Z",
      "url": "https://data.openstates.org/path/to/file"
    },
    {
      "jurisdiction": "ocd-jurisdiction/country:us/state:nc",
      "session": "2020",
      "type": "JSON Bulk Download",
      "updated_at": "2020-01-01T12:34:56Z",
      "url": "https://data.openstates.org/path/to/file"
    }
  ]
}
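
For the sake of argument, consuming a static manifest in that shape would only take something like this on the client side - the manifest URL is made up, and the fields are just the ones from the example above:

import requests

MANIFEST_URL = "https://data.openstates.org/manifest.json"  # hypothetical location

manifest = requests.get(MANIFEST_URL).json()
for entry in manifest["files"]:
    # filter the static manifest, then download whatever matches
    if entry["jurisdiction"].endswith("state:nc") and entry["type"] == "JSON Bulk Download":
        blob = requests.get(entry["url"]).content
        with open(entry["url"].rsplit("/", 1)[-1], "wb") as fh:
            fh.write(blob)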

mattheiler commented 3 years ago

Sorry for the delayed response. I think session number would implicitly cover legislature number. State would be covered by jurisdiction. Session classification would be nice, if not provided in the files (e.g., regular, special_or_extraordinary, interim). Would there be one file with all of the bulk data (i.e., bill_abstracts, bill_actions, bill_sources, etc.)? If so, that should be enough!