openstates / enhancement-proposals

Open States Enhancement Proposals

Standardize bulk data file access #17

Open mattheiler opened 3 years ago

mattheiler commented 3 years ago

Right now, bulk data file naming and structure are inconsistent across states, and there is no manifest of what's available. I'd propose, should it prove viable, the following file structure, accessible via an FTP server:

/<state>
  /<legislature>
    /<session>
      bill_abstracts.csv
      bill_actions.csv
      bill_sources.csv
      bill_sponsorships.csv
      bill_version_links.csv
      bill_versions.csv
      vote_counts.csv
      vote_people.csv
      vote_sources.csv
      votes.csv

FTP provides a standard protocol for easy manual and programmatic access, plus authentication/authorization. Consistent file naming and structure promote data organization and let users/systems consume the data predictably. Other data formats could be placed in the same directory using the same naming conventions.
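
To make the "predictable consumption" point concrete, here's a rough sketch of what a client could look like against that layout - the host, legislature, and session values are placeholders, and anonymous FTP is assumed:

from ftplib import FTP
from pathlib import Path

HOST = "ftp.example.org"                            # hypothetical bulk-data host
STATE, LEGISLATURE, SESSION = "MN", "90", "2017s1"  # placeholder values

def download_session(state, legislature, session):
    """Mirror every file for one session into a matching local directory."""
    remote_dir = f"/{state}/{legislature}/{session}"
    local_dir = Path(remote_dir.lstrip("/"))
    local_dir.mkdir(parents=True, exist_ok=True)
    with FTP(HOST) as ftp:
        ftp.login()                                  # anonymous access assumed
        ftp.cwd(remote_dir)
        for name in ftp.nlst():                      # ls-style listing, no manifest needed
            with open(local_dir / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)

download_session(STATE, LEGISLATURE, SESSION)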

jamesturk commented 3 years ago

Thanks for sharing this idea here.

Let's take one step back and discuss the problem we're trying to solve. There are a few issues I see with this technical implementation, but understanding the reasons for these decisions and the desired outcome would help us plan something that could meet these needs and others.

A couple of questions to hopefully address:

I can see a few options for improvements coming out of this such as providing a JSON manifest or getting us back to more predictable file names for the zip files. Glad to work to try to find a solution here.

mattheiler commented 3 years ago

I’ve been using the v3 API, but quickly hit a 429 status code. That’s a separate issue, and one that could easily be fixed, but I'd also imagine the API serves a different use case entirely: smaller, targeted data. My primary use case is parsing the older data. It’s much more efficient to pull down bulk data at rest and then run an expensive ETL process; that same data can be tagged and reused, transactionally. I could build retry policies, construct semaphores to rate limit client access, paginate the data, and then process it, but that seems expensive and prone to concurrency issues. Secondarily, consistent and predictable access to the bulk data seems necessary to break it down into consumable chunks - easier to read/understand and program for. I see the API as a great tool for pulling smaller updates when caches are invalidated or the like, outside of the bulk ETL use case.
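
For illustration, this is roughly the machinery I'd be writing around the API instead - the endpoint, parameters, and key header below are my assumptions, not a spec:

import time
import requests

API_URL = "https://v3.openstates.org/bills"  # illustrative; check the v3 docs for the real surface
API_KEY = "..."                              # your key

def fetch_page(params, max_retries=5):
    """GET one page of results, backing off whenever the API answers 429."""
    for attempt in range(max_retries):
        resp = requests.get(API_URL, params=params, headers={"X-API-KEY": API_KEY})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # honor Retry-After if the server sends it, otherwise back off exponentially
        time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
    raise RuntimeError("rate limited too many times")

# e.g. fetch_page({"jurisdiction": "Minnesota", "session": "2017s1", "page": 1})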

mattheiler commented 3 years ago

An alternative to FTP could be HTTPS, but a client would need to be built in order to extract the data. I have the US Census TIGER FTP server in mind: download all the data you need, legacy included, and then process it. I imagine state legislative data rarely changes after the session expires.

mattheiler commented 3 years ago

A listing of what zips are available, and consistent naming conventions, are both missing, from what I can tell - unless I overlooked something.

jamesturk commented 3 years ago

openstates.org/data has the zip files I'm referencing - have you tried using those?

mattheiler commented 3 years ago

Yes, that's what I was looking at.

The "Bill & Vote CSV Data" (e.g., https://data.openstates.org/csv/latest/MN_2017s1_csv_1WmTzoKh4tCRQRBBIR6DDI.zip) is the data I want, but the links are inconsistent and CSV isn't the best format for my use case.

"PostgreSQL Data Dump" has the bulk data I want, but requires a bit of setup to access... I could setup Docker and access the database through the container.

"Legacy Data Dumps" is the data and format I'd like (at a quick glance), but it looks like it's stale and won't be updated.

jamesturk commented 3 years ago

Can you elaborate a bit on these two things:

If the links didn't have the random hashes on the end, would the data.openstates.org/csv/ stuff be close to what you want?

mattheiler commented 3 years ago

Sure!

A.) For HTTPS, I would need to write a library/application that downloads the files from each resource path. That's fine, but what's missing is something like a manifest/index that tells me what's in the directory/path/route. If I'm trying to walk through all the available data, I need to know that the resource at "~/MT/116/2/bills.json" is available, or I have to keep traversing until I hit a 404. With FTP, you could download just the chunk/slice you need using an FTP client (e.g., FileZilla) - I can request all files for Montana. You could still run into use cases for manifests/indexes with FTP, but at least you can download the entire folder recursively and then process it. HTTPS works, if that's easier to manage.
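
To show what I mean by traversing until a 404 - without a manifest, discovery over HTTPS is just guessing paths (everything below is hypothetical):

import requests

BASE = "https://data.example.org"     # hypothetical bulk-data host
available = []
for legislature in range(60, 120):    # guessed range of legislature numbers
    for number in (1, 2):
        url = f"{BASE}/MT/{legislature}/{number}/bills.json"
        if requests.head(url).status_code == 200:
            available.append(url)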

B.) JSON would be preferable to CSV; that would be great. Right now, I'd have to scrape the DOM to get all the links, ignore the checksum hashes, and handle the following conventions in order to extract metadata and determine what's available in bulk, programmatically (a parsing sketch follows these examples):

https://data.openstates.org/json/latest/ia_2017-2018_json_6DPvA516TSfYZac9ZMiGgL.zip
https://data.openstates.org/json/latest/in_2018ss1_json_4hvwnIwvxrPipjyLnUEK4w.zip
https://data.openstates.org/json/latest/ky_2017RS_json_16dEpSSGPNrp8rRgJESd0s.zip
https://data.openstates.org/json/latest/ky_2019SS_json_1Kz72iU6XG1BxDWraUu8WF.zip
https://data.openstates.org/json/latest/la_2018%201st%20Extraordinary%20Session_json_Ne5BCiGPXvgHhTbE2OjRq.zip
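
Something like this is what I'd have to maintain just to read those names - the pattern is my guess at the current convention, not anything documented:

import re
from urllib.parse import unquote

# <state>_<session label>_<format>_<hash>.zip -- session labels vary wildly
# (year ranges, "2018ss1", "2017RS", URL-encoded free text, ...)
PATTERN = re.compile(
    r"^(?P<state>[a-z]{2})_(?P<session>.+)_(?P<fmt>json|csv)_(?P<hash>[A-Za-z0-9]+)\.zip$",
    re.IGNORECASE,
)

def parse(url):
    """Pull state, session, format, and hash out of a bulk zip URL."""
    name = unquote(url.rsplit("/", 1)[-1])
    m = PATTERN.match(name)
    if not m:
        raise ValueError(f"unrecognized bulk file name: {name}")
    return m.groupdict()

# parse(".../la_2018%201st%20Extraordinary%20Session_json_Ne5BCiGPXvgHhTbE2OjRq.zip")
# -> {'state': 'la', 'session': '2018 1st Extraordinary Session', 'fmt': 'json', 'hash': 'Ne5BCiGPXvgHhTbE2OjRq'}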

So, I guess given A and B, maybe we can adjust the request to: Bulk Data API w/ JSON resources? My thought was that an FTP file dump would be an easier/more direct route to the bulk data, but either does the job.

Edit: the FTP protocol supports directory listings (the ls command in most clients), which would negate the need for a file directory manifest/index.

jamesturk commented 3 years ago

I agree & think a manifest that contains links to all of the files would be doable here, and would be the main addition to what we already have (the manifest could list both JSON & CSV versions).

mattheiler commented 3 years ago

Would something like this work?

swagger: "2.0"
info:
  version: "1.0.0"
  title: "Open States Bulk Data API"
paths:
  /sessions:
    get:
      produces:
      - "application/json"
      parameters:
      - name: "state"
        in: "query"
        type: "string"
        enum:
        - "AL"
        - "AK"
        - "AS"
        - ".."
        - "WY"
      - name: "legislature"
        in: "query"
        type: "integer"
      - name: "number"
        in: "query"
        type: "integer"
        enum:
        - 1
        - 2
      - name: "classification"
        in: "query"
        type: "string"
        enum:
        - "regular"
        - "special_or_extraordinary"
        - "interim"
      responses:
        "200":
          description: "success"
          schema:
            type: "array"
            items:
              $ref: "#/definitions/Session"
definitions:
  Session:
    type: "object"
    properties:
      state:
        type: "string"
      legislature:
        type: "integer"
      number:
        type: "integer"
      classification:
        type: "string"
        enum:
        - "regular"
        - "special_or_extraordinary"
        - "interim"
      files:
        type: "array"
        items:
          $ref: "#/definitions/SessionFile"
  SessionFile:
    type: "object"
    properties:
      url:
        type: "string"
      name:
        type: "string"
      type:
        type: "string"
        enum:
        - "bill_abstracts"
        - "bill_actions"
        - "bill_sources"
        - "bill_sponsorships"
        - "bill_version_links"
        - "bill_versions"
        - "bills"
        - "vote_counts"
        - "vote_people"
        - "vote_sources"
        - "votes"

mattheiler commented 3 years ago

Or if you wanted to return the raw JSON... I suppose that's what you had in mind, reading back.

jamesturk commented 3 years ago

Yes, I think writing out a static file would be the preferred way to handle this since it doesn't change frequently enough IMO to warrant an API.

Would you need much more than this:

{
  "files": [
    {
      "jurisdiction": "ocd-jurisdiction/country:us/state:nc",
      "session": "2019",
      "type": "JSON Bulk Download",
      "updated_at": "2020-01-01T12:34:56Z",
      "url": "https://data.openstates.org/path/to/file"
    },
    {
      "jurisdiction": "ocd-jurisdiction/country:us/state:nc",
      "session": "2020",
      "type": "JSON Bulk Download",
      "updated_at": "2020-01-01T12:34:56Z",
      "url": "https://data.openstates.org/path/to/file"
    }
  ]
}
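
For the sake of argument, consuming a static manifest in that shape would only take something like this on the client side - the manifest URL is made up, and the fields are just the ones from the example above:

import requests

MANIFEST_URL = "https://data.openstates.org/manifest.json"  # hypothetical location

manifest = requests.get(MANIFEST_URL).json()
for entry in manifest["files"]:
    # filter the static manifest, then download whatever matches
    if entry["jurisdiction"].endswith("state:nc") and entry["type"] == "JSON Bulk Download":
        blob = requests.get(entry["url"]).content
        with open(entry["url"].rsplit("/", 1)[-1], "wb") as fh:
            fh.write(blob)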

mattheiler commented 3 years ago

Sorry for the delayed response. I think session number would implicitly cover legislature number. State would be covered by jurisdiction. Session classification would be nice, if not provided in the files (e.g., regular, special_or_extraordinary, interim). Would there be one file with all of the bulk data (i.e., bill_abstracts, bill_actions, bill_sources, etc.)? If so, that should be enough!