unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal

Bill summary from bills API has no formatting #104

Closed by tribble 10 years ago

tribble commented 10 years ago

The bill summary returned by the bills API has lost the formatting it has on THOMAS.

Example response for bill hr1204-113:

{
   "results":[
      {
         "bill_id":"hr1204-113",
         "summary":"Aviation Security Stakeholder Participation Act of 2013 - Directs the Assistant Secretary of Homeland Security (Transportation Security Administration [TSA]) to establish in the TSA an Aviation Security Advisory Committee. Requires the Assistant Secretary to consult with the Advisory Committee on aviation security matters. Requires the Advisory Committee to develop, upon the Assistant Secretary's request, recommendations to improve aviation security. Requires the Assistant Secretary to appoint to the Advisory Committee members representing up to 32 member organizations, including air carriers, all cargo air transportation, indirect air carriers, labor organizations representing air carrier employees, aircraft manufacturers, airport operators, general aviation, privacy organizations, the travel industry, airport based businesses, aeronautical repair stations, passenger advocacy groups, the aviation technology security industry, including biometrics, victims of terrorist acts against aviation, and law enforcement and security experts. Establishes within the Advisory Committee: (1) an air cargo security subcommittee; (2) a general aviation subcommittee; (3) an airport perimeter security, exit lane security, and access control subcommittee; (4) a risk-based subcommittee; (5) a security technology subcommittee; and (6) any other subcommittee deemed necessary. Requires each subcommittee to include subject matter experts with relevant expertise."
      }
   ],
   "count":1,
   "page":{
      "count":1,
      "per_page":20,
      "page":1
   }
}

Here is the corresponding page on THOMAS: http://thomas.loc.gov/cgi-bin/bdquery/z?d113:H.R.1204:@@@D&summ2=m&

Is there a way to keep some of the formatting, such as paragraph breaks? Or perhaps create a new field that contains the summary with all of its markup?

JoshData commented 10 years ago

I think that's on Sunlight's Congress API (https://github.com/sunlightlabs/congress/issues), but the data we're generating here is the cause of the problem.

tribble commented 10 years ago

Sorry, you're right.

However, the issue still stands. I'm happy to try to jump in and contribute, but what would you consider the "right" solution to this problem? Preserving some line breaks?

As it stands now, the summary works well for something like search, but it's not very good for human consumption.

konklone commented 10 years ago

I think preserving the line breaks would be the right call. Looks like THOMAS is using a bunch of <p> tags, so we could just convert those to \n\n between blocks, to keep the field plain-text. Anyone who doesn't want the line breaks for some reason can still strip them out easily enough.
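A minimal sketch of that conversion, assuming the summary arrives as an HTML fragment whose paragraphs are delimited by <p> tags. The function names `summary_html_to_text` and `flatten_summary` are hypothetical and not taken from this repository, and the regex-based tag stripping is an illustrative shortcut rather than the scraper's actual approach.

```python
import html
import re

# Matches opening and closing <p> tags, with or without attributes.
# The \b keeps tags such as <pre> from being treated as paragraphs.
P_TAG = re.compile(r"</?p\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def summary_html_to_text(fragment: str) -> str:
    """Convert an HTML summary fragment to plain text.

    Paragraphs are joined with a blank line so the field stays
    plain-text while keeping its paragraph breaks.
    """
    paragraphs = []
    for block in P_TAG.split(fragment):
        block = ANY_TAG.sub("", block)   # drop any remaining inline tags
        block = html.unescape(block)     # decode entities such as &amp;
        block = " ".join(block.split())  # collapse runs of whitespace
        if block:                        # skip empty pieces between tags
            paragraphs.append(block)
    return "\n\n".join(paragraphs)


def flatten_summary(text: str) -> str:
    """Strip the paragraph breaks again, for consumers who don't want them."""
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "<p>First &amp; <b>bold</b>\n text.</p>\n<p>Second   paragraph.</p>"
    print(summary_html_to_text(sample))
```

A consumer that prefers the old single-line form, for search indexing say, could pass the result through `flatten_summary`.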

tribble commented 10 years ago

I don't normally work with Python, but I'll take a stab at this with test coverage.

JoshData commented 10 years ago

@tribble: Thanks for giving it a shot! Let us know if you run into any questions.

JoshData commented 10 years ago

Fixed by #105. Thanks @tribble!