Open aaronrudkin opened 8 years ago
That's a good idea. It would be a nice down-the-road enhancement. Would we store the bios as HTML with links embedded or what?
Jeff
On Wed, Sep 14, 2016 at 11:56 PM, Aaron Rudkin notifications@github.com wrote:
Consider this person's biography (Charles Sumner): http://voteview.polisci.ucla.edu/person/9083
Sumner's biography has a variety of interesting facts in it, many of which relate to other things we have pages for. For example, he's "one of the founders of the Free Soil Party in 1848". He was "removed as chairman of the Committee on Foreign Relations in 1871 as a result of differences with President Ulysses S. Grant over policy in Santo Domingo". He was "assaulted in the Senate chamber by Representative Preston Brooks of South Carolina on May 22, 1856".
Ideally, Sumner's biography would link to the Free Soil Party, Ulysses S. Grant, and Preston Brooks.
Doing this live is fairly complex but doing it offline would be fairly feasible, and because biographies are static for a fairly long time, I think worth doing.
Given that bios currently have no markup of any kind, yeah, we'd just store them as HTML with links embedded. If we wanted a clean version of the bio for some other reason (e.g. for exporting, although we currently have no means to export an individual member including bio data), we could strip the HTML on an ad hoc basis. The only reason we'd want to store the data twice would be if we wanted to index the biographies for member searches, but I don't see a user story that starts with "I want to search every member who has Ulysses S. Grant in their bio, but I don't want Ulysses S. Grant" or whatever.
A version of this is done and live. It only replaces party names and exact congressional name matches, and it skips any name shared by more than one congressperson so as to minimize false positives. It does not currently do fuzzy matches, so "Patrick H. Drewry" will not match "Patrick Henry Drewry"; I could fix this fairly easily. It takes about 15-30 minutes to run the process on every member we have (currently running!) and is generally biased towards false negatives rather than false positives.
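For anyone reading along, here is a minimal sketch of the exact-match approach described above. It is not the code in bioLinkParser; the function names, the member IDs other than Sumner's, and the input shape (a list of `(member_id, full_name)` pairs) are assumptions for illustration. Only the `/person/<id>` URL pattern comes from the site.

```python
import re
from collections import Counter


def build_name_index(members):
    """Map each unambiguous full name to its member id.

    `members` is a list of (member_id, full_name) pairs (hypothetical shape).
    Names shared by more than one member are dropped, which trades false
    negatives for fewer false-positive links.
    """
    counts = Counter(name for _, name in members)
    return {name: mid for mid, name in members if counts[name] == 1}


def link_bio(bio, name_index, self_id=None):
    """Wrap exact, whole-word name matches in links to /person/<id>.

    Assumes `bio` is plain text with no existing markup. A member is never
    linked to their own page (pass their id as `self_id`).
    """
    names = [n for n, mid in name_index.items() if mid != self_id]
    if not names:
        return bio
    # Longest names first so a longer name wins over a name it contains.
    names.sort(key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
    )

    def to_link(match):
        name = match.group(1)
        return '<a href="/person/{}">{}</a>'.format(name_index[name], name)

    return pattern.sub(to_link, bio)


# Hypothetical ids apart from 9083 (Charles Sumner, from the issue text).
members = [
    (9083, "Charles Sumner"),
    (1, "Preston Brooks"),
    (2, "John Smith"),
    (3, "John Smith"),
]
index = build_name_index(members)
linked = link_bio(
    "assaulted by Representative Preston Brooks of South Carolina",
    index,
    self_id=9083,
)
```

Because the match is exact, "Patrick H. Drewry" would still not link to a member stored as "Patrick Henry Drewry"; handling that would need a separate fuzzy step.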
The scraper is in /usr/scripts/bioLinkParser.
The db has a new field, bioProcessed, to indicate that the member's bio is processed. To unprocess, we can either apply an HTML stripper to the bio, or simply delete the bio (db.voteview_members.updateMany({"bioProcessed": true}, {"$unset": {"bio": "", "bioProcessed": ""}})) and re-run the bio scraper.
I'd like input from the team on whether there are other texts we should alter. One common biographical element is spelled-out congress numbers (e.g. "One Hundred and Tenth Congress"). We could link these. Honestly, these are ugly and differ from our style norms elsewhere, so there might be value in dynamically substituting "110th Congress" as well.
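A rough sketch of what that substitution could look like, assuming the bios spell congresses as ordinal words followed by "Congress", with hyphens or spaces between words and an optional "and" after "Hundred". The function names are made up for illustration, and the 1-199 range is an arbitrary ceiling.

```python
import re

ORDINALS = ("first second third fourth fifth sixth seventh eighth ninth tenth "
            "eleventh twelfth thirteenth fourteenth fifteenth sixteenth "
            "seventeenth eighteenth nineteenth").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()
TENTHS = ("twentieth thirtieth fortieth fiftieth sixtieth seventieth "
          "eightieth ninetieth").split()
SEP = r"[\s-]+"  # words may be joined by spaces or hyphens


def spell_ordinal(n):
    """Spell 1..199 as lowercase ordinal words (110 -> 'one hundred tenth')."""
    if n == 100:
        return "one hundredth"
    prefix = "one hundred " if n > 100 else ""
    n %= 100
    if n < 20:
        return prefix + ORDINALS[n - 1]
    tens, ones = divmod(n, 10)
    if ones == 0:
        return prefix + TENTHS[tens - 2]
    return prefix + TENS[tens - 2] + " " + ORDINALS[ones - 1]


# Canonical spelling -> number, e.g. 'ninety ninth' -> 99.
LOOKUP = {spell_ordinal(n): n for n in range(1, 200)}


def _to_pattern(phrase):
    """Turn a canonical spelling into a regex tolerant of hyphens and 'and'."""
    pat = SEP.join(phrase.split())
    return pat.replace("hundred" + SEP, "hundred(?:" + SEP + "and)?" + SEP)


CONGRESS_RE = re.compile(
    r"\b(" + "|".join(_to_pattern(p) for p in LOOKUP) + r")(\s+Congress)\b",
    re.IGNORECASE,
)


def _suffix(n):
    """Return the English ordinal suffix for n (1 -> 'st', 11 -> 'th')."""
    if 10 <= n % 100 <= 20:
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")


def normalize_congresses(bio):
    """Replace spelled-out congress ordinals with numerals."""
    def repl(match):
        # Normalize the matched words back to the canonical lookup key.
        key = re.sub(SEP, " ", match.group(1).lower()).replace(" and ", " ")
        n = LOOKUP[key]
        return "{}{}{}".format(n, _suffix(n), match.group(2))
    return CONGRESS_RE.sub(repl, bio)
```

The trailing word boundary means "Second Congressional district" is left alone, as is the plural "Congresses". Linking the result to a congress page would be a separate step.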
Examples of current status: http://128.97.229.160/person/1150 http://128.97.229.160/person/2100 http://128.97.229.160/person/1350 <-- this has a misfire because "Jackson" is a party name, and the bio mentioned Jackson, Louisiana.
I implemented the HTML stripper here and added functionality to re-run an individual bio.
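For reference, a stripper along these lines can be written with only the standard library. This is a sketch under assumptions, not the implemented code: `strip_html` is a hypothetical name, and the link URL in the example is made up.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of a document, discarding all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(bio_html):
    """Return the bio with markup removed and entities decoded."""
    parser = _TextExtractor()
    parser.feed(bio_html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Hypothetical link target, for illustration only.
plain = strip_html(
    'one of the founders of the <a href="/parties/0">Free Soil Party</a> in 1848'
)
```

Stripping this way recovers the original plain bio as long as the link pass only wrapped existing text in tags and did not otherwise rewrite it.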