openstates / issues

Having trouble? Looking to contribute? Issues live here!

NM: Update votes scraper (was: Fix individual members' votes in House PDFs) #65

Open mileswwatkins opened 6 years ago

mileswwatkins commented 6 years ago

New Mexico publishes its votes as PDFs (directory), and we parse the vote tables using the x and y coordinates of the X checkmarks.

Unfortunately, in at least one of the 2018 session's House vote PDFs, the rows in the LXMLized vote PDF don't line up; that is, a vote checkmark can have a y coordinate that differs from its member's, so the vote can't be attributed.

When this is detected, I'm setting the scraper to throw out the individual-member votes and keep the vote totals. But the individual-member scraping is so close that some additional logic may be able to salvage these cases.
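One possible form of that additional logic is to merge rows whose y coordinates are nearly equal before attributing votes. The following is only a sketch, not the scraper's actual code: the function name `merge_close_rows` and the 2-unit tolerance are assumptions, and `rows` is assumed to map each y coordinate to a list of `(text, x, width)` tuples. The sample values are copied from the problematic PDF discussed in this issue.

```python
def merge_close_rows(rows, tolerance=2):
    """Merge rows whose y coordinates lie within `tolerance` of the
    first row of the current group. Hypothetical helper; the tolerance
    is a guess that is far below the ~21-unit spacing of normal rows."""
    merged = {}
    anchor_y = None
    for y in sorted(rows):
        if anchor_y is not None and y - anchor_y <= tolerance:
            # Close enough to the previous group: treat as the same row.
            merged[anchor_y].extend(rows[y])
        else:
            # Start a new group keyed by this y coordinate.
            anchor_y = y
            merged[y] = list(rows[y])
    return merged


rows = {
    444: [("X", 229, 10), ("Clahchischilliage", 53, 112),
          ("X", 635, 10), ("Montoya", 459, 58)],
    465: [("X", 326, 10)],
    466: [("Cook", 53, 35), ("X", 635, 10), ("Nibert", 459, 40)],
    986: [("Lewis", 53, 38), ("X", 635, 10), ("Wooley", 459, 50)],
    987: [("X", 324, 10)],
}
merged = merge_close_rows(rows)
```

With these sample rows, the stray checkmarks at y=465 and y=987 end up in the same row as Cook and Lewis respectively, while the well-formed row at y=444 is left unchanged.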

cc @cliftonmcintosh

mileswwatkins commented 6 years ago

Here's a clean-ish version of the scraped rows for one problematic House vote PDF:

(Pdb) pprint(OrderedDict(sorted(rows.items(), key=lambda t: t[0])))
OrderedDict([(17, [('OFFICIAL ROLL CALL', 373, 168)]),
             (39, [('NEW MEXICO HOUSE OF REPRESENTATIVES', 275, 363)]),
             (60,
              [('Second Regular Session of the 53rd Legislature', 282, 350)]),
             (82, [('2018 Regular Session', 376, 162)]),
             (110, [('LEGISLATIVE DAY 2', 129, 158)]),
             (132, [('RCS# 14', 623, 67)]),
             (176, [('HB 1/ec', 427, 59)]),
             (198,
              [('F I N A L  P A S S A G E  with emergency clause', 278, 358)]),
             (224,
              [('YEAS: 67', 120, 73),
               ('NAYS: 0', 294, 65),
               ('EXCUSED: 0', 476, 99),
               ('ABSENT: 3', 647, 87)]),
             (248, [('REPRESENTATIVE', 54, 136), ('REPRESENTATIVE', 460, 136)]),
             (270,
              [('X', 229, 10),
               ('Adkins', 53, 45),
               ('X', 635, 10),
               ('Louis', 459, 36)]),
             (292,
              [('X', 229, 10),
               ('Alcon', 53, 38),
               ('X', 635, 10),
               ('Lundstrom', 459, 71)]),
             (314,
              [('X', 229, 10),
               ('Armstrong, D.', 53, 93),
               ('X', 635, 10),
               ('Maestas', 459, 57)]),
             (336,
              [('X', 229, 10),
               ('Armstrong, Gail', 53, 104),
               ('X', 635, 10),
               ('Maestas Barnes', 459, 108)]),
             (357,
              [('X', 229, 10),
               ('Baldonado', 53, 72),
               ('X', 635, 10),
               ('Martínez, Javier', 459, 107)]),
             (379,
              [('X', 229, 10),
               ('Bandy', 53, 43),
               ('X', 635, 10),
               ('Martinez, Rudy', 459, 101)]),
             (401,
              [('X', 229, 10),
               ('Brown', 53, 43),
               ('X', 816, 10),
               ('McCamley', 459, 71)]),
             (422,
              [('X', 229, 10),
               ('Chasey', 53, 51),
               ('X', 635, 10),
               ('McQueen', 459, 65)]),
             (444,
              [('X', 229, 10),
               ('Clahchischilliage', 53, 112),
               ('X', 635, 10),
               ('Montoya', 459, 58)]),
             (465, [('X', 326, 10)]),
             (466, [('Cook', 53, 35), ('X', 635, 10), ('Nibert', 459, 40)]),
             (487,
              [('X', 229, 10),
               ('Crowder', 53, 57),
               ('X', 635, 10),
               ('Powdrell-Culbert', 459, 111)]),
             (509,
              [('X', 229, 10),
               ('Dines', 53, 38),
               ('X', 635, 10),
               ('Rehm', 459, 40)]),
             (531,
              [('X', 229, 10),
               ('Dodge', 53, 44),
               ('X', 635, 10),
               ('Roch', 459, 35)]),
             (552,
              [('X', 229, 10),
               ('Dow', 53, 30),
               ('X', 635, 10),
               ('Rodella', 459, 51)]),
             (574,
              [('X', 229, 10),
               ('Egolf', 53, 34),
               ('X', 635, 10),
               ('Romero', 459, 53)]),
             (596,
              [('X', 229, 10),
               ('Ely', 53, 21),
               ('X', 635, 10),
               ('Roybal Caballero', 459, 115)]),
             (617,
              [('X', 229, 10),
               ('Ezzell', 53, 40),
               ('X', 635, 10),
               ('Rubio', 459, 39)]),
             (639,
              [('X', 229, 10),
               ('Fajardo', 53, 51),
               ('X', 635, 10),
               ('Ruiloba', 459, 51)]),
             (661,
              [('X', 229, 10),
               ('Ferrary', 53, 48),
               ('X', 635, 10),
               ('Salazar, Nick', 459, 88)]),
             (683,
              [('X', 229, 10),
               ('Gallegos, David', 53, 106),
               ('X', 635, 10),
               ('Salazar, Tomás', 459, 105)]),
             (704,
              [('X', 229, 10),
               ('Gallegos, Doreen', 53, 117),
               ('X', 635, 10),
               ('Sariñana', 459, 60)]),
             (726,
              [('X', 229, 10),
               ('Garcia Richard', 53, 100),
               ('X', 635, 10),
               ('Scott', 459, 34)]),
             (748,
              [('X', 229, 10),
               ('Garcia, Harry', 53, 89),
               ('X', 635, 10),
               ('Small', 459, 38)]),
             (769,
              [('X', 229, 10),
               ('García, M.P.', 53, 84),
               ('X', 635, 10),
               ('Smith', 459, 38)]),
             (791,
              [('X', 229, 10),
               ('Gentry', 53, 45),
               ('X', 635, 10),
               ('Stapleton', 459, 63)]),
             (813,
              [('X', 229, 10),
               ('Gomez', 53, 48),
               ('X', 635, 10),
               ('Strickler', 459, 54)]),
             (834,
              [('X', 229, 10),
               ('Gonzales', 53, 63),
               ('X', 635, 10),
               ('Sweetser', 459, 63)]),
             (856,
              [('X', 229, 10),
               ('Hall', 53, 26),
               ('X', 635, 10),
               ('Thomson', 459, 63)]),
             (878,
              [('X', 229, 10),
               ('Harper', 53, 46),
               ('X', 635, 10),
               ('Townsend', 459, 69)]),
             (899,
              [('X', 229, 10),
               ('Herrell', 53, 44),
               ('X', 635, 10),
               ('Trujillo, Carl', 459, 80)]),
             (921,
              [('X', 229, 10),
               ('Johnson', 53, 57),
               ('X', 635, 10),
               ('Trujillo, Christine', 459, 112)]),
             (943,
              [('X', 229, 10),
               ('Larrañaga', 53, 68),
               ('X', 635, 10),
               ('Trujillo, Jim', 459, 76)]),
             (965,
              [('X', 229, 10),
               ('Lente', 53, 38),
               ('X', 635, 10),
               ('Trujillo, Linda', 459, 89)]),
             (986, [('Lewis', 53, 38), ('X', 635, 10), ('Wooley', 459, 50)]),
             (987, [('X', 324, 10)]),
             (1008,
              [('X', 229, 10),
               ('Little', 53, 32),
               ('X', 635, 10),
               ('Youngblood', 459, 80)]),
             (1038,
              [('CERTIFIED CORRECT TO THE BEST OF OUR KNOWLEDGE', 354, 467)]),
             (1060, [('(Speaker)', 749, 72)]),
             (1082, [('(Chief Clerk)', 730, 93)])])
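
In the rows above, member names sit at x=53 (left half) and x=459 (right half), and each checkmark appears to the right of the name it belongs to. A rough sketch of attributing marks on that basis follows; `attribute_marks` is a hypothetical helper, not the scraper's existing logic, and it assumes every mark belongs to the nearest name on its left within the same row.

```python
def attribute_marks(row):
    """Map each member name in a row to the x coordinate of its checkmark.

    `row` is a list of (text, x, width) tuples. Each "X" is assigned to
    the closest name whose x coordinate is smaller than the mark's.
    If several marks resolve to one name, the rightmost mark wins.
    """
    names = sorted((x, text) for text, x, _width in row if text != "X")
    marks = sorted(x for text, x, _width in row if text == "X")
    attributed = {}
    for mark_x in marks:
        # Names located to the left of this mark, ordered by x.
        candidates = [name for x, name in names if x < mark_x]
        if candidates:
            attributed[candidates[-1]] = mark_x
    return attributed


# Row y=401 from the dump: McCamley's mark is in a different column (x=816).
row_401 = [("X", 229, 10), ("Brown", 53, 43),
           ("X", 816, 10), ("McCamley", 459, 71)]
votes_401 = attribute_marks(row_401)

# Rows y=465 and y=466 from the dump, combined by hand for illustration.
row_cook = [("X", 326, 10), ("Cook", 53, 35),
            ("X", 635, 10), ("Nibert", 459, 40)]
votes_cook = attribute_marks(row_cook)
```

The returned x coordinate would still need to be translated into a vote type (yes, no, excused, absent) from the column positions, which this sketch does not attempt.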

The scraper warns:

12:51:41 WARNING pupa: No vote found for ('X', 326, 10)
12:51:41 WARNING pupa: No vote found for ('X', 635, 10)
12:51:41 WARNING pupa: No vote found for ('X', 635, 10)
12:51:41 WARNING pupa: No vote found for ('X', 324, 10)

estaub commented 6 years ago

Yeah, that's certainly nudgable. I'm curious about the PDF source; I wonder if it's OCR, or they used some WYSIWYG form generator and were sloppy.

In-vincible commented 6 years ago

@mileswwatkins is this solved?

mileswwatkins commented 6 years ago

@In-vincible, no, my PR just changed the code to skip votes that were troublesome, which is why I spun off this ticket.

https://github.com/openstates/openstates/pull/2103/files#diff-d57e2d82487395e5ff5349aea8c56550R273

schneidy commented 4 years ago

Spacing issues still exist within PDFs. Sometimes a yes vote is categorized as a no vote. Example: https://www.nmlegis.gov/Sessions/19%20Regular/votes/HB0256HVOTE.PDF

jessemortenson commented 8 months ago

Adding context to this old issue: