openstates / issues

Text Extraction of CA pdf failing #147

Open rmcarthur opened 3 years ago

rmcarthur commented 3 years ago

I'm working with Mo Hayat at WashingtonAbstract, and we're hoping to both use and contribute to the OpenStates scrapers/API/data. I only recently began playing around with the various repos available here, and have found only a few instances where our needs are not met. First thing I want to say is THANK YOU for everything you've done and the way in which you've done it. The documentation and the use of Docker containers and orchestration are fantastic, easy to replicate, and beautiful.

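While trying to run text extraction on California PDFs, I found that downloading a CA PDF fails because the web link is not a direct download: it returns an HTML page that uses a JavaScript function to access the document. Compare the following from California https://leginfo.legislature.ca.gov/faces/billPdf.xhtml?bill_id=201720180SB1287&version=20170SB128793ENR and Colorado http://leg.colorado.gov/sites/default/files/documents/2017A/bills/2017a_1228sblt_01.pdf. The California page actually looks like this: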

<!DOCTYPE html>
<!--
To change this license header, choose License Headers in Project Properties.
To change this template file, choose Tools | Templates
and open the template in the editor.
--><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Download Bill PDF</title>
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  </head>
  <script language="JavaScript">
    function  downLoadPDF() {

      document.getElementById("pdf_link2").click();
    }

  </script><body onload="downLoadPDF()">
<form id="downloadForm" name="downloadForm" method="post" action="/faces/billPdf.xhtml" enctype="application/x-www-form-urlencoded">
<input type="hidden" name="downloadForm" value="downloadForm" />
<script type="text/javascript" src="/faces/javax.faces.resource/jsf.js?ln=javax.faces"></script>
<input id="pdf_link2" type="submit" name="pdf_link2" value="PDF2" style="margin-left: .8em; visibility:hidden" class="bill_nav_sub_mobile" onclick="mojarra.jsfcljs(document.getElementById('downloadForm'),{'pdf_link2':'pdf_link2','bill_id':'201720180SB1287','version':'20170SB128793ENR'},'');return false" />
<input type="hidden" name="javax.faces.ViewState" id="j_id1:javax.faces.ViewState:0" value="H7yRTRWEQLp++LEVhL0tqxAc1WFv5cChUjQzPgQx... (long base64 token truncated)" autocomplete="off" />
</form></body>

</html>

The scrapers used for this seem to be a slightly modified version of urllib's functionality; however, that limits the use case to directly hosted documents and doesn't allow for the JavaScript execution that PDF downloads like California's require.
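One way to see this failure concretely is to check what a plain HTTP fetch actually returned. The following is a minimal sketch, assuming only that a real PDF begins with the `%PDF-` magic bytes; the helper names `looks_like_pdf` and `classify_download` are hypothetical and are not part of any OpenStates repo.

```python
def looks_like_pdf(payload: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    # PDF files begin with "%PDF-"; tolerate a UTF-8 BOM or leading whitespace.
    return payload.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"%PDF-")


def classify_download(payload: bytes) -> str:
    """Classify a downloaded payload as 'pdf', 'html', or 'unknown'."""
    if looks_like_pdf(payload):
        return "pdf"
    # Only inspect the first bytes; an HTML wrapper page declares itself early.
    head = payload[:512].lower()
    if b"<!doctype html" in head or b"<html" in head:
        return "html"
    return "unknown"
```

With a check like this, the Colorado link would be expected to classify as `pdf`, while the California link returns the HTML wrapper page shown above and would classify as `html`.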

@jamesturk I assume you know all of this since you wrote all of this, so my question is mostly about what you would prefer to see done to handle it. I'm happy to write a PR for your approval, but want to align in case you've thought of a better solution.

My thought is to add a Selenium-style scraper to extract the file when the jurisdiction doesn't provide a direct link for download. Would you prefer this be written in the TextExtraction repo as a common util, or do you see a need for it to be written generally into your Scraper class? Do other states have this problem? Is it something that has been a roadblock in the past? Please let me know if this isn't the right place for this discussion.

jamesturk commented 3 years ago

This seems to have broken since we had the documents being scraped... thanks for flagging.

So two thoughts here: the openstates scraper itself should probably figure out the actual document link, not link to the page that has JS on it. The text-extraction repo can then remain simpler ideally.

As far as using Selenium, we've avoided it thus far, and I'd like to continue to if at all possible. Generally it is easier to reverse engineer the JS and figure out how to get the link out, so that's what I'd try first here. I'd need to take a closer look at network traffic to see if there's a straightforward way to get through these. If you have some time and want to investigate, I'd be glad to discuss here or give you a Slack invite if that's better for you.
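As a rough illustration of what "reverse engineering the JS" could look like here: the page above does nothing more than submit a form, so the values the browser would post can be read straight out of the HTML. The sketch below uses only the Python standard library; `DownloadFormParser` and `extract_post_fields` are hypothetical names, and the assumption that these are the only fields the server needs has not been verified against the live site.

```python
import re
from html.parser import HTMLParser


class DownloadFormParser(HTMLParser):
    """Collect the form action, hidden inputs, and the onclick parameters."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input":
            # Hidden inputs (including javax.faces.ViewState) are posted as-is.
            if attrs.get("type") == "hidden" and attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""
            # The submit button passes extra parameters via mojarra.jsfcljs(form, {...}, '').
            onclick = attrs.get("onclick") or ""
            match = re.search(r"jsfcljs\([^{]*\{(.*?)\}", onclick)
            if match:
                pairs = re.findall(r"'([^']+)'\s*:\s*'([^']*)'", match.group(1))
                self.fields.update(pairs)


def extract_post_fields(html: str):
    """Return (form action, dict of fields) parsed from the wrapper page."""
    parser = DownloadFormParser()
    parser.feed(html)
    return parser.action, parser.fields
```

Run against the California page quoted in this issue, this should yield the action `/faces/billPdf.xhtml` plus the `downloadForm`, `pdf_link2`, `bill_id`, `version`, and `javax.faces.ViewState` fields.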

On Tue, Oct 13, 2020 at 1:26 AM Rex McArthur notifications@github.com wrote:

Compare the following from California https://leginfo.legislature.ca.gov/faces/billPdf.xhtml?bill_id=201720180SB1287&version=20170SB128793ENR and Colorado http://leg.colorado.gov/sites/default/files/documents/2017A/bills/2017a_1228sblt_01.pdf.

rmcarthur commented 3 years ago

Slack invite would be great. Can you send it to drexmcarthur@gmail.com? Happy to discuss and take a stab at it.

showerst commented 3 years ago

This was on my mind anyway for rewriting the CA scraper to avoid the SQL database, so I asked about GET links at their feedback form here -- http://leginfo.legislature.ca.gov/faces/feedbackDetail.xhtml?primaryFeedbackId=prim1603215435573

Maybe we'll get lucky =)

rmcarthur commented 3 years ago

http://leginfo.legislature.ca.gov/faces/feedbackDetail.xhtml?primaryFeedbackId=prim1603215435573 Looks like they've responded and don't intend to support it.

jamesturk commented 3 years ago

Well then :/

I'm taking a couple days off this week, but I'll find time soon to sit down & seriously figure out how to integrate this.

rmcarthur commented 3 years ago

Ping on this. Happy to spend some time sorting it out, just want to know what direction you want to go before I really tackle it.

jamesturk commented 3 years ago

Sorry for the delay. I do think we can aim to pull these in; I'm hoping to land a test harness next week and would like to bring this in afterwards.

jamesturk commented 3 years ago

@rmcarthur do you think it'd be possible to mock the request without actually executing the JS? if so I think I'd like to get this into the scraper itself, so that the scraper is actually saving correct links to the PDF.

this is the only state with this specific issue, so I'd prefer for now it live in the california specific code instead of a generic module as well. glad to chat more on Slack/etc. when you have time
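To make the "mock the request without executing the JS" idea concrete, here is a minimal sketch of replaying the form submission as a plain POST. It only builds the request and does not send it. `build_pdf_request` is a hypothetical helper, and the field values are assumed to have been read from the wrapper page beforehand; this is not necessarily how the California scraper ended up handling it.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_pdf_request(page_url: str, action: str, fields: dict) -> Request:
    """Build the POST the browser would send when the hidden button is clicked."""
    # The form's enctype is application/x-www-form-urlencoded, so encode accordingly.
    body = urlencode(fields).encode("ascii")
    return Request(
        urljoin(page_url, action),  # resolve the form's relative action URL
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Sending the result with `urllib.request.urlopen` (or an equivalent session-based client) would be the next step. The server likely also expects the session cookie issued with the wrapper page, since the `javax.faces.ViewState` token is normally tied to a session, so the same client should probably be used for both the initial GET and this POST.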

jamesturk commented 2 years ago

I managed to circumvent the need for Selenium. os-text-extract had to special-case CA data, but it now seems to be extracting fine. Running this again on the latest session.