Open vikjam opened 7 years ago
Uploading JSON samples for multiple months and years: json-sample.zip
Finally got your code up and running. I tried to modify it a bit to extract the "weekly benefit amount" and "number of benefit weeks" fields in addition to the tax rates. But I couldn't seem to get these fields to format nicely.
In other news, I emailed the Department of Labor to see if they had this information in spreadsheet form so that we could bypass the scraping. I got excited when they shared an Excel file. But, perhaps not surprisingly, the Excel file appears to contain a ton of errors. The PDFs that we're scraping appear to do a much better job of capturing changes in the laws.
Feel free to push your changes and I'll take a look! :grin:
Hi @ryanedmundkessler, I think I figured it out. I'll upload the changes tomorrow. It helped a lot to use the spreadsheet option in tabula.
Dang! That sounds great. Looking forward to playing around with it!
Had some mixed success. Here's some example output from the "good" PDFs:
state | min benefit | max benefit | num weeks | min rate | max rate | new rate | monthyr |
---|---|---|---|---|---|---|---|
AL | $45 | $265 | 15-26 | 0.59% | 6.74% | 2.70% | January2017 |
AK | $56- 128 | $370- 442 | 16-26 | 1.00% | 5.40% | 2.10% | January2017 |
AZ | $126 | $240 | 13-26 | 0.03% | 8.91% | 2.00% | January2017 |
AR | $81 | $451 | 9-20 | 0.10% | 6.00% | 2.90% | January2017 |
CA | $40 | $450 | 14-26 | 1.50% | 6.20% | 3.40% | January2017 |
CO | $25 | $516 or $568 | 13-26 | 0.62% | 8.15% | 1.70% | January2017 |
CT | $15-30 | $616- 691 | 26 | 1.90% | 6.80% | 4.30% | January2017 |
DE | $20 | $330 | 24-26 | 0.10% | 8.00% | 1.90% | January2017 |
DC | $50 | $425 | 26 | 1.60% | 7.00% | 2.70% | January2017 |
FL | $32 | $275 | 9-12 | 0.10% | 5.40% | 2.70% | January2017 |
GA | $44 | $330 | 6-14 | 0.025% | 5.40% | 2.62% | January2017 |
HI | $5 | $592 | 26 | 0.00% | 5.60% | 2.40% | January2017 |
ID | $72 | $410 | 10-26 | 0.425% | 5.40% | 1.488% | January2017 |
IL | $51-77 | $449- 613 | 26 | 0.55% | 7.75% | 3.55% | January2017 |
IN | $37 | $390 | 26 | 0.505% | 7.474% | 2.50% | January2017 |
IA | $66-81 | $447- 548 | 7-26 | 0.00% | 8.00% | 1.00% | January2017 |
KS | $118 | $474 | 10-26 | 0.20% | 7.60% | 2.70% | January2017 |
KY | $39 | $415 | 15-26 | 1.00% | 10.00% | 2.70% | January2017 |
LA | $10 | $247 | 26 | 0.10% | 6.20% | InAvg% | January2017 |
ME | $71- 106 | $410- 615 | 15-26 | 0.57% | 5.40% | 2.04% | January2017 |
MD | $50-90 | $430 | 26 | 0.30% | 7.50% | 2.60% | January2017 |
MA | $37-55 | $742- 1,103 | 10-30 | 0.73% | 11.13% | 1.87% | January2017 |
MI | $141- 171 | $362 | 14-20 | 0.06% | 10.30% | 2.70% | January2017 |
MN | $26 | $440- 683 | 11-26 | 0.10% | 9.00% | 1.59% | January2017 |
MS | $30 | $235 | 13-26 | 0.00% | 5.40% | 1.00% | January2017 |
MO | $35 | $320 | 8-20 | 0.00% | 9.750% | 3.51% | January2017 |
MT | $151 | $510 | 8-28 | 0.00% | 6.12% | InAvg% | January2017 |
NE | $70 | $392 | 1-26 | 0.00% | 5,40% | 1.25% | January2017 |
NV | $16 | $426 | 12-26 | 0.25% | 5.40% | 2.95% | January2017 |
NH | $32 | $427 | 26 | 0.10% | 7.50% | 1.70% | January2017 |
NJ | $100- 115 | $677 | 1-26 | 0.50% | 5.80% | 2.80% | January2017 |
NM | $79- 119 | $425- 475 | 14-26 | 0.33% | 5.40% | InAvg% | January2017 |
NY | $100 | $425 | 26 | 1.10% | 8.50% | 3.40% | January2017 |
NC | $15 | $350 | 12 | 0.06% | 5.76% | 1.00% | January2017 |
ND | $43 | $630 | 12-26 | 0.28% | 10.72% | 1.62% | January2017 |
OH | $118 | $435- 587 | 20-26 | 0.30% | 8.70% | 2.70% | January2017 |
OK | $16 | $510 | 16-26 | 0.10% | 5.50% | 1.50% | January2017 |
OR | $138 | $590 | 3-26 | 1.11% | 5.40% | 2.60% | January2017 |
PA | 68-76 | $561- 569 | 18-26 | 2.801% | 10.8937% | 3.6785% | January2017 |
PR | $7 | $133 | 26 | 2.40% | 5.40% | 3.30% | January2017 |
RI | $49-99 | $566- 707 | 17-26 | 1.69% | 9.79% | 2.27% | January2017 |
SC | $42 | $326 | 13-20 | 0.06% | 5.46% | 1.39% | January2017 |
SD | $28 | $380 | 15-26 | 0.00% | 9.50% | 1.20% | January2017 |
TN | $30 | $275 | 13-26 | 0.01% | 10.00% | 2.70% | January2017 |
TX | $66 | $493 | 10-26 | 0.45% | 7.47% | 2.70% | January2017 |
UT | $29 | $524 | 10-26 | 0.20% | 7.20% | InAvg% | January2017 |
VT | $77 | $458 | 21-26 | 1.30% | 8.40% | 1.00% | January2017 |
VA | $60 | $378 | 12-26 | 0.17% | 6.27% | 2.57% | January2017 |
VI | $33 | $480 | 13-26 | 1.50% | 6.00% | 2.00% | January2017 |
WA | $162 | $681 | 1-26 | 0.10% | 5.70% | InAvg% | January2017 |
WV | $24 | $424 | 26 | 1.50% | 7.50% | 2.70% | January2017 |
WI | $54 | $370 | 14-26 | 0.05% | 12.00% | 3.25% | January2017 |
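
For anyone who wants numbers rather than strings out of these cells, here is a minimal sketch of one way to parse them. It is not what `extract.py` does; the helper names `parse_range` and `parse_rate` are hypothetical. It assumes benefit and week cells hold one or two numbers (`$45`, `$56- 128`, `$516 or $568`, `15-26`) and that rate cells are a plain decimal followed by `%`. Anything else, such as `InAvg%` or `5,40%`, comes back as `None` so it can be checked by hand instead of guessed at.

```python
import re
from typing import Optional, Tuple

# Matches integers or decimals, allowing thousands separators (e.g. "1,103").
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_range(cell: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) for cells like '$45', '$56- 128', '15-26'.

    A single value yields the same number for both ends.
    Returns None when the cell contains no digits at all.
    """
    numbers = [float(m.replace(",", "")) for m in _NUMBER.findall(cell)]
    if not numbers:
        return None
    return (min(numbers), max(numbers))


def parse_rate(cell: str) -> Optional[float]:
    """Return the percentage as a float for cells like '0.59%'.

    Returns None for anything that is not a plain decimal
    (e.g. 'InAvg%' or '5,40%'), leaving those rows for manual review.
    """
    text = cell.strip().rstrip("%").strip()
    try:
        return float(text)
    except ValueError:
        return None
```

For example, `parse_range("$742- 1,103")` gives `(742.0, 1103.0)` and `parse_rate("InAvg%")` gives `None`. Returning `None` instead of reinterpreting `5,40%` as `5.40%` is a deliberate choice here: the comma may be an extraction glitch, but that is a guess the code should not make silently.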
Revisions in https://github.com/vikjam/ui-policy/pull/2
CSVs of "successful" extractions.
Dang! This is great. I'm going to get this repo running on my machine today so that I can contribute to this. (I had been experimenting with my own code, using your code as a guide.)
Update: I've committed some cosmetic changes to extract.py
I've restructured the repo a bit:
Here's the current output for an example file. Should probably format it in a more standard way.
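
One possible take on "a more standard way", as a rough sketch only: write each row as CSV with snake_case column names and turn `monthyr` values like `January2017` into ISO-style `2017-01`. The column names, the `month` field, and both function names are assumptions for illustration, not the repo's current layout. It also assumes rows arrive as dicts keyed like the January 2017 table earlier in this thread, and that month names are in English (`%B` is locale-dependent).

```python
import csv
import io
from datetime import datetime

# Hypothetical output schema; "month" replaces the raw "monthyr" string.
COLUMNS = [
    "state", "min_benefit", "max_benefit", "num_weeks",
    "min_rate", "max_rate", "new_rate", "month",
]


def monthyr_to_iso(monthyr: str) -> str:
    """Convert 'January2017' to '2017-01'."""
    return datetime.strptime(monthyr, "%B%Y").strftime("%Y-%m")


def rows_to_csv(rows) -> str:
    """Serialize row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        record = dict(row)  # copy so the caller's dict is not modified
        record["month"] = monthyr_to_iso(record.pop("monthyr"))
        writer.writerow(record)
    return buffer.getvalue()
```

With a `YYYY-MM` month column the files sort chronologically as plain text, which makes stacking the monthly extractions into one panel a bit easier.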