vikjam / ui-policy

Unemployment policy at the state level
MIT License

Format data #1

Open vikjam opened 6 years ago

vikjam commented 6 years ago

Here's the current output for an example file. Should probably format it in a more standard way.

{
  "AL": [
    "0.59%",
    "6.74%",
    "2.70%"
  ],
  "AK": [
    "1.00%",
    "5.40%",
    "3.40%"
  ],
  "AZ": [
    "0.02%",
    "5.86%",
    "2.00%"
  ],
  "AR": [
    "1.00%",
    "6.90%",
    "3.80%"
  ],
  "CA": [
    "1.50%",
    "6.20%",
    "3.40%"
  ],
  "CO": [
    "1.00%",
    "5.40%",
    "1.70%"
  ],
  "CT": [
    "1.90%",
    "6.80%",
    "3.70%"
  ],
  "DE": [
    "0.10%",
    "8.00%",
    "2.60%"
  ],
  "DC": [
    "1.60%",
    "7.00%",
    "2.70%"
  ],
  "FL": [
    "1.03%",
    "5.40%",
    "2.70%"
  ],
  "GA": [
    "0.025%",
    "5.40%",
    "2.62%"
  ],
  "HI": [
    "1.20%",
    "5.40%",
    "4.00%"
  ],
  "ID": [
    "0.96%",
    "6.80%",
    "3.36%"
  ],
  "IL": [
    "0.70%",
    "8.40%",
    "3.80%"
  ],
  "IN": [
    "0.70%",
    "9.50%",
    "2.50%"
  ],
  "IA": [
    "0.00%",
    "9.00%",
    "1.90%"
  ],
  "KS": [
    "0.11%",
    "7.40%",
    "4.00%"
  ],
  "KY": [
    "1.00%",
    "10.00%",
    "2.70%"
  ],
  "LA": [
    "0.11%",
    "6.20%",
    "InAvg%"
  ],
  "ME": [
    "0.86%",
    "7.95%",
    "3.02%"
  ],
  "MD": [
    "2.20%",
    "13.50%",
    "2.60%"
  ],
  "MA": [
    "1.26%",
    "12.27%",
    "2.83%"
  ],
  "MI": [
    "0.06%",
    "10.30%",
    "2.70%"
  ],
  "MN": [
    "0.50%",
    "9.40%",
    "2.91%"
  ],
  "MS": [
    "0.85%",
    "5.40%",
    "2.70%"
  ],
  "MO": [
    "0.00%",
    "9.75%",
    "3.51%"
  ],
  "MT": [
    "0.82%",
    "6.12%",
    "InAvg%"
  ],
  "NE": [
    "0.00%",
    "8.66%",
    "2.50%"
  ],
  "NV": [
    "0.25%",
    "5.40%",
    "2.95%"
  ],
  "NH": [
    "0.01%",
    "7.00%",
    "3.70%"
  ],
  "NJ": [
    "0.50%",
    "5.80%",
    "2.80%"
  ],
  "NM": [
    "0.05%",
    "5.40%",
    "2.00%"
  ],
  "NY": [
    "1.50%",
    "9.90%",
    "4.10%"
  ],
  "NC": [
    "0.24%",
    "6.84%",
    "1.20%"
  ],
  "ND": [
    "0.20%",
    "10.00%",
    "1.37%"
  ],
  "OH": [
    "0.70%",
    "9.60%",
    "2.70%"
  ],
  "OK": [
    "0.30%",
    "7.50%",
    "1.00%"
  ],
  "OR": [
    "2.20%",
    "5.40%",
    "3.30%"
  ],
  "PA": [
    "2.68%",
    "10.82%",
    "3.70%"
  ],
  "PR": [
    "2.40%",
    "5.40%",
    "3.30%"
  ],
  "RI": [
    "1.69%",
    "9.79%",
    "2.46%"
  ],
  "SC": [
    "0.10%",
    "11.28%",
    "2.87%"
  ],
  "SD": [
    "0.00%",
    "9.50%",
    "1.20%"
  ],
  "TN": [
    "0.50%",
    "10.00%",
    "2.70%"
  ],
  "TX": [
    "0.78%",
    "8.25%",
    "2.70%"
  ],
  "UT": [
    "0.40%",
    "9.40%",
    "InAvg%"
  ],
  "VT": [
    "1.30%",
    "8.40%",
    "1.00%"
  ],
  "VA": [
    "0.77%",
    "6.87%",
    "3.17%"
  ],
  "VI": [
    "0.10%",
    "9.00%",
    "3.00%"
  ],
  "WA": [
    "0.49%",
    "6.00%",
    "InAvg%"
  ],
  "WV": [
    "1.50%",
    "7.50%",
    "2.70%"
  ],
  "WI": [
    "0.27%",
    "9.80%",
    "3.60%"
  ],
  "WY": [
    "0.67%",
    "10.00%",
    "InAvg%"
  ]
}
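
One possible "more standard" shape is one record per state with named fields. The following is only a minimal sketch: the field names `min_rate`, `max_rate`, and `new_rate` are assumptions inferred from the column headers in the table posted later in this thread, and `parse_percent` and `tidy` are hypothetical helpers, not part of the repo's code. Non-numeric entries such as "InAvg%" are kept as raw text and given a numeric value of `None`.

```python
import json

# Assumed column order, inferred from the later table's headers (not confirmed).
RATE_FIELDS = ("min_rate", "max_rate", "new_rate")


def parse_percent(text):
    """Return the rate as a float in percent, or None if it is not numeric."""
    try:
        return float(text.rstrip("%"))
    except ValueError:
        # Entries such as "InAvg%" have no numeric value.
        return None


def tidy(raw):
    """Turn {"AL": ["0.59%", ...], ...} into a list of per-state records."""
    rows = []
    for state, values in raw.items():
        row = {"state": state}
        for field, value in zip(RATE_FIELDS, values):
            row[field] = parse_percent(value)
            row[field + "_raw"] = value  # keep the original string for auditing
        rows.append(row)
    return rows


sample = json.loads(
    '{"AL": ["0.59%", "6.74%", "2.70%"], "LA": ["0.11%", "6.20%", "InAvg%"]}'
)
records = tidy(sample)
```

Keeping the raw string next to the parsed number makes it easy to spot which states need a manual look.
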
vikjam commented 6 years ago

Uploading JSON samples for multiple months and years: json-sample.zip

ryanedmundkessler commented 6 years ago

Finally got your code up and running. I tried to modify it a bit to extract the "weekly benefit amount" and "number of benefit weeks" fields in addition to the tax rates, but I couldn't get these fields to format nicely.

ryanedmundkessler commented 6 years ago

In other news, I emailed the Department of Labor to see if they had this information in spreadsheet form so that we could bypass the scraping. I got excited when they shared an Excel file. But, perhaps not surprisingly, the Excel file appears to contain a ton of errors. The PDFs that we're scraping appear to do a much better job of capturing changes in the laws.

vikjam commented 6 years ago

Feel free to push your changes and I'll take a look! :grin:

vikjam commented 6 years ago

Hi @ryanedmundkessler, I think I figured it out. I'll upload the changes tomorrow. It helped a lot to use the spreadsheet option in tabula.

https://github.com/chezou/tabula-py/issues/13

ryanedmundkessler commented 6 years ago

Dang! That sounds great. Looking forward to playing around with it!

vikjam commented 6 years ago

Had some mixed success. Here's some example output from the "good" PDFs:

| state | min benefit | max benefit | num weeks | min rate | max rate | new rate | monthyr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AL | $45 | $265 | 15-26 | 0.59% | 6.74% | 2.70% | January2017 |
| AK | $56- 128 | $370- 442 | 16-26 | 1.00% | 5.40% | 2.10% | January2017 |
| AZ | $126 | $240 | 13-26 | 0.03% | 8.91% | 2.00% | January2017 |
| AR | $81 | $451 | 9-20 | 0.10% | 6.00% | 2.90% | January2017 |
| CA | $40 | $450 | 14-26 | 1.50% | 6.20% | 3.40% | January2017 |
| CO | $25 | $516 or $568 | 13-26 | 0.62% | 8.15% | 1.70% | January2017 |
| CT | $15-30 | $616- 691 | 26 | 1.90% | 6.80% | 4.30% | January2017 |
| DE | $20 | $330 | 24-26 | 0.10% | 8.00% | 1.90% | January2017 |
| DC | $50 | $425 | 26 | 1.60% | 7.00% | 2.70% | January2017 |
| FL | $32 | $275 | 9-12 | 0.10% | 5.40% | 2.70% | January2017 |
| GA | $44 | $330 | 6-14 | 0.025% | 5.40% | 2.62% | January2017 |
| HI | $5 | $592 | 26 | 0.00% | 5.60% | 2.40% | January2017 |
| ID | $72 | $410 | 10-26 | 0.425% | 5.40% | 1.488% | January2017 |
| IL | $51-77 | $449- 613 | 26 | 0.55% | 7.75% | 3.55% | January2017 |
| IN | $37 | $390 | 26 | 0.505% | 7.474% | 2.50% | January2017 |
| IA | $66-81 | $447- 548 | 7-26 | 0.00% | 8.00% | 1.00% | January2017 |
| KS | $118 | $474 | 10-26 | 0.20% | 7.60% | 2.70% | January2017 |
| KY | $39 | $415 | 15-26 | 1.00% | 10.00% | 2.70% | January2017 |
| LA | $10 | $247 | 26 | 0.10% | 6.20% | InAvg% | January2017 |
| ME | $71- 106 | $410- 615 | 15-26 | 0.57% | 5.40% | 2.04% | January2017 |
| MD | $50-90 | $430 | 26 | 0.30% | 7.50% | 2.60% | January2017 |
| MA | $37-55 | $742- 1,103 | 10-30 | 0.73% | 11.13% | 1.87% | January2017 |
| MI | $141- 171 | $362 | 14-20 | 0.06% | 10.30% | 2.70% | January2017 |
| MN | $26 | $440- 683 | 11-26 | 0.10% | 9.00% | 1.59% | January2017 |
| MS | $30 | $235 | 13-26 | 0.00% | 5.40% | 1.00% | January2017 |
| MO | $35 | $320 | 8-20 | 0.00% | 9.750% | 3.51% | January2017 |
| MT | $151 | $510 | 8-28 | 0.00% | 6.12% | InAvg% | January2017 |
| NE | $70 | $392 | 1-26 | 0.00% | 5,40% | 1.25% | January2017 |
| NV | $16 | $426 | 12-26 | 0.25% | 5.40% | 2.95% | January2017 |
| NH | $32 | $427 | 26 | 0.10% | 7.50% | 1.70% | January2017 |
| NJ | $100- 115 | $677 | 1-26 | 0.50% | 5.80% | 2.80% | January2017 |
| NM | $79- 119 | $425- 475 | 14-26 | 0.33% | 5.40% | InAvg% | January2017 |
| NY | $100 | $425 | 26 | 1.10% | 8.50% | 3.40% | January2017 |
| NC | $15 | $350 | 12 | 0.06% | 5.76% | 1.00% | January2017 |
| ND | $43 | $630 | 12-26 | 0.28% | 10.72% | 1.62% | January2017 |
| OH | $118 | $435- 587 | 20-26 | 0.30% | 8.70% | 2.70% | January2017 |
| OK | $16 | $510 | 16-26 | 0.10% | 5.50% | 1.50% | January2017 |
| OR | $138 | $590 | 3-26 | 1.11% | 5.40% | 2.60% | January2017 |
| PA | 68-76 | $561- 569 | 18-26 | 2.801% | 10.8937% | 3.6785% | January2017 |
| PR | $7 | $133 | 26 | 2.40% | 5.40% | 3.30% | January2017 |
| RI | $49-99 | $566- 707 | 17-26 | 1.69% | 9.79% | 2.27% | January2017 |
| SC | $42 | $326 | 13-20 | 0.06% | 5.46% | 1.39% | January2017 |
| SD | $28 | $380 | 15-26 | 0.00% | 9.50% | 1.20% | January2017 |
| TN | $30 | $275 | 13-26 | 0.01% | 10.00% | 2.70% | January2017 |
| TX | $66 | $493 | 10-26 | 0.45% | 7.47% | 2.70% | January2017 |
| UT | $29 | $524 | 10-26 | 0.20% | 7.20% | InAvg% | January2017 |
| VT | $77 | $458 | 21-26 | 1.30% | 8.40% | 1.00% | January2017 |
| VA | $60 | $378 | 12-26 | 0.17% | 6.27% | 2.57% | January2017 |
| VI | $33 | $480 | 13-26 | 1.50% | 6.00% | 2.00% | January2017 |
| WA | $162 | $681 | 1-26 | 0.10% | 5.70% | InAvg% | January2017 |
| WV | $24 | $424 | 26 | 1.50% | 7.50% | 2.70% | January2017 |
| WI | $54 | $370 | 14-26 | 0.05% | 12.00% | 3.25% | January2017 |
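
Several cells in this output are ranges or carry extraction quirks ("$56- 128", "$516 or $568", "5,40%", "InAvg%"). A rough sketch of how they might be normalized is below. The helper names `parse_range`, `parse_rate`, and `parse_monthyr` are hypothetical and not taken from extract.py. Anything that does not look like a clean percentage is returned as `None` rather than guessed at, so a value like "5,40%" gets flagged for manual review.

```python
import re
from datetime import datetime


def parse_range(text):
    """Parse "$45", "$56- 128", "$516 or $568", or "15-26" into (low, high).

    A single value yields (value, value). Returns None if no number is found.
    """
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if not numbers:
        return None
    return numbers[0], numbers[-1]


def parse_rate(text):
    """Parse "9.750%" into 9.75; return None for "InAvg%", "5,40%", etc."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)%", text.strip())
    return float(match.group(1)) if match else None


def parse_monthyr(text):
    """Parse "January2017" into (2017, 1). Assumes English month names."""
    parsed = datetime.strptime(text, "%B%Y")
    return parsed.year, parsed.month


# Example values copied from the rows above.
ak_min_benefit = parse_range("$56- 128")
ma_max_benefit = parse_range("$742- 1,103")
ne_max_rate = parse_rate("5,40%")
period = parse_monthyr("January2017")
```

Returning a (low, high) pair for every benefit and week field keeps single values and ranges in the same shape, which should make the CSVs easier to stack across months.
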
vikjam commented 6 years ago

Revisions in https://github.com/vikjam/ui-policy/pull/2

vikjam commented 6 years ago

CSVs of "successful" extractions.

CSVs (2004 - 2016).zip

ryanedmundkessler commented 6 years ago

Dang! This is great. I'm going to get this repo running on my machine today so that I can contribute. (I had been experimenting with my own code, using your code as a guide.)

ryanedmundkessler commented 6 years ago

Update: I've committed some cosmetic changes to extract.py

ryanedmundkessler commented 6 years ago

I've restructured the repo a bit: