reinventalbany esd-crawl issues

reinventalbany / esd-crawl

Web crawler to find data on Empire State Development site

MIT License

0 stars 0 forks

issues

report broken links to HTML pages

#54 afeld closed 1 year ago
0
find all broken links

#53 afeld closed 1 year ago
2
report broken links from PDFs

#50 afeld opened 1 year ago
0
check for broken links

#49 afeld closed 1 year ago
1
report links to PDFs that redirect

#48 afeld opened 1 year ago
1
find broken links to PDFs

#47 afeld closed 1 year ago
1
read the MTA Open Data Act

#46 afeld opened 1 year ago
0
show count of tables per PDF

#45 afeld opened 1 year ago
0
look for tables that include a "Source"

#44 afeld opened 1 year ago
0
find pages that link to the Open Data Portal

#43 afeld opened 1 year ago
0
write up data crawling wish list for ESD

#42 afeld opened 1 year ago
0
indicate if PDF is scanned

#41 afeld opened 1 year ago
0
give button in review interface to ignore a whole PDF

#40 afeld closed 1 year ago
1
identify tables that span pages

#39 afeld opened 2 years ago
0
capture number of rows per table

#38 afeld opened 2 years ago
0
handle scanned PDFs

#37 afeld opened 2 years ago
1
support crawling other sites

#36 afeld opened 2 years ago
0
avoid downloading PDFs

#35 afeld opened 2 years ago
0
set up continuous integration

#34 afeld opened 2 years ago
0
run tests through GitHub Actions

#33 afeld opened 2 years ago
0
perform full crawl

#32 afeld closed 2 years ago
1
add script to get tables from ParseHub files

#31 afeld closed 2 years ago
1
include page numbers

#30 afeld closed 2 years ago
0
record PDF references

#29 afeld opened 2 years ago
0
update Airtable records if they already exist

#28 afeld closed 2 years ago
0
exclude application forms

#27 afeld closed 2 years ago
1
avoid putting duplicate records in Airtable

#26 afeld closed 2 years ago
0
handle duplicate tables

#25 afeld opened 2 years ago
0
exclude checklists

#24 afeld opened 2 years ago
0
write up an overview of the crawling

#23 afeld closed 2 years ago
1
scrape reports using playwright

#22 afeld opened 2 years ago
1
use scrapy for reports spider

#21 afeld closed 2 years ago
1
reference the page number of each table

#20 afeld closed 2 years ago
0
upload results to Airtable

#19 afeld closed 2 years ago
0
experiment with exporting to Excel

#18 afeld closed 2 years ago
1
handle PDF redirects

#17 afeld opened 2 years ago
1
include PDFs from ParseHub

#16 afeld closed 2 years ago
0
only outline outer border of table

#15 afeld opened 2 years ago
0
reduce false positives

#14 afeld opened 2 years ago
1
ensure all file URLs are absolute

#13 afeld closed 2 years ago
0
find table data

#12 afeld closed 2 years ago
0
give team process to review discovered tables

#11 afeld closed 2 years ago
0
move Project over

#10 afeld closed 2 years ago
0
transfer to Reinvent Albany

#9 afeld closed 2 years ago
0
gather PDFs from Reports

#8 afeld closed 2 years ago
1
find all PDFs on the site

#7 afeld closed 2 years ago
1
find data in PDFs

#6 afeld closed 2 years ago
0
move AJAX page scraping to spider

#5 afeld opened 2 years ago
1
pull the title from the PDF itself

#4 afeld opened 2 years ago
0
look for other data files

#3 afeld opened 2 years ago
0