reinventalbany/esd-crawl
Web crawler to find data on Empire State Development site
MIT License
0 stars, 0 forks
issues
report broken links to HTML pages
#54
afeld
closed
1 year ago
0
find all broken links
#53
afeld
closed
1 year ago
2
report broken links from PDFs
#50
afeld
opened
1 year ago
0
check for broken links
#49
afeld
closed
1 year ago
1
report links to PDFs that redirect
#48
afeld
opened
1 year ago
1
find broken links to PDFs
#47
afeld
closed
1 year ago
1
read the MTA Open Data Act
#46
afeld
opened
1 year ago
0
show count of tables per PDF
#45
afeld
opened
1 year ago
0
look for tables that include a "Source"
#44
afeld
opened
1 year ago
0
find pages that link to the Open Data Portal
#43
afeld
opened
1 year ago
0
write up data crawling wish list for ESD
#42
afeld
opened
1 year ago
0
indicate if PDF is scanned
#41
afeld
opened
1 year ago
0
give button in review interface to ignore a whole PDF
#40
afeld
closed
1 year ago
1
identify tables that span pages
#39
afeld
opened
2 years ago
0
capture number of rows per table
#38
afeld
opened
2 years ago
0
handle scanned PDFs
#37
afeld
opened
2 years ago
1
support crawling other sites
#36
afeld
opened
2 years ago
0
avoid downloading PDFs
#35
afeld
opened
2 years ago
0
set up continuous integration
#34
afeld
opened
2 years ago
0
run tests through GitHub Actions
#33
afeld
opened
2 years ago
0
perform full crawl
#32
afeld
closed
2 years ago
1
add script to get tables from ParseHub files
#31
afeld
closed
2 years ago
1
include page numbers
#30
afeld
closed
2 years ago
0
record PDF references
#29
afeld
opened
2 years ago
0
update Airtable records if they already exist
#28
afeld
closed
2 years ago
0
exclude application forms
#27
afeld
closed
2 years ago
1
avoid putting duplicate records in Airtable
#26
afeld
closed
2 years ago
0
handle duplicate tables
#25
afeld
opened
2 years ago
0
exclude checklists
#24
afeld
opened
2 years ago
0
write up an overview of the crawling
#23
afeld
closed
2 years ago
1
scrape reports using Playwright
#22
afeld
opened
2 years ago
1
use Scrapy for reports spider
#21
afeld
closed
2 years ago
1
reference the page number of each table
#20
afeld
closed
2 years ago
0
upload results to Airtable
#19
afeld
closed
2 years ago
0
experiment with exporting to Excel
#18
afeld
closed
2 years ago
1
handle PDF redirects
#17
afeld
opened
2 years ago
1
include PDFs from ParseHub
#16
afeld
closed
2 years ago
0
only outline outer border of table
#15
afeld
opened
2 years ago
0
reduce false positives
#14
afeld
opened
2 years ago
1
ensure all file URLs are absolute
#13
afeld
closed
2 years ago
0
find table data
#12
afeld
closed
2 years ago
0
give team process to review discovered tables
#11
afeld
closed
2 years ago
0
move Project over
#10
afeld
closed
2 years ago
0
transfer to Reinvent Albany
#9
afeld
closed
2 years ago
0
gather PDFs from Reports
#8
afeld
closed
2 years ago
1
find all PDFs on the site
#7
afeld
closed
2 years ago
1
find data in PDFs
#6
afeld
closed
2 years ago
0
move AJAX page scraping to spider
#5
afeld
opened
2 years ago
1
pull the title from the PDF itself
#4
afeld
opened
2 years ago
0
look for other data files
#3
afeld
opened
2 years ago
0