ukwa/ukwa-heritrix
The UKWA Heritrix3 custom modules and Docker builder.
10 stars, 7 forks
issues
Ensure refusal of robots.txt recrawls does not invalidate cached robots.txt info
#40
anjackson
closed
5 years ago
4
Retire-and-awaken queues rather than emitting all over-quota URIs.
#39
anjackson
closed
5 years ago
2
Shift to async REST API for web rendering
#38
anjackson
opened
5 years ago
0
Switch to a 'Scope Oracle' model
#37
anjackson
opened
5 years ago
0
Only reset sheets if the sheets are being modified
#36
anjackson
closed
5 years ago
1
Re-crawl logic causes over-crawling of sites with many page-level Targets
#35
anjackson
closed
5 years ago
9
Blocked re-crawl of robots.txt causing failure cascade for host
#34
anjackson
closed
5 years ago
1
Cope better with partially-failed Web Render events
#33
anjackson
closed
5 years ago
4
Error sending messages to topic with Kafka
#32
ivandonofrio
closed
5 years ago
4
Quieten down or resolve cookie warnings
#31
anjackson
opened
5 years ago
1
NPE in Kafka handling caused crawl to cease
#30
anjackson
closed
5 years ago
1
Ensure quotas are cleared properly
#29
anjackson
closed
5 years ago
5
Pass DOM from WrenderProcessor along to the extractor(s)
#28
anjackson
opened
5 years ago
2
Odd 304 errors
#27
anjackson
closed
5 years ago
2
Quota resets are not working because sheet association was broken for HTTPS
#26
anjackson
closed
5 years ago
4
Verify that the continuous crawler is working
#25
anjackson
closed
5 years ago
3
Create new/updated wrender module to use webrender-puppeteer
#24
anjackson
closed
5 years ago
1
Allow separate Requested/Discovered/Accepted URI streams?
#23
anjackson
opened
5 years ago
2
Extend URL Receiver to allow different event stores to be used
#22
anjackson
opened
5 years ago
1
Ensure partition offsets are being recorded properly
#21
anjackson
closed
5 years ago
3
Ensure we keep crawl logs files
#20
anjackson
closed
6 years ago
1
Add some link farms to the block list
#19
anjackson
closed
5 years ago
1
Prevent Kryo warnings
#18
anjackson
opened
6 years ago
0
Synchronise crawl scope across all Heritrix workers
#17
anjackson
closed
6 years ago
4
NullPointerExceptions killing ToeThreads
#16
anjackson
closed
6 years ago
7
Scoping not being applied correctly.
#15
anjackson
closed
5 years ago
1
Add "Scope+N Hops" scoping support
#14
anjackson
opened
6 years ago
0
Add white-list/black-list support
#13
anjackson
opened
6 years ago
2
ClamAV processor should timeout
#12
anjackson
closed
6 years ago
3
Check de-duplication does not reference warc/revisit records
#11
anjackson
closed
5 years ago
1
Complete testing of initial streaming crawler prototype
#10
anjackson
closed
5 years ago
2
Document new Dockerized crawler development workflow
#9
anjackson
opened
7 years ago
0
Reset caps when seeds appear
#8
anjackson
closed
5 years ago
2
Avoid attempting to parse clearly irrelevant URIs
#7
anjackson
opened
7 years ago
0
Fix up report stats in processors
#6
anjackson
opened
8 years ago
0
Add viaHeritrix download option to WrenderProcessor
#5
anjackson
closed
5 years ago
1
Thread contention in uk.bl.wap.modules.deciderules.CompressibilityDecideRule
#4
anjackson
opened
8 years ago
1
WARCViralWriterProcessor and revisits?
#3
PsypherPunk
closed
9 years ago
1
ExtractorJson: NoSuchMethodError
#2
PsypherPunk
closed
10 years ago
1
Should only load GeoIP2 database once.
#1
anjackson
closed
10 years ago
1