ukwa/ukwa-heritrix
The UKWA Heritrix3 custom modules and Docker builder.
10 stars, 7 forks
issues
#90 WARC fields not populated in Kafka crawl log (anjackson, opened 1 year ago, comments: 0)
#89 Crawler ignoring robots.txt on a particular site (anjackson, opened 1 year ago, comments: 1)
#88 Add unit test for difficult sitemap case (anjackson, opened 1 year ago, comments: 0)
#87 Add NLA/Pandas crawler trap rules (anjackson, closed 1 year ago, comments: 0)
#86 Timsort comparison error for specific robots.txt URL (anjackson, opened 2 years ago, comments: 1)
#85 Ensure blocked sites are fully blocked. (anjackson, closed 1 year ago, comments: 1)
#84 Domain crawl tripping 'abuse' alerts (anjackson, opened 2 years ago, comments: 1)
#83 Firewall issues when crawling some websites (crarugal, opened 2 years ago, comments: 0)
#82 minor readme typos, link corrections etc (ldbiz, closed 2 years ago, comments: 0)
#81 Update security.txt well-known URI (anjackson, closed 2 years ago, comments: 1)
#80 Create a url-frontier Frontier implementation (anjackson, opened 2 years ago, comments: 1)
#79 Add faster/parallel queues for known CDNs (anjackson, opened 2 years ago, comments: 1)
#78 Add support for Redis-compatible alternatives like KvRocks (anjackson, opened 3 years ago, comments: 0)
#77 Verify Redis implementation (anjackson, opened 3 years ago, comments: 0)
#76 Quotas reset on restart (anjackson, opened 3 years ago, comments: 0)
#75 Add support for configuring the Redis DB via an environment variable (anjackson, opened 3 years ago, comments: 0)
#74 Broken checkpoints (anjackson, closed 3 years ago, comments: 1)
#73 Update User Agent URL (anjackson, closed 3 years ago, comments: 0)
#72 DC2021 issues (anjackson, closed 1 year ago, comments: 4)
#71 Fixed JSON property name to fix length always being zero (KarlXerri, closed 3 years ago, comments: 1)
#70 Add 'The Internet' test image to the suite (anjackson, opened 3 years ago, comments: 0)
#69 URL launch with launch timestamp not forcing recrawl (anjackson, opened 3 years ago, comments: 2)
#68 Fixed toCDXLine digest field (KarlXerri, closed 3 years ago, comments: 1)
#67 Add module to look for well-known URIs when we hit a new host (anjackson, closed 3 years ago, comments: 1)
#66 Transfer cookies from WebRender to Heritrix (anjackson, opened 3 years ago, comments: 0)
#65 NPE blocked restart of service (anjackson, opened 3 years ago, comments: 0)
#64 Crawler obeying nofollow directive when instructed to ignore robots.txt (anjackson, closed 3 years ago, comments: 1)
#63 Add crawler ID to Kafka log output (anjackson, opened 3 years ago, comments: 0)
#62 Spot GONE links and annotate them? (anjackson, opened 3 years ago, comments: 1)
#61 Cope with exceptions when scanning for metrics. (anjackson, opened 3 years ago, comments: 0)
#60 Allow number of cleaner threads to be set (anjackson, opened 3 years ago, comments: 0)
#59 Add 'high-speed' sheets to help cope with larger sites that are okay with more traffic (anjackson, closed 3 years ago, comments: 1)
#58 Bump junit from 4.10 to 4.13.1 (dependabot[bot], closed 4 years ago, comments: 0)
#57 Lock contention in OutbackCDX PoolingHttpClientConnectionManager (anjackson, opened 4 years ago, comments: 12)
#56 Docker-compose might need to declare additional Kafka listeners kafka-console* to work (Radtoo, opened 4 years ago, comments: 0)
#55 Add a Cuckoo filter as an alternative to the Bloom filter (anjackson, opened 4 years ago, comments: 0)
#54 Crawler pauses when processing large numbers of candidate URLs (anjackson, opened 4 years ago, comments: 0)
#53 Propagate launchTimestamp:XXXX annotation too (anjackson, opened 4 years ago, comments: 0)
#52 Issues with launches causing NPE and WebRender fails no falling back on H3? (anjackson, opened 4 years ago, comments: 6)
#51 Check for problems with metadata WARC records (anjackson, opened 5 years ago, comments: 0)
#50 Ensure quota resets work with server quotas (anjackson, opened 5 years ago, comments: 1)
#49 NPE in StatisticsTracker because of `null` `SourceTag` (anjackson, closed 5 years ago, comments: 0)
#48 Change the actual crawl job name when starting up? (anjackson, opened 5 years ago, comments: 1)
#47 Problem unpausing after taking Kafka off line (anjackson, opened 5 years ago, comments: 0)
#46 Record seed configuration updates as annotations (anjackson, opened 5 years ago, comments: 0)
#45 Add additional RSS/Atom/ROME extractor (anjackson, opened 5 years ago, comments: 0)
#44 Support mildly malformed and compressed Sitemaps (anjackson, closed 5 years ago, comments: 4)
#43 Always get prerequisites that are resolved via redirects? (anjackson, closed 5 years ago, comments: 1)
#42 How to ensure sitemaps and multi-level sitemaps get refreshed? (anjackson, closed 5 years ago, comments: 2)
#41 Links not being extracted from site maps (anjackson, closed 5 years ago, comments: 1)