ukwa/webarchive-discovery
WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki
116 stars, 25 forks
issues
Changed LanguageAnalyser to langid
#318
lasztoth
closed
3 months ago
0
Changed LanguageAnalyser to langid
#317
lasztoth
closed
3 months ago
0
SolrFields.SOLR_COLLECTION vs SolrFields.COLLECTION, same string value
#316
vkush
opened
3 months ago
0
"collection-id" vs "collection_id" in WARCIndexerCommandOptions.java
#315
vkush
opened
3 months ago
0
Add whitespace between div elements.
#314
thomasegense
opened
5 months ago
0
Reverting source_file_path back to value it was in 3.2.
#313
thomasegense
closed
7 months ago
2
Bump org.opensearch:opensearch from 1.3.9 to 2.11.1 in /warc-indexer
#312
dependabot[bot]
opened
11 months ago
0
warc-indexer needs file.encoding="UTF8"
#311
tokee
opened
1 year ago
1
Problem with log4j missing on h020
#310
anjackson
opened
1 year ago
0
Bump itextpdf from 5.5.12 to 5.5.13.3 in /digipres-tika
#309
dependabot[bot]
closed
1 year ago
0
Do we still need source_file_path?
#308
anjackson
closed
5 months ago
5
Improve the JSONL output
#307
anjackson
opened
1 year ago
0
H020 upgrade Tika 2
#306
anjackson
closed
1 year ago
0
Upgrade Apache Tika to version 2
#305
anjackson
closed
1 year ago
1
Upgrade DROID via Nanite
#304
anjackson
closed
1 year ago
1
Bump opensearch from 1.1.0 to 2.5.0 in /warc-indexer
#303
dependabot[bot]
closed
1 year ago
1
Exception in thread "main" java.lang.NoSuchFieldError: LUCENE_8_8_2
#302
steph-nb
closed
1 year ago
3
Heuristic fix of charset issues
#301
tokee
opened
2 years ago
2
Add WARC compressed record length to the extraction
#300
anjackson
opened
2 years ago
1
Add option for JSONL output
#299
anjackson
closed
1 year ago
3
Upgrade to Hadoop 3
#298
anjackson
opened
2 years ago
0
Support Spark including Spark SQL
#297
anjackson
opened
2 years ago
5
Record turns up with hilariously inaccurate date
#296
anjackson
opened
2 years ago
0
Error indexing WARCs
#295
VictorHarbo
closed
2 years ago
1
Bump jsoup from 1.14.2 to 1.15.3 in /warc-indexer
#294
dependabot[bot]
closed
1 year ago
0
Use CommonGrams to speed up queries that contain stop words
#293
anjackson
opened
2 years ago
0
Bump itextpdf from 5.2.0 to 5.5.12 in /digipres-tika
#292
dependabot[bot]
closed
2 years ago
0
Support for Arabic language in warc-indexer -> Solr fields
#291
thomasegense
opened
2 years ago
2
Add meta stats fields
#290
tokee
opened
2 years ago
0
warc-indexer. video mp4 file classified as "other"
#289
thomasegense
opened
2 years ago
5
Create Solr 8/9 schemas
#288
tokee
opened
2 years ago
3
Extract links_videos (and links_sounds?)
#287
tokee
opened
2 years ago
1
Generalize rules for skipping content
#286
tokee
opened
2 years ago
0
Enable url_norm by default
#285
anjackson
opened
2 years ago
2
Warc-Indexer remove port :80 from url/links when normalising.
#284
thomasegense
opened
2 years ago
0
Host links validation
#283
tokee
closed
2 years ago
2
Bump stanford-corenlp from 4.0.0 to 4.4.0 in /warc-nlp
#282
dependabot[bot]
closed
2 years ago
0
Links_hosts field normalize error
#281
thomasegense
closed
2 years ago
2
Bump xmlgraphics-commons from 1.4 to 2.6 in /digipres-tika
#280
dependabot[bot]
closed
2 years ago
0
Improve MAVEN build Performance
#279
SilverSteven
closed
2 years ago
1
Bump xercesImpl from 2.12.0 to 2.12.2 in /warc-indexer
#278
dependabot[bot]
closed
2 years ago
0
Bump log4j-core from 2.17.0 to 2.17.1 in /warc-indexer
#277
dependabot[bot]
closed
2 years ago
0
Bump log4j-core from 2.16.0 to 2.17.0 in /warc-indexer
#276
dependabot[bot]
closed
2 years ago
0
Bump log4j-core from 2.15.0 to 2.16.0 in /warc-indexer
#275
dependabot[bot]
closed
2 years ago
0
Bump log4j-core from 2.13.2 to 2.15.0 in /warc-indexer
#274
dependabot[bot]
closed
2 years ago
0
Make tmp usage H020/H3 compatible
#273
anjackson
opened
2 years ago
0
Make the HdfsFileHasher H3/H020 compatible
#272
anjackson
closed
2 years ago
1
Add collections to field that can be updated atomically
#271
anjackson
opened
3 years ago
4
Add support for resource WARC records (produced by e.g. warcit)
#270
tokee
closed
2 years ago
3
moving from elastic to opensearch
#269
aponb
closed
2 years ago
2