lintool/warcbase
Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/
161 stars, 47 forks
issues
visualization: add file menu
#262
hellowsummer
closed
7 years ago
1
Use Apache commons io utils for more robust array copying
#261
zackwang
closed
7 years ago
0
WARCRecord NotSerializableException when trying to get rid of duplicate pages
#260
dportabella
opened
7 years ago
1
add extract text from html
#259
daithang1111
opened
7 years ago
0
fix the build error when run 'mvn clean package appassembler:assemble…
#258
cheng10
opened
7 years ago
0
load a warc archive, filter it, and produce another warc archive
#257
dportabella
opened
8 years ago
4
Issue #255 - Fixed NoSuchFieldFoundException in IngestFiles
#256
dedocibula
opened
8 years ago
0
NoSuchFieldException in org.warcbase.data.HBaseTableManager
#255
dedocibula
opened
8 years ago
0
Memory Issues on Large WARC Files
#254
ianmilligan1
opened
8 years ago
0
issue-252: Also allow XHTML through in keepValidPages
#253
anjackson
closed
8 years ago
0
keepValidPages discards XHTML
#252
anjackson
closed
8 years ago
1
java.lang.NullPointerException on Collection
#251
ianmilligan1
closed
8 years ago
6
use WET files from CommonCrawl
#250
dportabella
opened
8 years ago
7
updated dependencies' versions
#249
dportabella
opened
8 years ago
2
running a spark application fails on EC2 with warcbase dependency
#248
dportabella
opened
8 years ago
0
How to load an input from S3?
#247
dportabella
opened
8 years ago
1
java.lang.OutOfMemoryError: Java heap space
#246
dportabella
opened
8 years ago
6
fail on declaring a dependency on warcbase-core in a SBT project
#245
dportabella
opened
8 years ago
1
java.util.zip.ZipException: invalid distance code
#244
ianmilligan1
closed
8 years ago
4
Crawl Visualization
#243
ianmilligan1
closed
7 years ago
3
Multiple partitions
#242
youngbink
closed
8 years ago
0
Changed output type to rdd
#241
youngbink
closed
8 years ago
0
Should we periodically release pre-built binaries/jars?
#240
ibnesayeed
opened
8 years ago
12
Dockerize Warcbase
#239
ianmilligan1
closed
7 years ago
11
extract (url, plain text) of rdds
#238
youngbink
closed
8 years ago
0
Checksum
#237
youngbink
closed
8 years ago
0
Trantor upgraded to CDH 5.7.1
#236
lintool
closed
8 years ago
1
Break Warcbase up into sub-artifacts
#235
lintool
closed
8 years ago
3
Error handling for broken ARC/WARC files
#234
ianmilligan1
closed
8 years ago
10
Maven error
#233
drjwbaker
opened
8 years ago
17
Built-in Image URL building from wayback
#232
greebie
closed
7 years ago
6
Upgrade to Spark 1.6.1?
#231
ianmilligan1
closed
8 years ago
7
K means
#230
youngbink
closed
6 years ago
0
Pagerank
#229
youngbink
closed
8 years ago
0
Adding keepContent to warcbase
#228
ianmilligan1
closed
8 years ago
1
Issues with serialization on persistence
#227
bzz
opened
8 years ago
2
K-Means Clustering
#226
ianmilligan1
opened
8 years ago
4
More robust tweet parsing
#225
lintool
closed
8 years ago
1
Process.py needs to be redone for Spark (designed for Pig)
#224
ianmilligan1
closed
8 years ago
3
Non-Critical Error while Building Warcbase
#223
ianmilligan1
closed
8 years ago
1
java.lang.NegativeArraySizeException
#222
ianmilligan1
closed
8 years ago
15
New getCrawlmonth function
#221
ianmilligan1
closed
8 years ago
6
Contributing Guidelines for Warcbase
#220
ianmilligan1
closed
8 years ago
0
Contributing Guidelines for Warcbase
#219
ianmilligan1
closed
8 years ago
0
Documenting D3.js Link Visualization
#218
ianmilligan1
closed
8 years ago
1
New Twitter Features: Few Suggestions, Request for Further Suggestions
#217
ianmilligan1
opened
8 years ago
1
Tweet URL Extraction: All Twitter Shortlinks
#216
ianmilligan1
opened
8 years ago
13
Freeze on master until 15 April
#215
ianmilligan1
closed
8 years ago
0
Example counting prevalence of tweeted images
#214
lintool
closed
8 years ago
1
Link Visualization to Live in Warcbase
#213
ianmilligan1
closed
8 years ago
0