lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

loadWarc generating empty arrays #166

Closed ianmilligan1 closed 8 years ago

ianmilligan1 commented 8 years ago

We might have a potential bug (or it could be a misunderstanding of the code base). This script works well with an ARC file, using loadArc:

val r = RecordLoader.loadArc(arc, sc)
  .keepValidPages()
  .map(r => ExtractTopLevelDomain(r.getUrl))
  .countItems()
  .take(10)

But the same script fails on WARC files when loadArc is replaced with loadWarc:

val r = RecordLoader.loadWarc(warc, sc)
  .keepValidPages()
  .map(r => ExtractTopLevelDomain(r.getUrl))
  .countItems()
  .take(10)

It returns an empty array.

Using sample data uploaded here.
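
One way to narrow down where the records disappear is to count them before and after the filter. This is only a rough sketch for spark-shell: the import paths are assumed from the warcbase package layout and may differ by version, and the file path is a hypothetical placeholder for the sample WARC.

```scala
// Run inside spark-shell with the warcbase jar on the classpath;
// `sc` is the SparkContext that the shell provides.
// Assumption: these import paths match the warcbase version in use.
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

// Hypothetical path; substitute the sample WARC file.
val warc = "/path/to/sample.warc.gz"

val records = RecordLoader.loadWarc(warc, sc)

// If this is 0, the loader itself is returning no records.
val loaded = records.count()

// If loaded > 0 but this is 0, keepValidPages() is rejecting every record.
val valid = records.keepValidPages().count()

println(s"loaded=$loaded, valid=$valid")
```

Comparing the two counts should indicate whether the problem sits in loadWarc or in the keepValidPages filter, without involving ExtractTopLevelDomain or countItems at all.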

aliceranzhou commented 8 years ago

It was a bug indeed. I've merged the fix.

ianmilligan1 commented 8 years ago

Great stuff, @aliceranzhou – just confirmed that it's all working as it should. Many thanks.

ruebot commented 8 years ago

Thanks @aliceranzhou!