lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

WARCRecord NotSerializableException when trying to get rid of duplicate pages #260

Open dportabella opened 7 years ago

dportabella commented 7 years ago

I am trying to remove duplicate pages as follows:

val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
  .keepValidPages()
  .groupBy(_.getUrl).values.map(_.head)  // remove duplicates
  .map(_.getUrl)
  .take(10)

but I get this exception:
java.io.NotSerializableException: org.archive.io.warc.WARCRecord
Serialization stack:
    - object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@28158a29)
    - field (class: org.warcbase.spark.archive.io.GenericArchiveRecord, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
    - object (class org.warcbase.spark.archive.io.GenericArchiveRecord, org.warcbase.spark.archive.io.GenericArchiveRecord@4258e51d)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)

Any ideas? Or is there another way to achieve the same objective?

dportabella commented 7 years ago

I am currently working around it as follows:

val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
  .keepValidPages()
  // project each record to serializable strings before the shuffle
  .map(r => (r.getUrl, r.getContentString))
  // keep one content string per URL
  .reduceByKey { case (contentString1, contentString2) => contentString1 }
...
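
For anyone who wants to see why the second version avoids the exception, here is a minimal sketch in plain Scala, with no Spark or warcbase on the classpath. It assumes the failure comes from the shuffle that groupBy triggers, which has to serialize each record, and that Spark is using its default Java serializer. RawRecord, ArchiveRecord, and javaSerializable are hypothetical stand-ins made up for this illustration; they are not warcbase or Spark APIs.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for WARCRecord: it does NOT implement Serializable.
class RawRecord(val bytes: Array[Byte])

// Hypothetical stand-in for GenericArchiveRecord: marked Serializable,
// but it holds a field whose type is not.
class ArchiveRecord(val url: String, val raw: RawRecord) extends Serializable {
  def contentString: String = new String(raw.bytes, "UTF-8")
}

object DedupSketch {
  // Roughly what a Java-serializer-based shuffle has to do with each value.
  // The stream is in-memory, so it is not closed explicitly.
  def javaSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      new ArchiveRecord("http://a.example/", new RawRecord("first".getBytes("UTF-8"))),
      new ArchiveRecord("http://a.example/", new RawRecord("second".getBytes("UTF-8"))),
      new ArchiveRecord("http://b.example/", new RawRecord("other".getBytes("UTF-8"))))

    // Whole records cannot be written out because of the raw field.
    println(javaSerializable(records.head))      // false

    // Projecting to (url, content) leaves only strings, which serialize fine.
    val pairs = records.map(r => (r.url, r.contentString))
    println(pairs.forall(javaSerializable))      // true

    // Local analogue of reduceByKey: keep one content string per URL.
    val deduped = pairs.groupBy(_._1).map { case (_, sameUrl) => sameUrl.head }
    println(deduped.size)                        // 2
  }
}
```

The point of the sketch is only the ordering: extract the serializable fields first, then deduplicate by key, so that no raw record object ever needs to cross the shuffle.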