lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Propagate Spark serializers into the data loading API #186

Closed. lintool closed this issue 8 years ago.

lintool commented 8 years ago

@aliceranzhou I noticed you have in a few places this magical incantation that allows the records to be serialized:

       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .registerKryoClasses(...)

For example, see #160

Would it make sense to propagate this up into the data loading API itself so we don't have to cargo-cult it? When you do that, please also document why these serializers are needed...
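
Something like the following, perhaps (just a sketch; I'm guessing at the package and class names to register, so treat them as placeholders):

    import org.apache.spark.SparkConf

    // One helper that every loading path calls, so user code never repeats the incantation.
    object WarcbaseSparkConf {
      def withKryo(conf: SparkConf): SparkConf =
        conf
          // Use Kryo instead of the default Java serializer for shuffled/collected data.
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          // Register the record wrapper classes so Kryo can handle them efficiently.
          .registerKryoClasses(Array(
            classOf[org.warcbase.io.ArcRecordWritable],
            classOf[org.warcbase.io.WarcRecordWritable]))
    }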

aliceranzhou commented 8 years ago

I was looking into why the serializers are necessary, and found that even with them, an ARCRecord or WARCRecord cannot handle a getContentString outside the RDD because the underlying stream has already been closed.

Instead of having a wrapper around ARC and WARC records, perhaps it would be better to implement a loader more along the lines of the Pig loaders and create a new (serializable) class containing the relevant values? Then we could inspect any record without dealing with serialization.

Does that sound feasible?
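
Roughly, something like this (just a sketch; the field names and the accessors mentioned in the comment are placeholders, not existing warcbase code):

    // Built while the ARC/WARC stream is still open inside the loader, e.g.
    //   ExtractedRecord(record.getUrl, record.getDate, record.getMimeType, contentString)
    // (those accessor names are illustrative). Case classes are serializable,
    // so no special registration is needed to collect or inspect these.
    case class ExtractedRecord(
      url: String,
      crawlDate: String,
      mimeType: String,
      contentString: String)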


lintool commented 8 years ago

> I was looking into why the serializers are necessary, and found that even with them, an ARCRecord or WARCRecord cannot handle a getContentString outside the RDD because the underlying stream has already been closed.

But we have Writable versions of the WARC and ARC records (e.g., ArcRecordWritable), which already define a serialization protocol... it makes a separate copy of the stream so it can be re-read.

Seems a bit backward to define a separate set of APIs just for this (seemingly) simple task. I found this: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala
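
A minimal, self-contained sketch of that pattern, with a plain Hadoop Text writable standing in for ArcRecordWritable (whether our copied stream would actually survive the round trip is exactly the open question):

    import org.apache.hadoop.io.Text
    import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}

    object SerializableWritableSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sw-sketch").setMaster("local[*]"))

        // Wrap each Writable so it can be shipped and collected like any serializable object.
        val wrapped = sc.parallelize(Seq("a", "b", "c"))
          .map(s => new SerializableWritable(new Text(s)))

        // .value unwraps back to the original Writable on the driver.
        wrapped.collect().foreach(w => println(w.value))

        sc.stop()
      }
    }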

Does this help?

anjackson commented 8 years ago

For your purposes, it might be possible to use a simple wrapper that refers to the WARC/ARC record by reference, e.g. an HDFS path and an offset, rather than wrapping the InputStream. When you need the content, you re-open the file at that point and stream it in.
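
A tiny sketch of that by-reference idea (names made up, nothing here is warcbase or webarchive-commons API):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Serializable by construction: just a path and an offset, never an open stream.
    case class RecordRef(path: String, offset: Long) {
      // Re-open the file and seek to the record only when the content is needed;
      // the caller then parses the ARC/WARC record from the returned stream.
      def open(conf: Configuration): java.io.InputStream = {
        val in = FileSystem.get(conf).open(new Path(path))
        in.seek(offset)
        in
      }
    }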

lintool commented 8 years ago

@anjackson but that's (basically) developing a parallel set of APIs, which I don't want!

I want to be able to go through normal Spark manipulations and end up with the raw record on console... that I can then examine, manipulate!

anjackson commented 8 years ago

I meant altering your record readers, not making new ones. Pass the Path and the offset to ArcRecordWritable (instead of the ARCRecord), and re-create the ARCRecord when ArcRecordWritable.getRecord is called rather than in WacArcInputFormat.initialise. I'd assumed this would be functionally equivalent as the API is unchanged - it's just moving the ARCRecord initialisation code so there's a chance of re-opening the InputStream.
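
Roughly, as a sketch (not the existing class; how the ARCRecord is actually rebuilt from path and offset is left as a placeholder, since the exact Heritrix call is an implementation detail):

    import org.archive.io.arc.ARCRecord

    // Holds only where the record lives, never an open stream, so serializing the
    // wrapper between Spark stages is unproblematic. A real version would also
    // implement Writable's write()/readFields() over path and offset.
    class LazyArcRecordWritable(path: String, offset: Long,
                                reopenAt: (String, Long) => ARCRecord) extends Serializable {
      // Rebuild the ARCRecord on demand, when getRecord is called, instead of
      // once up front in WacArcInputFormat.initialise.
      def getRecord: ARCRecord = reopenAt(path, offset)
    }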

But I think I'm barking up the wrong tree here, as your RecordTransformers are built on ARCRecord, not ArcRecordWritable, so you won't get a chance to call getRecord(). Looks like I should go back to yelling at Heritrix.

aliceranzhou commented 8 years ago

Thanks for the suggestions! @lintool could you please review this pull request?

lintool commented 8 years ago

Fixed in commit ba2b44c52b35cb026c5665f250d77f4c9b586a80