I was looking into why the serializers are necessary, and found that even with them, an ARCRecord or WARCRecord cannot properly handle a getContentString call outside of the RDD because the stream has already been closed.
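For concreteness, here's a minimal sketch of the failure, assuming warcbase's WacArcInputFormat and ArcRecordWritable load path (package paths may differ):

```scala
import org.apache.hadoop.io.LongWritable
import org.warcbase.io.ArcRecordWritable
import org.warcbase.mapreduce.WacArcInputFormat

val rdd = sc.newAPIHadoopFile("/path/to/example.arc.gz",
  classOf[WacArcInputFormat], classOf[LongWritable], classOf[ArcRecordWritable])

// Fine inside the RDD: the split's stream is still open while mapping.
rdd.map(t => t._2.getRecord.getHeader.getUrl).take(5)

// Fails outside the RDD: once a record reaches the driver (assuming the
// Kryo serializers are registered so the collect itself works), its backing
// InputStream has been closed, so reading the body content throws.
val rec = rdd.first()._2.getRecord
// rec's header metadata is still readable; streaming its content is not.
```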
Instead of having a wrapper around ARC and WARC records, it might be better to implement a loader more similar to the Pig loaders and create a new (serializable) class containing the relevant values. Then we can inspect any record without dealing with serialization.
Does that sound feasible?
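A rough sketch of what I have in mind (field names are illustrative, not an actual warcbase API):

```scala
// Plain serializable holder, populated inside the loader while the
// record's InputStream is still open; nothing here references a stream.
case class ArchiveRecordData(
  url: String,
  crawlDate: String,
  mimeType: String,
  content: Array[Byte]
) extends Serializable

// Inside the loader, per record, something like:
//   ArchiveRecordData(meta.getUrl, meta.getDate, meta.getMimetype, bodyBytes)
```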
But we have Writable versions of WARC and ARC records (e.g., ArcRecordWritable), which already define a serialization protocol... they make a separate copy of the stream so the record can be re-read.
Seems a bit backward to define a separate set of APIs just for this (seemingly) simple task. I found this: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala
Does this help?
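To illustrate, a minimal sketch of wrapping our Writables with it (whether this actually fixes the closed-stream problem depends on write()/readFields() copying the record bytes):

```scala
import org.apache.spark.SerializableWritable

// rdd: RDD[(LongWritable, ArcRecordWritable)] from newAPIHadoopFile, as above.
// SerializableWritable round-trips the value via write()/readFields(), so the
// wrapped Writable survives Java serialization across the cluster.
val portable = rdd.map { case (_, w) => new SerializableWritable(w) }
portable.take(1).foreach(sw => println(sw.value))
```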
For your purposes, it might be possible to use a simple wrapper that refers to the WARC/ARC record by reference - e.g., an HDFS path and an offset - rather than wrapping the InputStream. When you need to get the content, you re-open the file at that point and stream it in.
@anjackson but that's (basically) developing a parallel set of APIs, which I don't want!
I want to be able to go through normal Spark manipulations and end up with the raw record on the console... that I can then examine and manipulate!
I meant altering your record readers, not making new ones. Pass the Path and the offset to ArcRecordWritable (instead of the ARCRecord), and re-create the ARCRecord when ArcRecordWritable.getRecord is called rather than in WacArcInputFormat.initialise. I'd assumed this would be functionally equivalent, as the API is unchanged - it's just moving the ARCRecord initialisation code so there's a chance of re-opening the InputStream.
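Something like this hypothetical sketch (the class and the ARCReaderFactory usage are assumptions about how the stream would be re-opened, not warcbase's actual code):

```scala
import java.io.{DataInput, DataOutput}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.Writable
import org.archive.io.arc.{ARCReaderFactory, ARCRecord}

// Hypothetical variant of ArcRecordWritable: serialize only a reference
// (path + offset) and defer ARCRecord creation to getRecord.
class LazyArcRecordWritable(private var path: String = null,
                            private var offset: Long = 0L) extends Writable {

  override def write(out: DataOutput): Unit = {
    out.writeUTF(path)
    out.writeLong(offset)
  }

  override def readFields(in: DataInput): Unit = {
    path = in.readUTF()
    offset = in.readLong()
  }

  // Re-open the file at the recorded offset on demand, instead of holding
  // the stream opened once in WacArcInputFormat's initialise().
  def getRecord(conf: Configuration): ARCRecord = {
    val in = FileSystem.get(conf).open(new Path(path))
    in.seek(offset)
    val reader = ARCReaderFactory.get(path, in, false)
    reader.iterator().next().asInstanceOf[ARCRecord]
  }
}
```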
But I think I'm barking up the wrong tree here, as your RecordTransformers are built on ARCRecord, not ArcRecordWritable, so you won't get a chance to call getRecord(). Looks like I should go back to yelling at Heritrix.
Thanks for the suggestions! @lintool, could you please review this pull request?
Fixed in commit ba2b44c52b35cb026c5665f250d77f4c9b586a80.
@aliceranzhou I noticed you have in a few places a magical incantation that allows the records to be serialized; for example, see #160.
Would it make sense to propagate it up into the data loading API itself so we don't have to cargo-cult it? When you do that, please also document why these serializers are needed...
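The snippet itself isn't preserved in this thread, but the incantation is presumably something along these lines (a sketch; the registrator class name is an assumption, not warcbase's actual class):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the kind of serializer setup being discussed; the registrator
// class name below is hypothetical.
val conf = new SparkConf()
  .setAppName("warcbase")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.warcbase.spark.MyKryoRegistrator")
val sc = new SparkContext(conf)
```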