lintool / warcbase

Warcbase is an open-source platform for managing analyzing web archives
http://warcbase.org/
161 stars 47 forks source link

Make WARecord serializable with KyroSerializer #188

Closed aliceranzhou closed 8 years ago

lintool commented 8 years ago

A few comments:

Playing with src/test/resources/arc/example.arc.gz, I'm getting:

ERROR ArcRecordUtils - Read 1224 bytes but expected 1300 bytes. Continuing...

I believe the error has always been there (something inconsistent in the data itself), but previously it's been masked by the API... so this doesn't seem like something we should worry about?

If I do:

val r = RecordLoader.loadArc("src/test/resources/arc/example.arc.gz", sc).keepValidPages().take(10)

It works, which is what I'd want, so yay! However it gives a type Array[org.warcbase.spark.matchbox.RecordTransformers.WARecord].

I'm wondering why WARecord is an inner class of RecordTransformers?

Shouldn't we actually have something like:

Note they shouldn't actually be in the matchbox package since they aren't UDFs, no? Also, ArchiveRecord instead of WARecord to clearer?

lintool commented 8 years ago

I added the API question as an issue here: https://github.com/lintool/warcbase/issues/189

So it doesn't get lost in a pull request.

aliceranzhou commented 8 years ago

Good point. I've moved the records into separate classes. The only thing is that they are no longer implicitly converted.

For the error, should we keep it as is or disable the error reporting?

lintool commented 8 years ago

The only thing is that they are no longer implicitly converted.

What are the implications of this? As in, are your scripts going to change?

For the error, should we keep it as is or disable the error reporting?

Keep it for now...

aliceranzhou commented 8 years ago

No, the scripts will stay the same. Previously, any ARCRecord or WARCRecord was implicitly converted to a WARecord. Now, the RecordLoader explicitly initializes ArchiveRecords.

On Thu, Dec 10, 2015 at 9:25 PM Jimmy Lin notifications@github.com wrote:

The only thing is that they are no longer implicitly converted.

What are the implications of this? As in, are your scripts going to change?

For the error, should we keep it as is or disable the error reporting?

Keep it for now...

— Reply to this email directly or view it on GitHub https://github.com/lintool/warcbase/pull/188#issuecomment-163816663.

lintool commented 8 years ago

Looks good, thanks! I've merged commit ba2b44c52b35cb026c5665f250d77f4c9b586a80