internetarchive / warcprox

WARC writing MITM HTTP/S proxy

avoid clobbering existing warc #137

Open traverseda opened 5 years ago

traverseda commented 5 years ago

# First run
1.7M    warc_cache/warcs/book.pythontips.com.warc.gz
# Second run, exact same code
516K    warc_cache/warcs/book.pythontips.com.warc.gz
# Deleted the dedupe db but not the warc file
1.7M    warc_cache/warcs/book.pythontips.com.warc.gz

It looks like the dedupe file is used again, but the warc file is being created from scratch. That's definitely not what I would expect. Is that how it's supposed to work? If you're recreating the warc file, shouldn't you be recreating the DB as well?

traverseda commented 5 years ago

Also, this implies to me that even if two files are identical and have identical URLs, a dedupe record is still written? Why is that?

nlevitt commented 5 years ago

> It looks like the dedupe file is used again, but the warc file is being created from scratch. That's definitely not what I would expect. Is that how it's supposed to work? If you're recreating the warc file, shouldn't you be recreating the DB as well?

I'm not sure I understand these questions? Warcprox has no conception of "recreating the warc file".

> Also, this implies to me that even if two files are identical and have identical URLs, a dedupe record is still written? Why is that?

See https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit

Closing as there doesn't seem to be an issue reported here.

traverseda commented 5 years ago

# First run
1.7M    warc_cache/warcs/book.pythontips.com.warc.gz
# Second run, exact same code
516K    warc_cache/warcs/book.pythontips.com.warc.gz

It is the same code being run twice. Can you explain why the files are different sizes? And why the warc file is smaller the second time?

nlevitt commented 5 years ago

Deduplication, presumably. The fact that it went back to the original size after you deleted the dedup db strongly corroborates this. You can also look in the warcs to see what's inside there...

traverseda commented 5 years ago

Notice how it is actually smaller on the second run. I've confirmed that that isn't because of compression.

Is the deduplication run out-of-band? How can the file become smaller the second time I run the command?

nlevitt commented 5 years ago

Deduplication means you don't save a second copy of something if you already have it. The second warc being smaller is the whole point.
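To make the size difference concrete, here is a minimal sketch of digest-based deduplication. It is not warcprox's actual code: `DedupIndex`, `record_for`, and the dict-shaped records are hypothetical names invented for illustration, and the in-memory dict stands in for the dedup db. The idea is only that a payload seen before is stored as a small revisit-style record pointing at the first capture, instead of a second full copy.

```python
import hashlib


class DedupIndex:
    """Hypothetical in-memory stand-in for a dedup database."""

    def __init__(self):
        # Maps payload digest -> URL of the first capture with that payload.
        self._seen = {}

    def record_for(self, url, payload):
        """Return a full record the first time a payload is seen,
        and a body-less revisit-style record on later sightings."""
        digest = "sha1:" + hashlib.sha1(payload).hexdigest()
        if digest in self._seen:
            return {
                "type": "revisit",
                "url": url,
                "digest": digest,
                "refers_to": self._seen[digest],
                "body": b"",
            }
        self._seen[digest] = url
        return {"type": "response", "url": url, "digest": digest, "body": payload}


index = DedupIndex()
page = b"<html>" + b"x" * 1000 + b"</html>"
first = index.record_for("http://example.com/", page)
second = index.record_for("http://example.com/", page)
print(first["type"], len(first["body"]))
print(second["type"], len(second["body"]))
```

Under this model, a second run against an unchanged site with the index still present produces mostly small records, and deleting the index makes every payload look new again, which matches the sizes reported above.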

traverseda commented 5 years ago

So yes, it is deleting the original warc file and creating a new one, instead of appending the new results on to the end of the old one.

traverseda commented 5 years ago

That's the intended behavior?

Perhaps it could copy the old warc file to mywarc.0.warc or something? That behavior is not explicit and I found it to be very confusing. I had presumed it was a bug, and it took me a while to track down the issue.

nlevitt commented 5 years ago

Uhhh. Oh, now I understand the confusion. I had assumed that you had renamed your warcs from warcprox's default naming scheme to book.pythontips.com.warc.gz. Normally warcprox names its warcs such that it basically guarantees uniqueness. But I guess you are using --warc-filename and not using any of the {variables}. The bug you're reporting is that in case of a filename collision, which you can reproduce easily using --warc-filename, the old file gets clobbered. Ok, that's a legitimate bug.

Warcprox is not designed to write to a single warc. It rolls over to a new warc when the active warc reaches a configurable size, or when a configurable time has elapsed since the last write.
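The rollover rule described here can be sketched as a small policy object. This is an illustration under assumptions, not warcprox's implementation: `RolloverPolicy`, `should_roll`, and the threshold values are hypothetical.

```python
import time


class RolloverPolicy:
    """Hypothetical policy: roll over on size or on idle time."""

    def __init__(self, max_size_bytes, idle_seconds):
        self.max_size_bytes = max_size_bytes
        self.idle_seconds = idle_seconds

    def should_roll(self, current_size, last_write_time, now=None):
        """Return True if the active warc should be closed and a new one started."""
        now = time.time() if now is None else now
        too_big = current_size >= self.max_size_bytes
        too_idle = (now - last_write_time) >= self.idle_seconds
        return too_big or too_idle


# Illustrative thresholds only: 1 GB or 10 minutes without a write.
policy = RolloverPolicy(max_size_bytes=10**9, idle_seconds=600)
print(policy.should_roll(current_size=5 * 10**8, last_write_time=0, now=30))
print(policy.should_roll(current_size=5 * 10**8, last_write_time=0, now=900))
```

Because each rollover starts a new file, the filename has to vary between files, which is why a fixed name with no template variables collides.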

I'm thinking we should rename --warc-filename to --warc-filename-template and require at least one of {timestamp17} and {serialno}, and probably panic and die in case of a filename collision.
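A rough sketch of what that proposal could look like, assuming nothing about how warcprox would actually implement it: `validate_template` and `open_new_warc` are hypothetical helpers. The template check uses the standard library's `string.Formatter().parse` to find `{field}` names, and the collision check relies on Python's exclusive-create mode (`"xb"`), which raises `FileExistsError` instead of truncating an existing file.

```python
import os
import string
import tempfile

# The proposal: at least one of these must appear in the template.
REQUIRED_ANY = {"timestamp17", "serialno"}


def template_fields(template):
    """Return the set of {field} names used in a filename template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def validate_template(template):
    """Reject templates that cannot produce unique filenames."""
    if not template_fields(template) & REQUIRED_ANY:
        raise ValueError(
            "template must contain at least one of: "
            + ", ".join("{%s}" % f for f in sorted(REQUIRED_ANY))
        )


def open_new_warc(path):
    """Create a new file, failing loudly if the path already exists."""
    return open(path, "xb")


validate_template("example-{timestamp17}-{serialno}.warc.gz")  # accepted

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "book.pythontips.com.warc.gz")
    open_new_warc(path).close()
    try:
        open_new_warc(path)  # second attempt at the same name
    except FileExistsError:
        print("refusing to clobber", os.path.basename(path))
```

With a check like this, a fixed name such as book.pythontips.com.warc.gz would be rejected at startup, and any collision that still slipped through would stop the run rather than silently overwrite the earlier capture.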

nlevitt commented 5 years ago

> I'm thinking we should rename --warc-filename to --warc-filename-template and require at least one of {timestamp17} and {serialno}, and probably panic and die in case of a filename collision.

@vbanos since --warc-filename is your feature, do you have time to implement this improvement? 😃

traverseda commented 5 years ago

Yeah, that would be a lot less confusing.

Panicking and dying in the case of a filename collision would probably be fine; that would have forced me to read the docs more. I was operating under the assumption that the warc files were essentially an append-only log.

My bad, but it took an embarrassingly long time to notice there was a problem while I was busy dealing with selenium issues.