Open traverseda opened 5 years ago
Also, this implies to me that even if two files are identical and have identical URLs, a dedupe record is still written? Why is that?
It looks like the dedupe file is used again, but the warc file is being created from scratch. That's definitely not what I would expect. Is that how it's supposed to work? If you're recreating the warc file, shouldn't you be recreating the DB as well?
I'm not sure I understand these questions? Warcprox has no conception of "recreating the warc file".
Also, this implies to me that even if two files are identical and have identical URLs, a dedupe record is still written? Why is that?
See https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit
Closing as there doesn't seem to be an issue reported here.
#first run
1.7M warc_cache/warcs/book.pythontips.com.warc.gz
#Second run, exact same code
516K warc_cache/warcs/book.pythontips.com.warc.gz
It is the same code being run twice. Can you explain why the files are different sizes? And why the warc file is smaller the second time?
Deduplication, presumably. The fact that it went back to the original size after you deleted the dedup db strongly corroborates this. You can also look in the warcs to see what's inside there...
Notice how it is actually smaller on the second run. I've confirmed that that isn't because of compression.
Is the deduplication run out-of-band? How can the file become smaller the second time I run the command?
Deduplication means you don't save a second copy of something if you already have it. The second warc being smaller is the whole point.
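To illustrate the point above, here is a minimal sketch of digest-based deduplication. It is not warcprox's actual code: `record_for` and the in-memory `seen` dict are hypothetical stand-ins for the writer and the dedup db. It only shows why a second capture of the same payload takes far less space than the first.

```python
import hashlib


def record_for(url, payload, seen):
    """Decide what to store for one fetched response.

    `seen` maps payload digest -> URL of the first capture. It is a
    hypothetical in-memory stand-in for a persistent dedup database.
    Returns (record_type, stored_bytes).
    """
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    if digest in seen:
        # Already captured: store a small "revisit" record that points
        # back at the earlier capture, with no payload body.
        return ("revisit", b"")
    seen[digest] = url
    # First time this payload is seen: store it in full.
    return ("response", payload)


seen = {}
body = b"<html>" + b"x" * 1000 + b"</html>"

first = record_for("http://example.com/", body, seen)
second = record_for("http://example.com/", body, seen)

print(first[0], len(first[1]))
print(second[0], len(second[1]))
```

Because the dedup state survives between runs, the second run stores revisit records instead of full bodies, which is consistent with the smaller file.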
So yes, it is deleting the original warc file and creating a new one, instead of appending the new results on to the end of the old one.
That's the intended behavior?
Perhaps it could copy the old warc file to mywarc.0.warc or something? That behavior is not explicit and I found it to be very confusing. I had presumed it was a bug, and it took me a while to track down the issue.
Uhhh. Oh, now I understand the confusion. I had assumed that you had renamed your warcs from warcprox's default naming scheme to book.pythontips.com.warc.gz. Normally warcprox names its warcs such that it basically guarantees uniqueness. But I guess you are using --warc-filename and not using any of the {variables}. The bug you're reporting is that in case of a filename collision, which you can reproduce easily using --warc-filename, the old file gets clobbered. Ok, that's a legitimate bug.
Warcprox is not designed to write to a single warc. It rolls over to a new warc when the active warc reaches a configurable size, or a configurable time since the last write has elapsed.
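As a rough sketch of the rollover rule described above (not warcprox's implementation; `RolloverPolicy` and its parameter names are made up for illustration), the decision boils down to two checks:

```python
import time


class RolloverPolicy:
    """Hypothetical helper: decides when a writer should start a new warc."""

    def __init__(self, max_bytes, max_idle_seconds):
        self.max_bytes = max_bytes
        self.max_idle_seconds = max_idle_seconds

    def should_roll(self, current_size, last_write_time, now=None):
        # Roll over when the active file is big enough, or when nothing
        # has been written to it for long enough.
        now = time.time() if now is None else now
        too_big = current_size >= self.max_bytes
        too_idle = (now - last_write_time) >= self.max_idle_seconds
        return too_big or too_idle


policy = RolloverPolicy(max_bytes=1000, max_idle_seconds=60)
print(policy.should_roll(current_size=1000, last_write_time=0, now=1))
print(policy.should_roll(current_size=10, last_write_time=100, now=120))
```

Each rollover needs a fresh filename, which is why a fixed name with no variables in it ends up colliding with itself.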
I'm thinking we should rename --warc-filename to --warc-filename-template and require at least one of {timestamp17} and {serialno}, and probably panic and die in case of a filename collision.
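A minimal sketch of what that proposal could look like, assuming the template uses Python `str.format`-style fields. The function names here are hypothetical and this is not the change that was actually made to warcprox. It rejects templates that contain neither variable and refuses to overwrite an existing file.

```python
import os
import string

# At least one of these fields must appear in the template (per the proposal).
REQUIRED_FIELDS = {"timestamp17", "serialno"}


def template_fields(template):
    """Return the set of {field} names used in a format-style template."""
    parsed = string.Formatter().parse(template)
    return {name for _, name, _, _ in parsed if name}


def validate_template(template):
    """Raise ValueError unless the template can produce unique names."""
    if not (template_fields(template) & REQUIRED_FIELDS):
        raise ValueError(
            "template must contain {timestamp17} or {serialno}: %r" % template
        )
    return template


def open_new_warc(directory, template, **values):
    """Create a new warc file, failing loudly on a filename collision."""
    name = validate_template(template).format(**values) + ".warc.gz"
    # Mode "xb" is exclusive creation: it raises FileExistsError instead
    # of silently truncating a file that is already there.
    return open(os.path.join(directory, name), "xb")
```

With this, a template like `book.pythontips.com` is rejected up front, and a second attempt to create the same name raises `FileExistsError` rather than clobbering the earlier warc.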
@vbanos since --warc-filename is your feature, do you have time to implement this improvement? 😃
Yeah, that would be a lot less confusing.
Panicking and dying in the case of a filename collision would probably be fine; that would have forced me to read the docs more. I was operating under the assumption that the warc files were essentially an append-only log.
My bad, but it took an embarrassingly long time to notice there was a problem while I was busy dealing with selenium issues.