harvard-lil / js-wacz

JavaScript module and CLI tool for working with web archive data using the WACZ format specification.
MIT License
11 stars 4 forks source link

Modify --pages option to copy pages files directly into WACZ #92

Closed tw4l closed 3 months ago

tw4l commented 3 months ago

Fixes #91

Adds tests as well. Happy to make any changes you see fit. Thanks for the review!

matteocargnelutti commented 3 months ago

Hi @tw4l -- thanks for the PR!

I'm on board with the idea 🏄

Here are two things I would suggest:

What do you think?

tw4l commented 3 months ago

Hi @tw4l -- thanks for the PR!

I'm on board with the idea 🏄

Here are two things I would suggest:

* `--pages-dir` should be a replacement of `--pages`, so we don't have two features behaving in almost identical ways. I'd suggest replacing the current `--pages` option with your feature.

* I would add some data validation in `copyPagesFilesToZip()` so we don't accidentally copy over an invalid `pages.jsonl` / `extraPages.jsonl` file. For example, I'd want to be "sure" it's a series of JSON objects, and that they match the spec.

What do you think?

Those both make sense to me! Happy to push these changes tomorrow morning :)

tw4l commented 3 months ago

Good morning @matteocargnelutti !

I've gone ahead and made the changes, as well as checking for a .jsonl extension (case-insensitive) rather than checking for pages.jsonl and extraPages.jsonl specifically, to allow for future flexibility with pages filenames.

Let me know if anything!

matteocargnelutti commented 3 months ago

Thanks again for a great PR, @tw4l