harvard-lil / js-wacz

JavaScript module and CLI tool for working with web archive data using the WACZ format specification.
MIT License
11 stars 4 forks source link

CLI should return exit code != 0 on blocking error #97

Closed matteocargnelutti closed 3 months ago

matteocargnelutti commented 3 months ago

@tw4l If you have a moment, it'd be cool to get your input on this PR.

Specifically, I wonder if the CLI should return exit code 1 when it fails to process user-provided CDXJ.

tw4l commented 3 months ago

Specifically, I wonder if the CLI should return exit code 1 when it fails to process user-provided CDXJ.

Yes, I think this would be a good idea. On a related note, I wonder if this processing should be within the WACZ class itself rather than in the CLI, as it currently means if we want to use the WACZ class in node, we have to duplicate a lot of the CDXJ-related code from bin/cli.js rather than just calling the class with the path to the CDXJ directory. Probably best handled in a separate PR however (I know Ilya also had concerns about handling the CDX in-memory, as some of our crawls can get quite large and might OOM).

matteocargnelutti commented 3 months ago

Thanks @tw4l ! I added exit code 1 for that case.


On a related note, I wonder if this processing should be within the WACZ class itself rather than in the CLI, as it currently means if we want to use the WACZ class in node, we have to duplicate a lot of the CDXJ-related code from bin/cli.js rather than just calling the class with the path to the CDXJ directory. Probably best handled in a separate PR however (I know Ilya also had concerns about handling the CDX in-memory, as some of our crawls can get quite large and might OOM).

My understanding is that, in that context, you probably want to be able to inject pre-processed index.cdx and index.cdx.gz files directly into the resulting ZIP. I could see that being an option on the WACZ class. Maybe indexes, which would be a path to a folder containing .cdx/ .cdx.gz files.

If provided:

We could them hoist that feature to the CLI so there's no code duplication.

Something along those lines maybe? Happy to keep that conversation going in a dedicated issue and/or PR :).