Extract URLs from singlefilez files

gildas-lormeau / SingleFileZ

Web Extension to save a faithful copy of an entire web page in a self-extracting ZIP file

GNU Affero General Public License v3.0

1.82k stars, 140 forks

Extract URLs from singlefilez files #158

Status: Closed (closed by ghbook 1 year ago)

ghbook commented 1 year ago

I have a bunch of .zip.html files that I downloaded with the SingleFileZ web extension over the years. I want to extract the URLs (href) from these files into a .txt file using a Node.js script. How can I do this, given that these are not typical HTML files but ZIP archives?

gildas-lormeau commented 1 year ago

You have to unzip the .zip.html files, for example with zip.js (see https://github.com/gildas-lormeau/zip.js). The URL of the saved page can be found in the manifest.json file located in the root folder of the ZIP file; see the originalUrl property.
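
For illustration, the approach described above (open the archive, read manifest.json, take originalUrl) could be sketched in Node.js roughly as follows. This is a minimal sketch rather than the zip.js API: it parses the ZIP structure by hand using only Node.js built-ins, and it assumes the archive is not encrypted, does not use ZIP64, and stores manifest.json either uncompressed or with deflate. The function names readZipEntry, extractOriginalUrl and writeUrlList are hypothetical helpers introduced here; for anything beyond a quick script, a real ZIP library such as zip.js is likely the safer choice.

```javascript
const fs = require("node:fs");
const path = require("node:path");
const zlib = require("node:zlib");

// Hypothetical helper: return the content of one ZIP entry as a Buffer, or null.
// Assumptions: no ZIP64, no encryption, entry is stored (0) or deflated (8).
function readZipEntry(buffer, wantedName) {
  // The End Of Central Directory record starts with the bytes "PK\x05\x06".
  const eocd = buffer.lastIndexOf(Buffer.from([0x50, 0x4b, 0x05, 0x06]));
  if (eocd < 0) throw new Error("no ZIP end-of-central-directory record found");
  const entryCount = buffer.readUInt16LE(eocd + 10);
  const cdSize = buffer.readUInt32LE(eocd + 12);
  const cdOffset = buffer.readUInt32LE(eocd + 16);
  // Self-extracting files have data in front of the archive, so the recorded
  // offsets may be off by a constant amount; compute that shift.
  const shift = eocd - cdSize - cdOffset;
  let pos = cdOffset + shift;
  for (let i = 0; i < entryCount; i++) {
    if (buffer.readUInt32LE(pos) !== 0x02014b50) {
      throw new Error("unexpected central directory signature");
    }
    const method = buffer.readUInt16LE(pos + 10);
    const compressedSize = buffer.readUInt32LE(pos + 20);
    const nameLength = buffer.readUInt16LE(pos + 28);
    const extraLength = buffer.readUInt16LE(pos + 30);
    const commentLength = buffer.readUInt16LE(pos + 32);
    const localOffset = buffer.readUInt32LE(pos + 42) + shift;
    const name = buffer.toString("utf8", pos + 46, pos + 46 + nameLength);
    if (name === wantedName) {
      // Skip the local file header (30 bytes + name + extra field) to reach the data.
      const localNameLength = buffer.readUInt16LE(localOffset + 26);
      const localExtraLength = buffer.readUInt16LE(localOffset + 28);
      const start = localOffset + 30 + localNameLength + localExtraLength;
      const data = buffer.subarray(start, start + compressedSize);
      if (method === 0) return data;
      if (method === 8) return zlib.inflateRawSync(data);
      throw new Error("unsupported compression method " + method);
    }
    pos += 46 + nameLength + extraLength + commentLength;
  }
  return null;
}

// Hypothetical helper: read originalUrl from manifest.json in the archive root.
function extractOriginalUrl(buffer) {
  const manifest = readZipEntry(buffer, "manifest.json");
  if (manifest === null) throw new Error("manifest.json not found");
  return JSON.parse(manifest.toString("utf8")).originalUrl;
}

// Hypothetical helper: write "<file name>: <URL>" lines for every .zip.html file in dir.
function writeUrlList(dir, outFile) {
  const lines = fs
    .readdirSync(dir)
    .filter((name) => name.endsWith(".zip.html"))
    .map((name) => name + ": " + extractOriginalUrl(fs.readFileSync(path.join(dir, name))));
  fs.writeFileSync(outFile, lines.join("\n") + "\n");
  return lines;
}
```

Calling writeUrlList(".", "urls.txt") from the folder containing the saved pages would then produce the text file; files that do not match the assumptions above will throw rather than be skipped silently.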

gildas-lormeau commented 1 year ago

FYI, here is a simple solution for obtaining the URLs from the command line on Linux/WSL/macOS (with unzip and jq installed):

for name in *.zip.html; do printf '%s: ' "$name"; unzip -p "$name" manifest.json | jq -r '.originalUrl'; done

gildas-lormeau commented 1 year ago

I'm moving this issue to the "Discussions" tab.