gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
GNU Affero General Public License v3.0

Crawling when passing a list of URLs only works for the first URL #128

Closed TheBv closed 1 month ago

TheBv commented 1 month ago

When crawling a list of different URLs passed in via the --urls-file parameter, e.g.

./single-file-x86_64-linux --urls-file=list-urls.txt --crawl-replace-URLs=true --crawl-links=true
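
where list-urls.txt is a plain-text file with one URL per line, for example (hypothetical URLs):

https://example.com/
https://example.org/blog/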

Single File only crawls the very first URL in the list. After some triaging, I found the issue to be related to these parts of the code:

https://github.com/gildas-lormeau/single-file-cli/blob/19738899d111ee154d4141b07c7c71842705fce5/single-file-cli-api.js#L165
https://github.com/gildas-lormeau/single-file-cli/blob/19738899d111ee154d4141b07c7c71842705fce5/single-file-cli-api.js#L198

Here we check whether the URL is an innerLink based on the root task, but this will basically never be the case. I "adjusted" this by using the parentTask instead, and things now work as I expect them to.
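
To illustrate (a simplified sketch, not the actual code at the lines above — the task shape and the getHostURL() helper are assumptions here):

// Original check: a crawled link only counts as an inner link when it matches
// the root task's URL, which in practice only holds for the first URL in the list.
const isInnerLink = (link, rootTask) =>
    link.startsWith(getHostURL(rootTask.url));

// My adjustment: compare against the parent task (the page the link was found
// on) instead, so links discovered under every listed URL are crawled.
const isInnerLinkAdjusted = (link, parentTask) =>
    link.startsWith(getHostURL(parentTask.url));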

There is most likely a reason why rootTask is used here that I'm not aware of, and maybe this is intended behavior.

But I could see other people expecting the same outcome as me when passing these parameters.

gildas-lormeau commented 1 month ago

Thank you very much for the detailed report. I was able to reproduce and fix the issue. The fix will be available in the next version.