[x] the filename part of the archive_path value should be URL-encoded. Incorrect: archive_path: tests/webpages/8364/old.reddit.com/r/gamedev/top/index.html?sort=top&t=week&limit=15.html; correct: archive_path: tests/webpages/8364/old.reddit.com/r/gamedev/top/index.html%3Fsort=top&t=week&limit=15.html
[ ] importers/shaarli_api: add an option to delete archive directories when a link was removed from the imported data file
[x] allow blacklisting specific URLs (use exclude_tags instead)
[x] only download links with specific tags
[ ] (later) allow recursive crawling based on special tags: d2 -> --level=2, d3 -> --level=3, ...
[ ] (later) during recursive crawling, only download URLs with extensions htm, html, zip, png, jpg, wav, ogg, mp3, flac, avi, webm, ogv, mp4, pdf, css...
[ ] (later) during recursive crawling, only follow links to the same domain and/or directory
[ ] (later) allow scanning the description for http: and https: URLs, and also download these pages
[ ] DNS-based ad-blocking? standalone dnsmasq: /usr/sbin/dnsmasq --cache-size=400 --keep-in-foreground --addn-hosts=adblock-hosts.txt --conf-file=/dev/null --conf-dir=/etc/NetworkManager/dnsmasq.d --port 5353 --proxy-dnssec --clear-on-reload
[ ] HTTP proxy-based ad-blocking? (squid?)
[ ] (later) add readability/page alteration features
Other solutions/implementations:
httrack --mime-html --single-log --list urllist.txt --continue --verbose --robots=0 --index --depth=1 --ext-depth=1 --near --user-agent "Mozilla/5.0 (Windows NT 6.1; rv:49.0) Gecko/20100101 Firefox/49.0" -* +*.png +*.jpg +*.gif +*.css +*.js
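
For the archive_path item at the top of this list, the encoding step could look roughly like the sketch below. It is not taken from the project's code: the function name encode_archive_path is hypothetical, and keeping "=" and "&" unencoded is an assumption made only so that the output matches the expected value given in that item.

```python
from urllib.parse import quote


def encode_archive_path(path: str, safe: str = "=&") -> str:
    """Percent-encode only the filename (last component) of a local path.

    Hypothetical helper. The default `safe` keeps '=' and '&' literal,
    an assumption chosen to reproduce the expected value from the TODO item.
    """
    # Split off the last path component; directories are left untouched.
    directory, sep, filename = path.rpartition("/")
    # quote() encodes '?' as %3F; characters listed in `safe` stay as-is.
    return directory + sep + quote(filename, safe=safe)


raw = (
    "tests/webpages/8364/old.reddit.com/r/gamedev/top/"
    "index.html?sort=top&t=week&limit=15.html"
)
print(encode_archive_path(raw))
```

Passing safe="" instead would also encode "=" and "&", which may be preferable if the path is later embedded in a URL with its own query string.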