rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/
Other
527 stars 106 forks

Overwrite only if file changed mode #41

Open afonari opened 4 years ago

afonari commented 4 years ago

Is it possible to only overwrite the file if the file changed since the last crawl?

rajatomar788 commented 4 years ago

I honestly don't think this is possible within my capacity. If anyone has suggestions, I can certainly implement them.

BradKML commented 1 year ago

Answer: this is not possible by merely checking URLs. However, multimedia files probably do not change often, so a "do not update" list for multimedia would likely be more useful.
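A minimal sketch of such a "do not update" check, assuming a crawler that knows the local path it would save each URL to. The names `MEDIA_EXTENSIONS` and `should_skip_download` are hypothetical and are not part of pywebcopy:

```python
import os
from urllib.parse import urlparse

# Hypothetical list of extensions treated as rarely changing multimedia.
MEDIA_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg",
    ".mp3", ".mp4", ".woff", ".woff2",
}


def should_skip_download(url, local_path):
    """Return True when url is a media file that already exists locally."""
    # Use only the path component so query strings such as ?v=2 are ignored.
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    return extension in MEDIA_EXTENSIONS and os.path.exists(local_path)
```

The trade-off is that a media file replaced on the server under the same URL would never be refreshed.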

For text pages, it would be more useful to first check when the page was last modified. See here and here for reference (the date could be inaccurate, however). In Python, this can be done with urllib:

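from urllib.request import urlopen
# Last-Modified is optional; this is None when the server omits the header.
with urlopen("http://example.com") as response:
    last_modified = response.headers.get("Last-Modified")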
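To turn the header read above into an overwrite decision, one option is to compare it with the modification time of the local copy. The following is a sketch under that assumption. `remote_is_newer` is a hypothetical helper, not a pywebcopy function, and it falls back to re-downloading whenever the header is missing or unparseable:

```python
import os
from email.utils import parsedate_to_datetime


def remote_is_newer(last_modified_header, local_path):
    """Return True when the local copy should be downloaded again."""
    if last_modified_header is None or not os.path.exists(local_path):
        return True  # nothing to compare against, so download
    try:
        remote_time = parsedate_to_datetime(last_modified_header).timestamp()
    except (TypeError, ValueError):
        return True  # malformed date, so play it safe and download
    # Compare the server timestamp with the local file's modification time.
    return remote_time > os.path.getmtime(local_path)
```

The first argument would be the `Last-Modified` header value of the response. The result is only as reliable as the server's header, which dynamically generated pages often omit or set to the current time.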

Some other people have recommended using a checksum instead, but that is unreliable on dynamically generated websites (especially those with ads), whose content constantly mutates (e.g. recommended reading lists): the checksum would differ on almost every crawl even when the main content is unchanged.
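For reference, the checksum approach could look roughly like the sketch below. `content_changed` and `save_if_changed` are hypothetical names, not pywebcopy functions, and the choice of SHA-256 is an assumption. The content still has to be downloaded in full, so this only avoids the disk write, not the network transfer:

```python
import hashlib
import os


def content_changed(new_content, local_path):
    """Return True when new_content (bytes) differs from the saved file."""
    if not os.path.exists(local_path):
        return True
    with open(local_path, "rb") as existing:
        old_digest = hashlib.sha256(existing.read()).hexdigest()
    return hashlib.sha256(new_content).hexdigest() != old_digest


def save_if_changed(new_content, local_path):
    """Write new_content only when it differs; return True if written."""
    if not content_changed(new_content, local_path):
        return False
    with open(local_path, "wb") as target:
        target.write(new_content)
    return True
```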

There is no perfect solution; the user would have to judge which approach works better for the site in question.