rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Encoding issue #93

Closed: pbtsrc closed this issue 2 years ago

pbtsrc commented 2 years ago

The symbol '“' (Left Double Quotation Mark, U+201C) in the HTML of a web page turns into mojibake (â€œ) after the page is downloaded with save_complete(). The same problem seems to be described here: https://stackoverflow.com/a/52615216. In pywebcopy.webpage I replaced self.set_source(req.raw, req.encoding) with self.set_source(req.raw), and the problem went away.
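The symptom reported here is what typically happens when UTF-8 bytes are decoded with a single-byte Western encoding. The following is a minimal standalone sketch of that effect; it does not use pywebcopy or requests, and the choice of cp1252 as the wrong codec is an assumption made for illustration.

```python
# U+201C LEFT DOUBLE QUOTATION MARK, the character from the report.
original = "\u201c"

# UTF-8 stores this character as three bytes.
raw = original.encode("utf-8")  # b'\xe2\x80\x9c'

# Decoding those bytes with a single-byte codec yields three
# unrelated characters instead of one quotation mark.
garbled = raw.decode("cp1252")

# Reversing the wrong step recovers the original, which shows the
# bytes were intact and only the decoding was wrong.
repaired = garbled.encode("cp1252").decode("utf-8")
```

Because the bytes themselves are not damaged, the fix belongs at the decoding step, which is why changing the encoding passed along with the response makes the problem disappear.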

rajatomar788 commented 2 years ago

You can fix this by setting the encoding attribute manually. Have you tried that?

pbtsrc commented 2 years ago

Could you please point out where I can set the encoding attribute? I use this code to fetch a web page:

import pywebcopy

# url, project_dir and project_name are defined elsewhere
pywebcopy.config.setup_config(url, project_dir, project_name, over_write=True)
wp = pywebcopy.WebPage()
wp.get(url)
wp.save_complete()

rajatomar788 commented 2 years ago

pywebcopy.config.setup_config(url, project_dir, project_name, over_write=True)
wp = pywebcopy.WebPage()
wp.get(url)

# Set the encoding here, before saving
wp.encoding = 'utf-8'

wp.save_complete()

pbtsrc commented 2 years ago

Yes, this works, but only for UTF-8-encoded pages. Unfortunately, we need to support several encodings. I tried setting wp.encoding = None instead, and it now seems to work as expected; requests probably falls back to detecting the encoding from the page content. I checked several pages with different encodings and they were all saved without errors. Thank you.
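The difference between forcing an encoding and leaving it unset can be illustrated with a small standalone sketch. It does not use pywebcopy or requests: `decode_page` and its fallback order (declared encoding, then a `<meta charset>` hint, then UTF-8, then cp1252) are hypothetical choices made for this illustration, not the detection that either library actually performs.

```python
import re
from typing import Optional

# Hypothetical helper for illustration; not part of requests or pywebcopy.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def decode_page(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode HTML bytes, guessing from the content when nothing is declared."""
    if declared is not None:
        # A forced encoding is trusted blindly; a wrong value gives mojibake.
        return raw.decode(declared, errors="replace")

    # Look for a charset hint near the top of the document.
    match = _META_CHARSET.search(raw[:2048])
    if match:
        try:
            return raw.decode(match.group(1).decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            pass  # Unknown or wrong hint: fall through to the next guess.

    # Strict UTF-8 fails on most non-UTF-8 input, so try it first.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


utf8_page = "<p>\u201cquoted\u201d</p>".encode("utf-8")
cyrillic_text = '<meta charset="windows-1251"><p>\u041f\u0440\u0438\u0432\u0435\u0442</p>'
cyrillic_page = cyrillic_text.encode("cp1251")

forced = decode_page(utf8_page, "ISO-8859-1")  # wrong encoding forced
guessed_utf8 = decode_page(utf8_page)          # nothing declared
guessed_cyrillic = decode_page(cyrillic_page)  # nothing declared
```

In this sketch the forced call garbles the quotation marks, while both unforced calls recover the original text, which mirrors the behaviour reported above for pages in different encodings.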

rajatomar788 commented 2 years ago

Yes, it was designed to work this way. Please read the documentation more closely, and feel free to contribute the part you think should have been explained.

malone6 commented 1 year ago

The discussion above covers setting the encoding for a single page (WebPage.encoding). I am using pywebcopy.save_website(**kwargs...) instead. Could you please point out where I can set the encoding in that case?

rajatomar788 commented 1 year ago

@malone6 the config has an encoding key, which can be set to the desired encoding.