Encoding issues for websites in non-English languages such as Chinese, Japanese, etc.

rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.

https://rajatomar788.github.io/pywebcopy/

Encoding issues for websites in non-English languages such as Chinese, Japanese, etc. #64

Open gaowanliang opened 3 years ago

gaowanliang commented 3 years ago

The downloaded pages are saved with their text written as Unicode numeric character references, and the browser does not display the real content.

mima3 commented 2 years ago

I got around this as follows:

1. Create a new class that inherits from WebPage.
2. Give it a new save_html that writes the tree with the page's own encoding:

# ... (omitted)
            root.getroottree().write(file_name, method="html", encoding=self.encoding)
# ... (omitted)

rajatomar788 commented 2 years ago

@mima3 this is one way to do it. The other is to change the .encoding attribute of the WebPage object.
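
To see why the encoding setting matters, here is a minimal standard-library sketch of what an HTML serializer typically does when a character does not fit the output encoding: it falls back to a numeric character reference. This is a simplified model, not pywebcopy's actual code; the `serialize` helper and the sample text are made up for illustration.

```python
def serialize(raw: bytes, decode_as: str, write_as: str) -> bytes:
    """Decode downloaded bytes, then re-encode them for saving.

    Characters the output encoding cannot represent are written as
    numeric character references (hypothetical helper, not pywebcopy API).
    """
    text = raw.decode(decode_as)
    return text.encode(write_as, errors="xmlcharrefreplace")


raw = "中文".encode("utf-8")  # bytes as a UTF-8 site would serve them

# Correct decode, ASCII output: references to the real characters.
ascii_refs = serialize(raw, "utf-8", "ascii")

# Wrong decode (ISO-8859-1), ASCII output: references to the wrong characters.
mojibake_refs = serialize(raw, "iso-8859-1", "ascii")

# Correct decode, UTF-8 output: the characters are saved as-is.
utf8_out = serialize(raw, "utf-8", "utf-8")

print(ascii_refs)       # b'&#20013;&#25991;'
print(mojibake_refs)    # b'&#228;&#184;&#173;&#230;&#150;&#135;'
print(utf8_out == raw)  # True
```

In this model, the saved page only shows the intended text when the decode step uses the page's real encoding, which is presumably what both workarounds above are steering towards.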

muzicstation commented 1 year ago

Is there a valid method for ver 7.0 or later versions?

BradKML commented 1 year ago

Is it possible to simply use the encoding that the webpage itself declares? There are two main ways of getting the declared encoding to decode with:

https://stackoverflow.com/a/19156107 resource.headers.get_content_charset()
https://stackoverflow.com/a/38807852 page.info().getparam('charset')
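
As a rough sketch of the header-based approach: in Python 3 the response headers object (`http.client.HTTPMessage`) is a subclass of `email.message.Message`, so `get_content_charset()` can be tried out without any network access. The `declared_charset` helper and the sample header values below are illustrative, not part of pywebcopy; `getparam` in the second link is the Python 2 spelling.

```python
from email.message import Message


def declared_charset(content_type: str, default=None):
    """Return the charset parameter of a Content-Type value, lowercased.

    Hypothetical helper: it builds the same kind of message object that
    urllib exposes as ``response.headers`` and asks it for the charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


print(declared_charset("text/html; charset=Shift_JIS"))  # shift_jis
print(declared_charset("text/html"))                     # None
print(declared_charset("text/html", "utf-8"))            # utf-8
```

This only reports what the server claims; it does not check that the body is actually encoded that way.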

rajatomar788 commented 1 year ago

@BrandonKMLee your first example works on top of the second one, so they are not two separate things. Also, most of the time the encoding reported by the website is wrong, so it always comes down to trial and error on the user side to find the best encoding.

BradKML commented 1 year ago

@rajatomar788 in that case one would need to run the text through Python chardet or cChardet to "smell" the encoding. Even if it is not a guarantee, it is a good default to have.
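
The trial-and-error idea can be sketched with the standard library alone: try a list of candidate encodings in order and keep the first one that decodes cleanly. The `decode_with_fallback` helper and the candidate lists are assumptions for illustration; a detector such as chardet could be used to choose or reorder the candidates, which is not shown here.

```python
def decode_with_fallback(data: bytes, candidates):
    """Try each candidate encoding in order; return (text, encoding_used).

    Hypothetical helper. Unknown encoding names (LookupError) are skipped,
    which covers pages that declare a charset Python does not know.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: latin-1 never fails, but may give the wrong characters.
    return data.decode("latin-1"), "latin-1"


sjis_bytes = "日本語".encode("shift_jis")
text, used = decode_with_fallback(sjis_bytes, ["utf-8", "shift_jis"])
print(text, used)  # 日本語 shift_jis
```

A wrong candidate can sometimes decode without raising an error, so the order of the list matters; that is the gap a statistical detector is meant to narrow.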

PeterBon commented 1 year ago

Is there a valid method for ver 7.0 or later versions?

I solved this by modifying schedulers.py:

class Scheduler(SchedulerBase):
    def _handle_resource(self, resource):
        try:
            self.logger.debug('Scheduler trying to get resource at: [%s]' % resource.url)
            resource.get(resource.context.url)
            # NOTE :meth:`get` can change the :attr:`filepath` of the resource
            resource.encoding = 'utf-8'  # add this one line here
            self.index.add_resource(resource)
        except ConnectionError:
            self.logger.error(
                "Scheduler ConnectionError Failed to retrieve resource from [%s]"
                % resource.url)
            # self.index.add_entry(resource.url, resource.filepath)
        except Exception as e:
            self.logger.exception(e)
            # self.index.add_entry(resource.url, resource.filepath)
        else:
            self.logger.debug('Scheduler running handler for: [%s]' % resource.url)
            resource.retrieve()
        self.index.add_resource(resource)