Open gaowanliang opened 3 years ago
I got around this by the following method.
# (omitted)
root.getroottree().write(file_name, method="html", encoding=self.encoding)
# (omitted)
@mima3 this is one way to do it. The other is to change the .encoding attribute of the WebPage object.
Is there a valid method for ver 7.0 or later versions?
Is it possible to just use the encoding that the webpage itself declares? There are two main ways of getting the declared encoding for decoding:
resource.headers.get_content_charset()
page.info().getparam('charset')
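As a minimal sketch of what "the encoding the page claims" means: the charset is a parameter of the Content-Type header. The helper name `declared_charset` below is hypothetical; it uses only the standard library's `email.message.Message`, which is the base class of the header object that `urllib` responses expose in Python 3.

```python
from email.message import Message


def declared_charset(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or `default`.

    `declared_charset` is a hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns `failobj`
    # when the header carries no charset parameter.
    return msg.get_content_charset(failobj=default)


if __name__ == "__main__":
    print(declared_charset("text/html; charset=UTF-8"))
    print(declared_charset("text/html"))
```

Note that this only reports what the server declares; it says nothing about whether the declaration is correct.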
@BrandonKMLee your first example works on top of the second one, so they are not two separate things. Also, most of the time the encoding reported by a website is wrong, so it is always a matter of trial and error to find the best encoding on the user side.
@rajatomar788 in that case we would need to run the text through Python chardet or cChardet to "smell" the encoding. Even if it is not a guarantee, it is a good default to have.
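The trial-and-error approach discussed above could look roughly like the sketch below. It does not use chardet or cChardet; instead it tries the declared encoding first and then a fixed candidate list, using only the standard library. The function name `decode_with_fallback` and the candidate list are assumptions made for illustration, and a statistical detector could be slotted in as an extra candidate source.

```python
def decode_with_fallback(raw, declared=None,
                         candidates=("utf-8", "gb18030", "cp1252")):
    """Decode `raw` bytes, returning (text, encoding_used).

    Hypothetical helper: tries the declared encoding first, then each
    candidate in order, and finally falls back to lossy UTF-8.
    """
    to_try = ([declared] if declared else []) + list(candidates)
    for enc in to_try:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next one.
            continue
    # Nothing decoded cleanly; keep what we can and mark the rest.
    return raw.decode("utf-8", errors="replace"), "utf-8"


if __name__ == "__main__":
    data = "中文".encode("utf-8")
    # The server wrongly claims ASCII, so the first attempt fails.
    print(decode_with_fallback(data, declared="ascii"))
```

The order of candidates matters: permissive single-byte encodings such as cp1252 accept almost any input, so they belong at the end of the list.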
Is there a valid method for ver 7.0 or later versions?
I solved it by modifying schedulers.py:
class Scheduler(SchedulerBase):
    def _handle_resource(self, resource):
        try:
            self.logger.debug('Scheduler trying to get resource at: [%s]' % resource.url)
            resource.get(resource.context.url)
            # NOTE :meth:`get` can change the :attr:`filepath` of the resource
            resource.encoding = 'utf-8'  # added this line
            self.index.add_resource(resource)
        except ConnectionError:
            self.logger.error(
                "Scheduler ConnectionError Failed to retrieve resource from [%s]"
                % resource.url)
            # self.index.add_entry(resource.url, resource.filepath)
        except Exception as e:
            self.logger.exception(e)
            # self.index.add_entry(resource.url, resource.filepath)
        else:
            self.logger.debug('Scheduler running handler for: [%s]' % resource.url)
            resource.retrieve()
            self.index.add_resource(resource)
The downloaded website is saved as Unicode numeric character references, and with this encoding the browser does not display the real content.
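For readers unfamiliar with the term, the sketch below shows what numeric character references look like and one plausible way they arise: serializing non-ASCII text to an ASCII-only encoding with character-reference replacement. It is an illustration with the standard library only, not a trace of what the library does internally.

```python
import html

text = "中文"

# Encoding to ASCII with xmlcharrefreplace turns every non-ASCII
# character into a decimal numeric character reference.
escaped = text.encode("ascii", "xmlcharrefreplace")
print(escaped)

# html.unescape() resolves the references back to the original characters.
restored = html.unescape(escaped.decode("ascii"))
print(restored == text)
```

Writing the file with an encoding that can represent the characters directly, such as UTF-8, avoids the references altogether.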