Description
Null byte characters get inserted into the HTML pages that Docusaurus generates when the language is CJK. This is also registered as a known issue in Docusaurus itself.
When I scrape such a page with docs-scraper, it doesn't scrape anything at all. Logic to strip the null byte characters is required.
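For context, here is a minimal sketch of the failure mode, assuming the scraper parses pages with lxml (as docsearch-style scrapers do); the HTML string is just a stand-in for a Docusaurus page with an embedded null byte, and the exact exception type may vary across lxml versions:

import lxml.html

page = '<html><body><p>\u0000안녕하세요</p></body></html>'

try:
    lxml.html.fromstring(page)
except ValueError as error:
    # lxml rejects unicode strings containing NULL bytes outright
    print(error)

# Stripping the null bytes first lets the page parse normally
print(lxml.html.fromstring(page.replace('\u0000', '')).text_content())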
I worked around this by modifying the files as shown below. Please refer to the changes and improve them where appropriate.
documentation_spider.py:162
def parse_from_sitemap(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)
    # Remove null bytes before handing the body to the strategy
    response_text = response.text.replace('\u0000', '')
    if (not self.force_sitemap_urls_crawling) and (
            not self.is_rules_compliant(response)):
        print("\033[94m> Ignored from sitemap:\033[0m " + response.url)
    else:
        # self.add_records(response, from_sitemap=True)
        self.add_records(response.replace(body=response_text),
                         from_sitemap=True)
    # We don't return self.parse(response) in order to avoid crawling those web pages

def parse_from_start_url(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)
    # Remove null bytes before handing the body to the strategy
    response_text = response.text.replace('\u0000', '')
    if self.is_rules_compliant(response):
        # Pass the cleaned response here too, for consistency with
        # parse_from_sitemap
        self.add_records(response.replace(body=response_text),
                         from_sitemap=False)
    else:
        print("\033[94m> Ignored: from start url\033[0m " + response.url)
    # return self.parse(response)
    return self.parse(response.replace(body=response_text))
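Alternatively, instead of repeating the cleanup in each callback, the stripping could happen once for every response in a small Scrapy downloader middleware. This is only a sketch; NullByteMiddleware is a hypothetical class, not something that exists in docs-scraper:

from scrapy.http import TextResponse

class NullByteMiddleware:
    """Strip null bytes from every text response before spiders see it."""

    def process_response(self, request, response, spider):
        if isinstance(response, TextResponse) and '\u0000' in response.text:
            # Passing a str body lets Scrapy re-encode it with the
            # response's own declared encoding
            return response.replace(body=response.text.replace('\u0000', ''))
        return response

It would be enabled through the DOWNLOADER_MIDDLEWARES setting in the Scrapy configuration.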
custom_downloader_middleware.py:37
# body = self.driver.page_source.encode('utf-8')
# Remove null bytes from the rendered page source before encoding
body = self.driver.page_source.replace('\u0000', '')
body = body.encode('utf-8')
url = self.driver.current_url
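The point is that the null bytes are removed from the page_source str before it is encoded to bytes. A tiny standalone check, with page_source standing in for self.driver.page_source:

page_source = '<html><body>\u0000테스트\u0000</body></html>'

body = page_source.replace('\u0000', '')  # strip null bytes first
body = body.encode('utf-8')               # then encode to bytes

assert b'\x00' not in body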
default_strategy.py:37
if self._body_contains_stop_content(response):
    return []

# remove null byte
cleaned_body = response.text.replace('\u0000', '')
self.dom = self.get_dom(response.replace(body=cleaned_body.encode('utf-8')))
self.dom = self.remove_from_dom(self.dom, self.config.selectors_exclude)

records = self.get_records_from_dom(response.url)

return records
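As a quick self-contained check of the response.replace() cleanup used above, the snippet below builds a fake HtmlResponse with embedded null bytes (the URL and body are made up):

from scrapy.http import HtmlResponse

raw = '<html><body><p>\u0000안녕하세요\u0000</p></body></html>'.encode('utf-8')
response = HtmlResponse(url='https://example.com/ko/docs/', body=raw, encoding='utf-8')

cleaned_body = response.text.replace('\u0000', '')
cleaned = response.replace(body=cleaned_body.encode('utf-8'))

assert '\u0000' not in cleaned.text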