meilisearch / docs-scraper

Scrape documentation into Meilisearch
https://www.meilisearch.com

null byte issue #474

Open kijung-iM opened 4 months ago

kijung-iM commented 4 months ago

Description: Null byte characters end up in HTML pages generated by Docusaurus when the language is CJK. This problem is also tracked as an issue in the Docusaurus repository.

When such a page is scraped with docs-scraper, nothing gets indexed. Logic to strip the null byte characters is needed.

Example site (scraper output):

Docs-Scraper: https://docs.whatap.io/java/agent-load-amount (0 records)
Docs-Scraper: https://docs.whatap.io/java/agent-dbsql (0 records)
Docs-Scraper: https://docs.whatap.io/java/agent-apdex (0 records)

I worked around the problem by modifying the files as shown below. Please use this as a reference and apply a proper fix.

documentation_spider.py:162

def parse_from_sitemap(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)

    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if (not self.force_sitemap_urls_crawling) and (
            not self.is_rules_compliant(response)):
        print("\033[94m> Ignored from sitemap:\033[0m " + response.url)
    else:
        # self.add_records(response, from_sitemap=True)
        self.add_records(response.replace(body=response_text), from_sitemap=True)
        # We don't return self.parse(response) in order to avoid crawling those web pages

def parse_from_start_url(self, response):
    if self.reason_to_stop is not None:
        raise CloseSpider(reason=self.reason_to_stop)

    # remove null byte
    response_text = response.text.replace('\u0000', '')

    if self.is_rules_compliant(response):
        self.add_records(response, from_sitemap=False)
    else:
        print("\033[94m> Ignored: from start url\033[0m " + response.url)

    # return self.parse(response)
    return self.parse(response.replace(body=response_text))

custom_downloader_middleware.py:37

# body = self.driver.page_source.encode('utf-8')
# remove null byte
body = self.driver.page_source.replace('\u0000', '')
body = body.encode('utf-8')  # UTF-8 encoding
url = self.driver.current_url

default_strategy.py:37

if self._body_contains_stop_content(response):
    return []

# remove null byte
cleaned_body = response.text.replace('\u0000', '')

self.dom = self.get_dom(response.replace(body=cleaned_body.encode('utf-8')))
self.dom = self.remove_from_dom(self.dom, self.config.selectors_exclude)

records = self.get_records_from_dom(response.url)
return records
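
For what it's worth, the same stripping logic is repeated in all three files. A possible consolidation (my own sketch, not part of the patches above; the helper name strip_null_bytes and its module are hypothetical) would be a single helper that every call site can use:

# null_bytes.py (hypothetical helper module) -- minimal sketch, assuming the
# same Scrapy TextResponse.replace(body=...) call used in the patches above.

NULL_BYTE = '\u0000'


def strip_null_bytes(response):
    """Return a copy of the response with all null bytes removed from its body.

    Returns the original response unchanged when no null byte is present,
    to avoid copying the body unnecessarily.
    """
    if NULL_BYTE not in response.text:
        return response
    cleaned = response.text.replace(NULL_BYTE, '')
    return response.replace(body=cleaned.encode('utf-8'))

parse_from_sitemap, parse_from_start_url, and the default strategy could then call strip_null_bytes(response) instead of repeating the replace() logic inline.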
tats-u commented 1 month ago

Issue in Docusaurus: https://github.com/facebook/docusaurus/issues/9985

tats-u commented 1 month ago

Possibly related to https://github.com/scrapy/parsel/issues/123
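
For reference, a minimal reproduction sketch of the parsing failure (my own example, assuming only that parsel is installed; it is not taken from either linked issue):

# repro.py -- hypothetical minimal reproduction: feed a null byte to parsel,
# which parses via lxml and may reject or mishandle NULL characters.
from parsel import Selector

html = '<html><body><p>before\u0000after</p></body></html>'

try:
    sel = Selector(text=html)
    # If parsing succeeds, check whether the text survived intact.
    print(sel.css('p::text').getall())
except ValueError as exc:
    # lxml typically refuses unicode input containing NULL bytes.
    print('parse failed:', exc)

# Stripping the null byte first, as in the patches above, avoids the problem.
print(Selector(text=html.replace('\u0000', '')).css('p::text').getall())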