With the default template, the worker will crawl the website, keeping only the pages whose domain matches the URLs given in the parameters. It will not try to scrape external links or files, and it will skip paginated pages (like `/page/1`); both behaviors are sketched in code after the field list below. For each scrapable page, it will scrape the data by trying to create blocks of titles and text. Each block will contain:
- `h1`: The title of the block
- `h2`: The sub-title of the block
- `h3`...`h6`: The deeper sub-titles of the block
- `p`: The text of the block (an array of strings if the block contains multiple `p` elements)
- `page_block`: The block number within the page (starting at 0)
- `title`: The title of the page, taken from the `head` tag
- `uid`: A generated, incremental uid for the block
- `url`: The URL of the page
- `anchor`: The anchor of the block (the id of the lowest-level title in the block)
- `meta`: The page metadata from the `head` tag (a JSON object containing the description, keywords, author, twitter, og, etc.)
- `image_url`: The `og:image` or `twitter:image`, if present on the page
- `url_tags`: The URL pathname split by `/` (array of strings); the last element is removed because it is the page name
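The exact shape of a block is only described by the field list above; the following TypeScript interface and sample object are a hypothetical sketch of what one scraped block might look like. The interface name `DocumentBlock`, the field types, and the sample values are assumptions made for illustration.

```ts
// Hypothetical shape for one scraped block, assembled from the field list
// above; types are assumptions based on the descriptions, not the worker's
// actual output schema.
interface DocumentBlock {
  h1?: string;               // title of the block
  h2?: string;               // sub-title of the block
  h3?: string;               // deeper sub-titles, down to h6
  h4?: string;
  h5?: string;
  h6?: string;
  p?: string | string[];     // block text, an array when the block has several <p>
  page_block: number;        // block index within the page, starting at 0
  title: string;             // <title> from the page <head>
  uid: string;               // generated, incremental uid for the block
  url: string;               // URL of the page
  anchor?: string;           // id of the lowest-level title in the block
  meta?: Record<string, unknown>; // description, keywords, author, twitter, og, ...
  image_url?: string;        // og:image or twitter:image when present
  url_tags: string[];        // URL pathname split by "/", last element dropped
}

// Illustrative block for a page at https://example.com/docs/guides/install:
const block: DocumentBlock = {
  h1: "Installation",
  h2: "Prerequisites",
  p: ["You need Node.js 18 or later.", "A package manager such as npm."],
  page_block: 0,
  title: "Installation | Example Docs",
  uid: "0",
  url: "https://example.com/docs/guides/install",
  anchor: "prerequisites",
  meta: { description: "How to install the project." },
  image_url: "https://example.com/og.png",
  url_tags: ["docs", "guides"],
};
```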
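The domain filtering and pagination skipping described at the start of this section can be summarized as follows. This is a minimal sketch assuming a same-hostname check and a `/page/N` pattern; the function name `shouldCrawl` and the exact matching rules are illustrative, not the worker's real implementation.

```ts
// Sketch of the default crawl filtering: keep same-domain pages,
// skip external links and paginated pages such as /page/1.
function shouldCrawl(candidate: string, startUrls: string[]): boolean {
  const url = new URL(candidate);

  // Keep only pages whose hostname matches one of the start URLs.
  const sameDomain = startUrls.some(
    (start) => new URL(start).hostname === url.hostname
  );
  if (!sameDomain) return false;

  // Skip paginated pages (assumed /page/<number> pattern).
  if (/\/page\/\d+/.test(url.pathname)) return false;

  return true;
}

// Usage:
// shouldCrawl("https://example.com/docs/intro", ["https://example.com"]);   // true
// shouldCrawl("https://other.com/docs", ["https://example.com"]);           // false
// shouldCrawl("https://example.com/blog/page/2", ["https://example.com"]);  // false
```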
Indexed with the following settings: