meilisearch / scrapix

MIT License

Handle the content template (default) #6

Closed qdequele closed 1 year ago

qdequele commented 1 year ago

With the default template, the worker crawls the website and keeps only the pages that are on the same domain as the URLs given in the parameters. It does not try to scrape external links or files, and it skips paginated pages (like /page/1). For each scrapable page, it extracts the data by building blocks of titles and text. Each block will contain:

The blocks are indexed with the following settings:

{
  "searchableAttributes": [
    "h1",
    "h2",
    "h3",
    "h4",
    "h5",
    "h6",
    "p",
    "title",
    "meta.description"
  ],
  "filterableAttributes": ["urls_tags"],
  "distinctAttribute": "url"
}
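
The crawl rules described above (same domain only, no files, no paginated pages) could be sketched roughly as below. This is only an illustration, not the actual scrapix implementation: the function name `shouldScrape`, the file-extension list, and the `/page/<number>` pattern are all assumptions made for the example.

```typescript
// Hypothetical helper, not taken from the scrapix code base.
// Assumed list of file extensions to skip; the real list may differ.
const FILE_EXTENSIONS = /\.(pdf|zip|png|jpe?g|gif|svg|mp4|csv)$/i;
// Assumed shape of a paginated path, based on the "/page/1" example.
const PAGINATED_PATH = /\/page\/\d+\/?$/i;

function shouldScrape(candidate: string, startUrls: string[]): boolean {
  let url: URL;
  try {
    url = new URL(candidate);
  } catch {
    return false; // not an absolute, parseable URL
  }

  // Keep only pages whose hostname matches one of the start URLs.
  const allowedHosts = new Set(startUrls.map((u) => new URL(u).hostname));
  if (!allowedHosts.has(url.hostname)) return false;

  // Skip files and paginated pages.
  if (FILE_EXTENSIONS.test(url.pathname)) return false;
  if (PAGINATED_PATH.test(url.pathname)) return false;

  return true;
}
```

For example, with `["https://example.com"]` as the start URLs, `https://example.com/docs/intro` would be kept, while `https://other.com/docs`, `https://example.com/page/1` and `https://example.com/files/report.pdf` would be skipped.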