meilisearch / docs-scraper

Scrape documentation into Meilisearch
https://www.meilisearch.com

Ability to update documents only from the same domain #20

Closed: renehernandez closed this issue 4 years ago

renehernandez commented 4 years ago

I have set up Meilisearch for a multi-site search frontend at work recently, and I am using the docs-scraper image to push the scraped data from each of the sites to the Meilisearch server. Out of the box, it works great!!

As we are planning to index more documents from different internal websites, I am running into the problem of needing different configuration files to scrape sites with different layouts. The current scraping logic does not support this, because every time the scraper image runs, it deletes everything from the docs index.

Having the ability to update only the documents that match the domain being scraped (probably by deleting and re-adding them) would allow multiple config files to be used with the scraper image.

I would love to take a stab at this, but I would probably need some pointers first on how to do it.

curquiza commented 4 years ago

Hey @renehernandez! Thanks for using this tool! 😁

If I understand your issue, you want to scrape data from different websites. But these websites need different config files (for the scraper), and you would like to have all your documents in the same index -> which is not currently possible because, as you noticed, each time the scraper scrapes a website, it deletes the previous documents.

Indeed, the scraper is not currently designed to keep data in the index across scrapings, because it cannot know whether it's scraping the same website (in which case it needs to delete all the previous documents) or a new website (in which case it should only add documents).

Here are the solutions I suggest for you:

- implement the behavior you describe, so that the scraper only replaces the documents from the domain it is currently scraping
- create one MeiliSearch index per website, each scraped with its own config file

I'm going to develop the second option. I don't really know your use case yet, but you probably want to search in all your scraped data (from different websites) at the same time. Having multiple indexes in MeiliSearch therefore means you need to search in multiple indexes when performing requests. Currently, MeiliSearch does not provide a user-friendly way to search in multiple indexes, but that does not mean you can't πŸ™‚ The best way to do a multi-index search is to call each index independently and retrieve the results one by one; this way, the longest search does not hold up the others. Anyway, the multi-search functionality should be added soon, because you're not the first one to need it πŸ˜‡ I'll keep you posted about that because it could be useful for you!
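For illustration, such a multi-index search could look like the sketch below. It uses the meilisearch-python client; the host, API key, and index names are placeholder assumptions, not values from this thread.

```python
import meilisearch

# Placeholder connection details -- adjust to your deployment.
client = meilisearch.Client("http://127.0.0.1:7700", "apiKey")

# One index per scraped website (hypothetical index names).
INDEX_UIDS = ["docs-site1", "docs-site2"]

def multi_index_search(query):
    """Query each index independently and collect the hits per index."""
    results = {}
    for uid in INDEX_UIDS:
        results[uid] = client.index(uid).search(query)["hits"]
    return results

print(multi_index_search("getting started"))
```

In practice you could also fire these requests concurrently (e.g. from a thread pool) so that the slowest index does not delay the results of the others.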

Hope my answer helped you! Thanks again for your feedback and for using MeiliSearch 😁

renehernandez commented 4 years ago

@curquiza Thanks for the detailed answer! Yes, as you said, my problem with multiple indexes is that it is not easy to search against all of them at the same time. The problem is compounded since I am using docs-searchbar.js to provide a search endpoint, and it only supports specifying a single index name to search against.

The problem is not different websites per se; it is different websites having different layouts, so the selectors section cannot be reused to scrape the data across all sites. Therefore, it requires me to have different config files.

Follow-up points

- Is it really more performant to delete all the documents and re-index everything, instead of just deleting the documents that belong to a certain site and adding them again with the updated content?
- How would sitemaps help in this situation?

I can see 2 variants that I would like to propose:

```json
{
  "index_uid": "docs",
  "start_urls": [
    "https://example1.com",
    "https://example2.com"
  ],
  "selectors": [
    {
      "site_urls": ["https://example1.com"],
      "lvl0": {
        "selector": ".wy-menu-vertical a.current",
        "global": true,
        "default_value": "Documentation"
      },
      "lvl1": ".section h1",
      "lvl2": ".section h2",
      "lvl3": ".section h3",
      "lvl4": ".section h4",
      "lvl5": ".section h5",
      "lvl6": ".section h6",
      "text": ".section p, .section li, .section blockquote, .section pre"
    },
    {
      "site_urls": ["https://example2.com"],
      "lvl0": {
        "selector": ".main",
        "global": true,
        "default_value": "Documentation"
      },
      "lvl1": ".article h1",
      "lvl2": ".article h2",
      "lvl3": ".article h3",
      "lvl4": ".article h4",
      "lvl5": ".article h5",
      "lvl6": ".article h6",
      "text": ".article p, .article li, .article blockquote, .article pre"
    }
  ]
}
```

The 2 suggestions above seem orthogonal to each other, and I think they would add a lot of flexibility to the docs-scraper image.

curquiza commented 4 years ago

> The problem is not different websites per se; it is different websites having different layouts, so the selectors section cannot be reused to scrape the data across all sites. Therefore, it requires me to have different config files.

Okaaaaay, I hope I've understood your problem better this time, because I might have found your solution in the config file! Here's the kind of thing you could do:

```json
{
  ...
  "start_urls": [
    "http://www.example.com/docs/",
    {
      "url": "http://www.example.com/docs/concepts/",
      "selectors_key": "concepts"
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "selectors_key": "contributors"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": ".main h1",
      "lvl1": ".main h2",
      "lvl2": ".main h3",
      "lvl3": ".main h4",
      "lvl4": ".main h5",
      "text": ".main p"
    },
    "concepts": {
      "lvl0": ".header h2",
      "lvl1": ".main h1.title",
      "lvl2": ".main h2.title",
      "lvl3": ".main h3.title",
      "lvl4": ".main h5.title",
      "text": ".main p"
    },
    "contributors": {
      "lvl0": ".main h1",
      "lvl1": ".contributors .name",
      "lvl2": ".contributors .title",
      "text": ".contributors .description"
    }
  }
  ...
}
```

Here, all documentation pages will use the selectors defined in selectors.default, while the pages under ./concepts will use selectors.concepts and those under ./contributors will use selectors.contributors.

I haven't taken the time yet to really test this feature (I mean, we don't use it in production), so I haven't written any documentation about it. But I plan to πŸ™‚

> How would sitemaps help in this situation?

If you provide the sitemaps of each of your websites, the scraper no longer needs to crawl your pages link by link: it will only visit the links found in the sitemaps. See the sitemap of our docs that we use for scraping. I'm not sure it's your use case because I don't have the details, but that's a suggestion. Of course, you would need to provide all the URLs of your different websites in start_urls too. But I hope my first suggestion will be better for you.
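As a rough sketch, such a config might look like this (assuming the sitemap_urls key from the DocSearch-style config format this scraper derives from; the URLs are placeholders):

```json
{
  "index_uid": "docs",
  "sitemap_urls": [
    "https://example1.com/sitemap.xml",
    "https://example2.com/sitemap.xml"
  ],
  "start_urls": [
    "https://example1.com",
    "https://example2.com"
  ],
  "selectors": {
    ...
  }
}
```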


About your suggestions, and if you are curious about how the scraper works, you can read the rest of my answer.

> Is it really more performant to delete all the documents and re-index everything, instead of just deleting the documents that belong to a certain site and adding them again with the updated content?

Yes, it's more performant to do what we do right now! The scraper does not just delete the documents at each scraping: it deletes the complete index. Why? Because it's faster to delete the index and then add all the new documents than to delete all the documents, wait, and then add the new documents. About your suggestion: checking the documents one by one and deleting them would take too much time, especially because it would require another request to fetch the documents before deleting them.
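In client terms, the idea is roughly the following sketch (using the meilisearch-python client; the index name, documents, and connection details are placeholders, and this is an outline of the approach rather than the scraper's actual code):

```python
import meilisearch

# Placeholder connection details -- adjust to your deployment.
client = meilisearch.Client("http://127.0.0.1:7700", "apiKey")

def reindex(documents):
    # One call drops the whole index at once, instead of fetching
    # the existing documents and deleting them one by one (an extra
    # round trip per batch). Assumes the index already exists.
    client.index("docs").delete()

    # Adding documents to a missing index recreates it, so the fresh
    # batch of scraped documents lands in a clean index.
    client.index("docs").add_documents(documents)

reindex([{"objectID": "1", "content": "Hello from example1.com"}])
```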

Your first suggestion is doable πŸ™‚ even if the config file example is not exactly right. It should look more like this:

```json
{
  "index_uid": "docs",
  "configs": [
    {
      "start_urls": [...],
      ...
    },
    {
      "start_urls": [...],
      ...
    }
  ]
}
```

As you can see, this solution would be a breaking change.


Tell me if the solution I provide fits your use case πŸ™‚

Edit

I've opened an issue to update the README and provide better documentation for the scraper. See #21

renehernandez commented 4 years ago

@curquiza I will explore using named selectors with the selectors_key in the URL objects. I believe that would solve my use case, as long as it can be applied to completely different websites, not just different pages of the same website. I will get to it later this week and close the issue if everything works out correctly.
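For reference, applying that feature across two different domains would presumably look something like the sketch below. It extrapolates from the config shown above, and whether selectors_key behaves the same across domains is exactly what is being verified here; the domains and key names are placeholders.

```json
{
  "index_uid": "docs",
  "start_urls": [
    {
      "url": "https://example1.com",
      "selectors_key": "site1"
    },
    {
      "url": "https://example2.com",
      "selectors_key": "site2"
    }
  ],
  "selectors": {
    "default": { ... },
    "site1": {
      "lvl0": ".wy-menu-vertical a.current",
      "lvl1": ".section h1",
      "text": ".section p, .section li"
    },
    "site2": {
      "lvl0": ".main",
      "lvl1": ".article h1",
      "text": ".article p, .article li"
    }
  }
}
```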

curquiza commented 4 years ago

Waiting for your feedback 😁 thanks!

renehernandez commented 4 years ago

@curquiza That works just fine!! Thanks for all the help