meilisearch / docs-scraper

Scrape documentation into Meilisearch
https://www.meilisearch.com

Ability to update documents only from the same domain #20

Closed: renehernandez closed this issue 4 years ago

renehernandez commented 4 years ago

I have set up Meilisearch for a multi-site search frontend at work recently, and I am using the docs-scraper image to push the scraped data from each of the sites to the Meilisearch server. Out of the box, it works great!!

As we are planning to index more documents from different internal websites, I am running into the problem of needing different configuration files to scrape sites with different layouts. The current scraping logic does not support this, because every time the scraper image runs, it deletes everything from the docs index.

Having the ability to update only the documents that match the domain being scraped (probably by deleting and re-adding them) would allow multiple config files to be used with the scraper image.

I would love to take a stab at this, but I would probably need some pointers first on how to do it.

curquiza commented 4 years ago

Hey @renehernandez! Thanks for using this tool! 😁

If I understand your issue, you want to scrape data from different websites. But these websites need different config files (for the scraper), and you would like to have all your documents in the same index -> which is not currently possible because, as you noticed, each time the scraper scrapes a website, it deletes the previous documents.

Indeed, the scraper is not currently designed to keep data in the index across scrapings, because it cannot know whether it's scraping the same website (in which case it needs to delete all the previous documents) or a new website (in which case it should only add documents).

Here are the solutions I suggest for you:

- implement the behavior you describe, so that the scraper only replaces the documents from the domain it is currently scraping
- create one MeiliSearch index per website, each scraped with its own config file

I'm going to develop the second option. I don't really know your use case yet, but you probably want to search in all your scraped data (from different websites) at the same time. Having multiple indexes in MeiliSearch therefore means you need to search in multiple indexes when performing requests. Currently, MeiliSearch does not provide a user-friendly way to search in multiple indexes, but that does not mean you can't πŸ™‚ The best way to do a multi-index search is to call each index independently and retrieve the results one by one; this way, the longest search does not hold up the others. Anyway, the multi-search functionality should be added soon, because you're not the first one to need it πŸ˜‡ I'll keep you posted about that because it could be useful for you!
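For illustration, such a multi-index search could look like the sketch below. It uses the meilisearch-python client; the host, API key, and index names are placeholder assumptions, not values from this thread.

```python
import meilisearch

# Placeholder connection details -- adjust to your deployment.
client = meilisearch.Client("http://127.0.0.1:7700", "apiKey")

# One index per scraped website (hypothetical index names).
INDEX_UIDS = ["docs-site1", "docs-site2"]

def multi_index_search(query):
    """Query each index independently and collect the hits per index."""
    results = {}
    for uid in INDEX_UIDS:
        results[uid] = client.index(uid).search(query)["hits"]
    return results

print(multi_index_search("getting started"))
```

In practice you could also fire these requests concurrently (e.g. from a thread pool) so that the slowest index does not delay the results of the others.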

Hope my answer helped you! Thanks again for your feedback and for using MeiliSearch 😁

renehernandez commented 4 years ago

@curquiza Thanks for the detailed answer! Yes, as you said, my problem with multiple indexes is that it is not easy to search against all of them at the same time. The problem is compounded since I am using docs-searchbar.js to provide a search endpoint, and it only supports specifying a single index name to search against.

The problem is not different websites per se; it is different websites having different layouts, so the selectors section cannot be reused to scrape the data across all sites. Therefore, it requires me to have different config files.

Follow-up points

- Is it really more performant to delete all the documents and re-index everything, instead of just deleting the documents that belong to a certain site and adding them again with the updated content?
- How would sitemaps help in this situation?

I can see 2 variants that I would like to propose:

```json
{
  "index_uid": "docs",
  "start_urls": [
    "https://example1.com",
    "https://example2.com"
  ],
  "selectors": [
    {
      "site_urls": ["https://example1.com"],
      "lvl0": {
        "selector": ".wy-menu-vertical a.current",
        "global": true,
        "default_value": "Documentation"
      },
      "lvl1": ".section h1",
      "lvl2": ".section h2",
      "lvl3": ".section h3",
      "lvl4": ".section h4",
      "lvl5": ".section h5",
      "lvl6": ".section h6",
      "text": ".section p, .section li, .section blockquote, .section pre"
    },
    {
      "site_urls": ["https://example2.com"],
      "lvl0": {
        "selector": ".main",
        "global": true,
        "default_value": "Documentation"
      },
      "lvl1": ".article h1",
      "lvl2": ".article h2",
      "lvl3": ".article h3",
      "lvl4": ".article h4",
      "lvl5": ".article h5",
      "lvl6": ".article h6",
      "text": ".article p, .article li, .article blockquote, .article pre"
    }
  ]
}
```

The 2 suggestions above seem orthogonal to each other, and I think they would add a lot of flexibility to the docs-scraper image.

curquiza commented 4 years ago

> The problem is not different websites per se; it is different websites having different layouts, so the selectors section cannot be reused to scrape the data across all sites. Therefore, it requires me to have different config files.

Okaaaaay, I hope I've understood your problem better this time, because I might have found your solution in the config file! Here's the kind of thing you could do:

```json
{
  ...
  "start_urls": [
    "http://www.example.com/docs/",
    {
      "url": "http://www.example.com/docs/concepts/",
      "selectors_key": "concepts"
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "selectors_key": "contributors"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": ".main h1",
      "lvl1": ".main h2",
      "lvl2": ".main h3",
      "lvl3": ".main h4",
      "lvl4": ".main h5",
      "text": ".main p"
    },
    "concepts": {
      "lvl0": ".header h2",
      "lvl1": ".main h1.title",
      "lvl2": ".main h2.title",
      "lvl3": ".main h3.title",
      "lvl4": ".main h5.title",
      "text": ".main p"
    },
    "contributors": {
      "lvl0": ".main h1",
      "lvl1": ".contributors .name",
      "lvl2": ".contributors .title",
      "text": ".contributors .description"
    }
  }
  ...
}
```

Here, all documentation pages will use the selectors defined in selectors.default, while the pages under ./concepts will use selectors.concepts and those under ./contributors will use selectors.contributors.

I haven't taken the time yet to really test this feature (I mean, we don't use it in production), so I haven't written any documentation about it. But I plan to πŸ™‚

> How would sitemaps help in this situation?

If you provide the sitemaps of each of your websites, the scraper no longer needs to crawl your pages link by link: it will only visit the links found in the sitemaps. See the sitemap of our docs that we use for scraping. I'm not sure it's your use case because I don't have the details, but that's a suggestion. Of course, you would need to provide all the URLs of your different websites in start_urls too. But I hope my first suggestion will be better for you.
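As a rough sketch, such a config might look like this (assuming the sitemap_urls key from the DocSearch-style config format this scraper derives from; the URLs are placeholders):

```json
{
  "index_uid": "docs",
  "sitemap_urls": [
    "https://example1.com/sitemap.xml",
    "https://example2.com/sitemap.xml"
  ],
  "start_urls": [
    "https://example1.com",
    "https://example2.com"
  ],
  "selectors": {
    ...
  }
}
```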


About your suggestions, and if you are curious about how the scraper works, you can read the rest of my answer.

> Is it really more performant to delete all the documents and re-index everything, instead of just deleting the documents that belong to a certain site and adding them again with the updated content?

Yes, it's more performant to do what we do right now! The scraper does not just delete the documents at each scraping: it deletes the complete index. Why? Because it's faster to delete the index and then add all the new documents than to delete all the documents, wait, and then add the new documents. About your suggestion: checking the documents one by one and deleting them would take too much time, especially because it would require another request to fetch the documents before deleting them.
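In client terms, the idea is roughly the following sketch (using the meilisearch-python client; the index name, documents, and connection details are placeholders, and this is an outline of the approach rather than the scraper's actual code):

```python
import meilisearch

# Placeholder connection details -- adjust to your deployment.
client = meilisearch.Client("http://127.0.0.1:7700", "apiKey")

def reindex(documents):
    # One call drops the whole index at once, instead of fetching
    # the existing documents and deleting them one by one (an extra
    # round trip per batch). Assumes the index already exists.
    client.index("docs").delete()

    # Adding documents to a missing index recreates it, so the fresh
    # batch of scraped documents lands in a clean index.
    client.index("docs").add_documents(documents)

reindex([{"objectID": "1", "content": "Hello from example1.com"}])
```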

Your first suggestion is doable πŸ™‚ even if the config file example is not exactly right. It should look more like this:

```json
{
  "index_uid": "docs",
  "configs": [
    {
      "start_urls": [...],
      ...
    },
    {
      "start_urls": [...],
      ...
    }
  ]
}
```

As you can see, this solution would be a breaking change.


Tell me if the solution I provide fits your use case πŸ™‚

Edit

I've opened an issue to update the README and provide better documentation for the scraper. See #21

renehernandez commented 4 years ago

@curquiza I will explore using named selectors with the selectors_key in the URL objects. I believe that would solve my use case, as long as it can be applied to completely different websites, not just different pages of the same website. I will get to it later this week and close the issue if everything works out correctly.
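For reference, applying that feature across two different domains would presumably look something like the sketch below. It extrapolates from the config shown above, and whether selectors_key behaves the same across domains is exactly what is being verified here; the domains and key names are placeholders.

```json
{
  "index_uid": "docs",
  "start_urls": [
    {
      "url": "https://example1.com",
      "selectors_key": "site1"
    },
    {
      "url": "https://example2.com",
      "selectors_key": "site2"
    }
  ],
  "selectors": {
    "default": { ... },
    "site1": {
      "lvl0": ".wy-menu-vertical a.current",
      "lvl1": ".section h1",
      "text": ".section p, .section li"
    },
    "site2": {
      "lvl0": ".main",
      "lvl1": ".article h1",
      "text": ".article p, .article li"
    }
  }
}
```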

curquiza commented 4 years ago

Waiting for your feedback 😁 thanks!

renehernandez commented 4 years ago

@curquiza That works just fine!! Thanks for all the help