meilisearch / scrapix


Ensure same documents are not pushed more than once. #52

Open bidoubiwa opened 1 year ago

bidoubiwa commented 1 year ago

Context

Some websites have multiple URLs pointing to the same page; the OpenAI documentation is one example.

Problem

Since the crawler has no way of knowing it has already scraped those pages, it scrapes them again, so the same documents end up in the index multiple times.

The current workaround is to add `distinctAttribute: "content"` to the Meilisearch settings of your Scrapix configuration.
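A minimal sketch of that workaround, written as a TypeScript object (Scrapix configs are normally JSON). Every field except `distinctAttribute` is illustrative, and the `meilisearch_settings` key is assumed from the issue text rather than verified against the repo:

```ts
// Hypothetical Scrapix config — only the distinctAttribute line is the point.
const config = {
  start_urls: ['https://www.example.com/docs/'],
  meilisearch_url: 'http://localhost:7700',
  meilisearch_api_key: 'masterKey',
  meilisearch_index_uid: 'docs',
  meilisearch_settings: {
    // Meilisearch returns at most one search result per distinct value,
    // so duplicate pages with identical content collapse into one hit.
    distinctAttribute: 'content',
  },
}
```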

Solution

The long-term solution would be to add a new field to each document containing a hash of its relevant fields: for example, a `section_hash` field holding a hash of all the fields that make up a section (see the sketch below).
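A minimal sketch of what computing that hash could look like, using Node's built-in `crypto` module; the document shape and every field name except `section_hash` are hypothetical:

```ts
import { createHash } from 'node:crypto'

// Hypothetical shape of a scraped document — field names are illustrative.
type ScrapedDocument = {
  url: string
  title: string
  content: string
  section_hash?: string
}

// Fingerprint the fields that define "the same section". The url is
// deliberately excluded, so duplicate URLs serving identical content
// produce identical hashes.
function withSectionHash(doc: ScrapedDocument): ScrapedDocument {
  const hash = createHash('sha256')
    .update(`${doc.title}\u0000${doc.content}`) // separator avoids boundary collisions
    .digest('hex')
  return { ...doc, section_hash: hash }
}
```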

We would then set `section_hash` as the default `distinctAttribute`, for example here: https://github.com/meilisearch/scrapix/blob/070c9074b8b313de8714575da7941054c7100ce5/src/scrapers/docssearch.ts#L13

But also in the default strategy.
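For illustration, applying that default through the `meilisearch` JS client could look like this; host, API key, and index name are placeholders:

```ts
import { MeiliSearch } from 'meilisearch'

const client = new MeiliSearch({
  host: 'http://localhost:7700',
  apiKey: 'masterKey',
})

// With section_hash as the distinct attribute, Meilisearch returns at most
// one search result per hash value, so duplicated pages surface only once.
await client.index('docs').updateSettings({
  distinctAttribute: 'section_hash',
})
```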

qdequele commented 1 year ago

Not an easy one without slowing down the process a lot or having a hidden field on the document. 🤔