jaeksoft / opensearchserver

Open-source Enterprise Grade Search Engine Software
http://www.opensearchserver.com
Apache License 2.0

Duplicate documents after database crawl #1698

Open ghost opened 8 years ago

ghost commented 8 years ago

Hey, I have set up OSS (via the .deb package) with a database crawler, and it worked perfectly until I ran a second crawl. Every database crawl re-creates all documents, so after two crawls every document exists twice, and each additional crawl adds one more search result for the same item. Is there a way to deduplicate documents in OSS, or have I set up the database crawler incorrectly? For now I have deleted all documents and re-crawled once.

Thanks in advance and have a nice day!

emmanuel-keller commented 8 years ago
  1. Choose a unique field in the schema (as shown in the screenshot).
  2. In the database crawler, map an SQL column to that unique field.
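
To illustrate why this should fix the duplicates, here is a minimal sketch in plain Python. It does not use the OSS API: `run_crawl`, the `id` column, and the dict-based index are all hypothetical stand-ins. It only models the idea that documents sharing a unique key are overwritten rather than added again.

```python
from typing import Optional


def run_crawl(index: dict, rows: list, unique_field: Optional[str] = None) -> dict:
    """Toy model of one database crawl writing rows into an index (not OSS code)."""
    for row in rows:
        if unique_field is None:
            # No unique field: each row gets a fresh key, so it is always added.
            key = f"auto-{len(index)}"
        else:
            # Unique field set: the key comes from the mapped SQL column,
            # so a row with the same value overwrites the earlier document.
            key = f"key-{row[unique_field]}"
        index[key] = dict(row)
    return index


# Hypothetical rows returned by the SQL query; "id" stands in for the mapped column.
rows = [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}]

without_key: dict = {}
run_crawl(without_key, rows)
run_crawl(without_key, rows)  # second crawl adds every row again

with_key: dict = {}
run_crawl(with_key, rows, unique_field="id")
run_crawl(with_key, rows, unique_field="id")  # second crawl overwrites

print(len(without_key), len(with_key))  # 4 2
```

In this model, two crawls without a unique field leave four documents, while two crawls keyed on `id` leave two.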

[Screenshot: screen shot 2016-01-04 at 23 21 15]