Delete selected URL then synchronize using scheduler truncate the index

jaeksoft / opensearchserver

Open-source Enterprise Grade Search Engine Software

http://www.opensearchserver.com

Apache License 2.0


Delete selected URL then synchronize using scheduler truncate the index #482

Open L2B opened 10 years ago

L2B commented 10 years ago

Hi,

I have two "Web crawler - URL database" tasks in my scheduler:

1 - Delete the selected URLs with Fetch status = Gone
2 - Synchronize the URLs with Fetch status = Fetched => this truncates the index

I have tested a scheduler with only the first task, and it correctly deletes the gone URLs.

OSS: 1.5.2

AlexandreToyer commented 10 years ago

Hello,

Could you please run the Web crawler - URL database task with these options:

Command: Synchronize
Robots.txt: All
Fetch status: Fetched
Parser status: Parsed
Index status: Indexed

The selection of URLs must be done precisely, otherwise every document would be deleted from the index.
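
To illustrate why the selection matters, here is a minimal sketch of the behaviour described above. It assumes that Synchronize keeps only the indexed documents whose URL is part of the selection. The record layout and the function names `select_urls` and `synchronize` are hypothetical and are not OpenSearchServer code.

```python
# Hypothetical model of the Synchronize command, for illustration only.
# Assumption: documents whose URL is not in the selection leave the index.

def select_urls(url_db, fetch, parser, index):
    """Return the URLs whose three statuses all match the criteria."""
    return {
        rec["url"]
        for rec in url_db
        if rec["fetch"] == fetch
        and rec["parser"] == parser
        and rec["index"] == index
    }


def synchronize(index_urls, url_db, fetch, parser, index):
    """Return the index URLs that survive a synchronize pass."""
    return set(index_urls) & select_urls(url_db, fetch, parser, index)


url_db = [
    {"url": "http://example.com/a", "fetch": "Fetched",
     "parser": "Parsed", "index": "Indexed"},
    {"url": "http://example.com/b", "fetch": "Gone",
     "parser": "Parsed", "index": "Indexed"},
]
index_urls = {"http://example.com/a", "http://example.com/b"}

# Precise selection: the fetched document stays, the gone one is dropped.
precise = synchronize(index_urls, url_db, "Fetched", "Parsed", "Indexed")

# Imprecise selection: the fetched document is dropped from the index.
imprecise = synchronize(index_urls, url_db, "Gone", "Parsed", "Indexed")
```

In this model, a selection that matches none of the indexed URLs leaves the index empty, which would look like a truncated index.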

I tested it this way and it is working properly.

You could also use the Crawler / Web / URL browser before running the scheduler task to check which URLs match the criteria (fetched, parsed, indexed) and compare them to the URLs of the documents stored in your index. You should of course find some matches.

One last thing: this Synchronize task uses the first mapping for url found in Crawler / Web / Field mapping. You must ensure that this tab contains a mapping between the url piece of information and one field of your schema.

Regards, Alexandre

L2B commented 10 years ago

Hi Alex,

The synchronization process works fine with all the parameters set, so it will keep my index clean.

I'm still having some trouble with the "Delete selection" command run through the scheduler. Maybe I have misunderstood something.

In my Web crawler - URL database I have the following task:

- Command: Delete selection
- Robots.txt: All
- Fetch status: Gone
- Parser status: Parsed
- Index status: Indexed

In my URL browser, 7 URLs are returned by this search, so my scheduler should remove these URLs from the URL database.

When I launch my task, these 7 URLs are not deleted. Instead, it deletes all URLs with Fetch status = Fetched, Parser status = Parsed, and Index status = Indexed.

What I'm trying to do is keep my URL database clean by automatically removing all gone URLs.

Regards, Laurent
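
To make the expected behaviour concrete, here is a minimal sketch of what I expect "Delete selection" to do. The record layout and the function name `delete_selection` are hypothetical and are not OpenSearchServer code. Only the records matching all three criteria should leave the URL database.

```python
# Hypothetical model of the expected "Delete selection" behaviour.
# Assumption: a record is removed only if all three statuses match.

def delete_selection(url_db, fetch, parser, index):
    """Return the URL database without the records matching the criteria."""
    return [
        rec for rec in url_db
        if not (
            rec["fetch"] == fetch
            and rec["parser"] == parser
            and rec["index"] == index
        )
    ]


url_db = [
    {"url": "http://example.com/a", "fetch": "Fetched",
     "parser": "Parsed", "index": "Indexed"},
    {"url": "http://example.com/b", "fetch": "Gone",
     "parser": "Parsed", "index": "Indexed"},
]

# Expected: only the gone URL is removed, the fetched URL is kept.
remaining = delete_selection(url_db, "Gone", "Parsed", "Indexed")
remaining_urls = [rec["url"] for rec in remaining]
```

What I observe is the opposite: the Gone record stays and the Fetched record is removed.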

AlexandreToyer commented 10 years ago

Hi Laurent,

I am not able to reproduce this issue in my environment. Are you sure that some URLs are in a Gone status and also parsed and indexed?

Can you try with the latest version of OpenSearchServer? http://www.open-search-server.com/ftp/OpenSearchServer_1.5/build-1.5-b491/

Thank you, Alexandre

L2B commented 10 years ago

Hi Alex,

I confirm the Gone / Parsed / Indexed status. I will download the latest version, rebuild my index, and let you know.

Laurent

[Attached screenshot dated 2014-03-05]