Open L2B opened 10 years ago
Hello,
Could you please run the Web crawler - URL database task with these options:
The URL selection must be precise; otherwise every document would be deleted from the index.
I tested it this way and it is working properly.
You could also use the Crawler / Web / URL browser before running the scheduler task to check which URLs match the criteria (fetched, parsed, indexed) and compare them to the URLs of the documents stored in your index. You should of course find some matches.
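The comparison described above can be sketched in a few lines of Python. This is only an illustration: it assumes both URL lists have been copied out by hand, the function name and sample URLs are hypothetical, and it does not call any OpenSearchServer API.

```python
def compare_urls(selected_urls, indexed_urls):
    """Compare URLs selected in the URL browser with URLs stored in the index.

    Returns a pair of sorted lists:
    - URLs present in both sets (these are the expected matches)
    - URLs selected in the URL browser but absent from the index
    """
    selected = set(selected_urls)
    indexed = set(indexed_urls)
    matching = sorted(selected & indexed)
    selected_only = sorted(selected - indexed)
    return matching, selected_only


# Hypothetical sample data, not taken from a real index.
selected = ["http://example.com/a", "http://example.com/b"]
indexed = ["http://example.com/b", "http://example.com/c"]

matching, selected_only = compare_urls(selected, indexed)
print(matching)       # ['http://example.com/b']
print(selected_only)  # ['http://example.com/a']
```

If the list of matches is empty, the selection and the index do not share URLs, which is worth investigating before running the task.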
One last thing: this Synchronize task uses the first mapping for url found in Crawler / Web / Field mapping. You must ensure that this tab contains a mapping between the url piece of information and one field of your schema.
Regards, Alexandre
Hi Alex,
The synchronisation process works fine with all the parameters, so it will keep my index clean.
I'm still having some trouble with the "Delete selection" command through the scheduler. Maybe I am misunderstanding something.
In my Web crawler - URL database I have the following task:
- Command: Delete selection
- Robots.txt: All
- Fetch status: Gone
- Parser status: Parsed
- Index status: Indexed
In my URL browser, this search returns 7 URLs, so my scheduler should remove these URLs from the URL database.
When I launch my task, these 7 URLs are not deleted; instead, it deletes all URLs with Fetch status = Fetched, Parser status = Parsed, Index status = Indexed.
What I'm trying to do is keep my URL database clean and automatically remove all gone URLs.
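To spell out the behaviour I expect, here is a minimal Python sketch of a selection where every criterion must match at once. It is not OpenSearchServer code: the `UrlEntry` class, the `select` function, the sample URLs and the "Not parsed" / "Not indexed" labels are all hypothetical, and `None` stands for "All".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlEntry:
    url: str
    fetch_status: str
    parser_status: str
    index_status: str


def select(entries, fetch_status=None, parser_status=None, index_status=None):
    """Keep the entries that match every criterion that is set (None = All)."""
    def matches(entry):
        return (
            (fetch_status is None or entry.fetch_status == fetch_status)
            and (parser_status is None or entry.parser_status == parser_status)
            and (index_status is None or entry.index_status == index_status)
        )
    return [entry for entry in entries if matches(entry)]


# Hypothetical URL database content.
entries = [
    UrlEntry("http://example.com/gone", "Gone", "Parsed", "Indexed"),
    UrlEntry("http://example.com/live", "Fetched", "Parsed", "Indexed"),
    UrlEntry("http://example.com/new", "Gone", "Not parsed", "Not indexed"),
]

# Same criteria as the scheduler task above: only the first entry qualifies.
to_delete = select(entries, "Gone", "Parsed", "Indexed")
print([entry.url for entry in to_delete])
```

With this logic, the entry with Fetch status = Fetched is left alone, which is what the URL browser shows and what I expected from the scheduler.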
Regards Laurent
Hi Laurent,
I am not able to reproduce this issue in my environment. Are you sure that some URLs are in a Gone status and also parsed and indexed?
Can you try with the latest version of OpenSearchServer? http://www.open-search-server.com/ftp/OpenSearchServer_1.5/build-1.5-b491/
Thank you, Alexandre
Hi Alex, I confirm the Gone / Parsed / Indexed status. I will download the latest version, rebuild my index, and let you know. Laurent
Hi,
Using the scheduler, I have 2 "Web crawler - URL database" tasks:
1 - I delete the selected URLs with Fetch status = Gone
2 - I synchronize URLs with Fetch status = Fetched => this truncates the index
I have tested a scheduler with only the first task; it correctly deletes the gone URLs.
OSS: 1.5.2