Closed — singhpratyush closed this 8 years ago
Use a priority queue and put erroneous tasks at the least priority. Erroneous tasks won't be taken again and again until all other tasks are complete.
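A minimal sketch of this idea using Python's `heapq`, assuming a `(priority, task)` heap where lower numbers pop first; the priority constants and task names are illustrative, not the project's actual code:

```python
import heapq

FRESH, ERRONEOUS = 0, 1  # erroneous tasks sort after all fresh ones

queue = []
heapq.heappush(queue, (FRESH, "page-a"))
heapq.heappush(queue, (FRESH, "page-b"))

# Suppose crawling "page-a" failed: push it back at least priority.
priority, task = heapq.heappop(queue)
heapq.heappush(queue, (ERRONEOUS, task))

# The fresh "page-b" is now taken before the erroneous "page-a".
order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(order)  # ['page-b', 'page-a']
```

The point is that a failed task is not retried until everything fresh at a better priority has drained.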
If the problem is likely to persist for a long time, threaders (workers) will keep trying to complete them and the queue will not become empty. Therefore we will not be able to bring in new tasks for a fresh crawl.
Maintain a different queue (and some threaders for it) that will keep on working no matter what. Duplicate tasks need to be handled separately.
@aviaryan @saurabhjn76 @bxute : Please give some views.
@singh-pratyush96 Are you using a timeout (or number of trials)? You can use a backoff mechanism. Using a map as the data structure, you store the jobs in the queue: [G][G][G][X][G][G][][][][][][]..... Let EP be the current natural end position of the queue; in the above example it is 6. Let SP be the point till which jobs have been done; at the start it is 0. Let's suppose the 'X' job is bound to fail always.
Suppose all the G jobs ran well and now it's time for X. The X job now crosses the timeout (or max retries, whatever) and so it will be put back in the map at an offset of Z (suppose 20). Therefore the new position of the X job is 20 + 4. Note that EP is still 6.
If new jobs need to be added, EP is incremented. Here new jobs will be added at 7, then 8, then 9 and so on. If EP grows up to 23, then the new job will see that 24 exists, so it will be added at 25 and EP too becomes 25.
I hope you get the basic idea of what I am saying. You can also increase the offset of an item by a factor in case of a second failure, third and so on. For that you use another map ZF: when item 4 fails for the first time, set ZF[4 + 20 + ZF[4]*20] = ZF[4] + 1, so next time ZF[24 + 20 + ZF[24]*20] = ZF[24] + 1. If ZF is not set at some value, that means ZF for it is 0.
If I get the situation correctly, this should work. (There is one queue, and there are many threads which take items from the queue one by one.)
The solution you mentioned is a classic and effective one.
But the main issue is deciding when to assume that one crawl cycle is finished. Currently, the execution is as follows -
while task_queue is not empty:
    try:
        take a task from work queue and complete it
    except:
        put the task back in queue
This is how workers do their job. Once all workers have finished (the while loop has ended for all of them), the next crawl cycle starts.
Now, for the X you mentioned, there will be at least one thread trying to complete the task, hence stopping the next crawl cycle. In the end, we will end up having one thread continuously taking X from the queue, trying to complete it, getting an error and putting it back in the queue, and we will never get past the join() call for that thread.
This is one of the reasons why I suggested a different queue and thread for erroneous tasks. We will not have to join on it, and the rest of the stuff can execute as normal.
Hope you got it.
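The two-queue idea can be sketched as below: workers drain the main queue and hand failing tasks to a separate error queue, whose worker runs as a daemon thread so join() on the main workers still returns. The crawl() stub and task names are placeholders, not the real crawler code:

```python
import queue
import threading
import time

task_queue = queue.Queue()
error_queue = queue.Queue()

def crawl(task):
    # Placeholder: tasks named "bad-*" stand in for permanently failing pages.
    if task.startswith("bad"):
        raise RuntimeError("persistent failure")

def worker():
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return                 # loop ends, so join() can return
        try:
            crawl(task)
        except Exception:
            error_queue.put(task)  # hand off instead of re-queueing here
        finally:
            task_queue.task_done()

def error_worker():
    while True:                    # retries forever, off the hot path
        task = error_queue.get()
        try:
            crawl(task)
        except Exception:
            time.sleep(0.05)       # back off before retrying
            error_queue.put(task)
        finally:
            error_queue.task_done()

for t in ["a", "b", "bad-1", "c"]:
    task_queue.put(t)

threading.Thread(target=error_worker, daemon=True).start()
threads = [threading.Thread(target=worker) for _ in range(3)]
for th in threads:
    th.start()
for th in threads:
    th.join()                      # returns even though bad-1 keeps failing
print("next crawl cycle can start")
```

Because the error worker is a daemon, it never blocks the next crawl cycle and simply dies with the process.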
What are the reasons that a task fails? I think analyzing these will give you the solution.
Now, for the X you mentioned, there will be at least one thread trying to complete the task hence stopping from next crawl cycle.
You can limit the number of retries for a job. If the value of ZF exceeds 10, delete the job, and if all jobs are deleted, start the next crawl (hoping that failed jobs will be taken care of in the next crawl).
This is great if there are jobs that will never succeed. If a job is going to eventually succeed after a not-so-large number of retries, your different-queue method will work well.
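A minimal sketch of this retry cap, assuming a per-job failure counter and a limit of 10; `run_cycle`, `attempt`, and `MAX_RETRIES` are illustrative names, not project code:

```python
from collections import deque

MAX_RETRIES = 10

def run_cycle(tasks, attempt):
    """Run one crawl cycle. attempt(task) -> True on success.
    Returns the list of tasks dropped after MAX_RETRIES failures."""
    work = deque(tasks)
    failures = {}
    dropped = []
    while work:
        task = work.popleft()
        if attempt(task):
            failures.pop(task, None)
            continue
        failures[task] = failures.get(task, 0) + 1
        if failures[task] >= MAX_RETRIES:
            dropped.append(task)  # give up; retry in the next crawl cycle
        else:
            work.append(task)     # back of the queue, not the front
    return dropped
```

For example, `run_cycle(["a", "bad", "b"], lambda t: t != "bad")` drains the queue and returns `["bad"]`, so the cycle ends even though one job never succeeds.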
Mostly network connection problems. But some other reasons are Error 404 and Error 500, and these errors do persist for a long time (or even forever) for a web page.
You can limit the number of retries for a job. If the value of ZF exceeds 10, delete the job, and if all jobs are deleted, start the next crawl (hoping that failed jobs will be taken care of in the next crawl).
Looks great. I think I will proceed with it.
In a LIFO queue (or simply a stack), if an error is encountered for a task, the task is (bound to be) put back at the top of the queue. If the error persists, one of the threads will always be wasted trying to do the task. Find a better way to do so.