iiitv / lyrics-crawler

Simple crawler to collect lyrics, written in Python
Apache License 2.0

Handle erroneous tasks differently. #16

Closed singhpratyush closed 8 years ago

singhpratyush commented 8 years ago

In a LIFO queue (or simply a stack), if an error is encountered for a task, the task is put back at the top of the queue. If the error persists, one of the threads will always be wasted trying to do that task.

Find a better way to do so.

singhpratyush commented 8 years ago

Use a priority queue and give erroneous tasks the least priority.

Advantages

Erroneous tasks won't be retried again and again until all other tasks are complete.

Disadvantages

If the problem is likely to persist for a long time, the worker threads will keep trying to complete the erroneous tasks and the queue will never become empty. Therefore we will not be able to bring in new tasks for a fresh crawl.
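
For illustration, a minimal sketch of this idea with Python's queue.PriorityQueue; crawl() and the task objects here are placeholders, not code from this repository.

import itertools
import queue

task_queue = queue.PriorityQueue()
_counter = itertools.count()   # tie-breaker so tasks themselves are never compared

FRESH, RETRY = 0, 1            # lower number means higher priority

def enqueue(task, priority=FRESH):
    task_queue.put((priority, next(_counter), task))

def worker():
    while not task_queue.empty():
        priority, _, task = task_queue.get()
        try:
            crawl(task)                # placeholder for the real crawling work
        except Exception:
            enqueue(task, RETRY)       # demote the erroneous task to least priority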

singhpratyush commented 8 years ago

Maintain a different queue for erroneous tasks

Maintain a different queue (and some worker threads for it) that will keep on working no matter what. Duplicate tasks need to be handled separately.
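
A rough sketch of how that could look; again, crawl() and the task objects are placeholders, and the duplicate handling shown is just one possible choice.

import queue
import threading

main_queue = queue.Queue()
error_queue = queue.Queue()
_seen_errors = set()           # track tasks already queued as erroneous
_seen_lock = threading.Lock()

def main_worker():
    while not main_queue.empty():
        task = main_queue.get()
        try:
            crawl(task)                          # placeholder for the real work
        except Exception:
            with _seen_lock:
                if task not in _seen_errors:     # handle duplicates separately
                    _seen_errors.add(task)
                    error_queue.put(task)

def error_worker():
    while True:                                  # keeps working no matter what
        task = error_queue.get()
        try:
            crawl(task)
        except Exception:
            error_queue.put(task)                # retry later, off the main path

threading.Thread(target=error_worker, daemon=True).start()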

singhpratyush commented 8 years ago

@aviaryan @saurabhjn76 @bxute : Please give some views.

aviaryan commented 8 years ago

@singh-pratyush96 Are you using a timeout (or a number of retries)? You can use a backoff mechanism. Using a map as the data structure, you store the jobs in the queue: [G][G][G][X][G][G][][][][][][]..... Let EP be the current natural end position of the queue; in the above example it is 6. Let SP be the point till which jobs have been done; at the start it is 0. Let's suppose the 'X' job is bound to fail always.

Suppose all the G jobs ran well and now it's time for X. The X job crosses the timeout (or the maximum number of retries, whatever), so it is put back into the map at an offset of Z (suppose 20). Therefore the new position of the X job is 20 + 4 = 24. Note that EP is still 6.

If new jobs need to be added, EP is incremented. Here new jobs will be added at 7, then 8, then 9 and so on. If EP grows up to 23, then a new job will see that 24 is occupied, so it will be added at 25 and EP also becomes 25.

I hope you get the basic idea of what I am saying. You can also increase the offset of an item by a factor on its second failure, third and so on. For that you use another map ZF: when item 4 fails for the first time, set ZF[4 + 20 + ZF[4]*20] = ZF[4] + 1. So the next time, ZF[24 + 20 + ZF[24]*20] = ZF[24] + 1. If ZF is not set at some position, that means ZF for it is 0.

If I get the situation correctly, this should work. (There is one queue, and there are many threads which take items from the queue one by one.)
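
Here is a loose sketch of that offset/ZF scheme; the names are mine and the indexing details are a simplification of the description above.

Z = 20                       # base back-off offset

jobs = {}                    # position -> job (the map holding the queue)
ZF = {}                      # position -> number of failures so far (unset means 0)
EP = 0                       # natural end position: where the next fresh job goes

def add_job(job):
    global EP
    while EP in jobs:        # skip slots already taken by rescheduled jobs
        EP += 1
    jobs[EP] = job
    EP += 1

def reschedule_failed(pos):
    # Move the job at `pos` forward by Z plus Z per previous failure,
    # and remember the new failure count at its new position.
    failures = ZF.pop(pos, 0)
    new_pos = pos + Z + failures * Z
    jobs[new_pos] = jobs.pop(pos)
    ZF[new_pos] = failures + 1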

singhpratyush commented 8 years ago

The solution you mentioned is a classic and effective one.

But the main issue is deciding when to assume that one crawl cycle is finished. Currently, the execution is as follows -

while not task_queue.empty():
    task = task_queue.get()
    try:
        complete(task)            # take a task from the work queue and complete it
    except Exception:
        task_queue.put(task)      # put the task back in the queue

This is how the workers do their job. Once all workers have finished (the while loop has ended for all of them), the next crawl cycle starts.

Now, for the X you mentioned, there will be at least one thread trying to complete the task, hence stopping the next crawl cycle. In the end, we will end up with one thread continuously taking X from the queue, trying to complete it, getting an error and putting it back in the queue, and we will never get past the join() call for that thread.

This is one of the reasons why I suggested a different queue and thread for erroneous tasks. We will not have to join on it and the rest of the execution can proceed as normal.

Hope you got it.
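
To make the cycle structure concrete, here is a simplified sketch of the flow being described; fill_task_queue(), worker() and N_WORKERS are placeholders for the actual code.

import threading

N_WORKERS = 4

while True:                              # one iteration is one crawl cycle
    fill_task_queue()                    # placeholder: bring in fresh tasks
    workers = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()                         # blocks forever if some task never succeeds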

aviaryan commented 8 years ago

What are the reasons that a task fails?

  1. Network problem
  2. Disconnection from server because too many connections
  3. ??

I think analyzing these will give you the solution.

aviaryan commented 8 years ago

Now, for the X you mentioned, there will be at least one thread trying to complete the task, hence stopping the next crawl cycle.

You can limit the number of retries for a job. If the value of ZF exceeds 10, delete the job, and if all jobs are deleted, start the next crawl (hoping that the failed jobs will be taken care of in the next crawl).

This is great if there are jobs that will never ever succeed. If a job is going to succeed eventually after a not-so-large number of retries, your separate-queue method will work well.
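
A minimal sketch of the retry-limit variant, carrying a retry count alongside each task; names are placeholders as before.

MAX_RETRIES = 10

def worker():
    while not task_queue.empty():
        task, retries = task_queue.get()
        try:
            crawl(task)                              # placeholder for the real work
        except Exception:
            if retries < MAX_RETRIES:
                task_queue.put((task, retries + 1))  # try again later
            # else: drop the job; the next crawl cycle will pick it up afresh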

singhpratyush commented 8 years ago

Mostly network connection problems. But some other reasons are Error 404 and Error 500, and these errors do persist for a long time (or even forever) for a web page.

singhpratyush commented 8 years ago

You can limit the number of retries for a job. If the value of ZF exceeds 10, delete the job, and if all jobs are deleted, start the next crawl (hoping that the failed jobs will be taken care of in the next crawl).

Looks great. I think I will proceed with it.