sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Apache License 2.0

Library does not work when link crawled is very slow #203

Closed sbonello closed 5 years ago

sbonello commented 5 years ago

When the library is used on a link that is very slow, the worker thread count drops to zero and the thread manager closes, because there wasn't enough time for the links to be scheduled.

The delay added to wait for the list to be built is not long enough:

Log.Debug("Waiting for links to be scheduled...");
await Task.Delay(2500).ConfigureAwait(false);
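
To illustrate the failure mode described above, here is a minimal sketch of why a fixed wait can misjudge a slow page, next to a check that also looks at requests still in flight. This is not Abot's actual code: `CrawlState`, `Demo`, and every member name are hypothetical, the page load is simulated with `Task.Delay`, and the timings are scaled down (100 ms wait, 400 ms page) so the example runs quickly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical bookkeeping for a crawler; these are NOT Abot's real types.
public sealed class CrawlState
{
    private int _inFlight;   // page requests started but not yet finished
    private int _scheduled;  // links queued and waiting to be crawled

    public int InFlightCount => Volatile.Read(ref _inFlight);
    public int ScheduledCount => Volatile.Read(ref _scheduled);

    public void BeginRequest() => Interlocked.Increment(ref _inFlight);

    // Queue the discovered links BEFORE marking the request finished, so the
    // state never looks idle while links are still about to be scheduled.
    public void EndRequest(int linksFound)
    {
        Interlocked.Add(ref _scheduled, linksFound);
        Interlocked.Decrement(ref _inFlight);
    }
}

public static class Demo
{
    // Simulates a page that takes `loadTime` to respond and yields 3 links.
    public static async Task SlowPageAsync(CrawlState state, TimeSpan loadTime)
    {
        state.BeginRequest();
        await Task.Delay(loadTime).ConfigureAwait(false);
        state.EndRequest(linksFound: 3);
    }

    // Fragile: wait a fixed time, then look only at the scheduled-link count.
    // Returns true when it concludes the crawl is finished.
    public static async Task<bool> FixedDelayCheckAsync(CrawlState state, TimeSpan wait)
    {
        await Task.Delay(wait).ConfigureAwait(false);
        return state.ScheduledCount == 0;
    }

    // Sturdier: keep polling while any request is still in flight, and only
    // then decide based on the scheduled-link count.
    public static async Task<bool> ConditionCheckAsync(CrawlState state, TimeSpan pollInterval)
    {
        while (state.InFlightCount > 0)
        {
            await Task.Delay(pollInterval).ConfigureAwait(false);
        }
        return state.ScheduledCount == 0;
    }

    public static async Task Main()
    {
        var pageTime = TimeSpan.FromMilliseconds(400);

        var a = new CrawlState();
        var pageA = SlowPageAsync(a, pageTime);
        bool fixedSaysDone = await FixedDelayCheckAsync(a, TimeSpan.FromMilliseconds(100));
        await pageA; // let the simulated request finish before moving on

        var b = new CrawlState();
        var pageB = SlowPageAsync(b, pageTime);
        bool conditionSaysDone = await ConditionCheckAsync(b, TimeSpan.FromMilliseconds(20));
        await pageB;

        Console.WriteLine($"Fixed delay says done: {fixedSaysDone}");
        Console.WriteLine($"Condition check says done: {conditionSaysDone}");
    }
}
```

In this sketch the fixed-delay check reports the crawl as finished while the slow page is still loading, whereas the condition-based check waits for the request to complete and then sees the three queued links. Whether the real crawler can be restructured this way depends on internals not shown in this thread.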

erherhh4herh commented 5 years ago

Can confirm, abot2 just does not work at all... Recommend you stick with abot1 💯

sjdirect commented 5 years ago

Can you give me some example sites/pages please?

sbonello commented 5 years ago

http://zalahat.com

Is an example. I have more.

On Sun, Oct 6, 2019 at 4:22 PM Steven notifications@github.com wrote:

Can you give me some example sites/pages please?

sjdirect commented 5 years ago

Appreciate it. Yes, a few more would be helpful as well.

sbonello commented 5 years ago

gamble1x2.com, casinosms.pl, centrodeapostas.com

Anything slower than 3 seconds will not be parsed.

sjdirect commented 5 years ago

Instead of just waiting longer, I reverted to the old way of blocking the thread with Thread.Sleep.

sjdirect commented 5 years ago

@sbonello @erherhh4herh Can you both verify this has been fixed with nuget version 2.0.45?

sbonello commented 5 years ago

Nope, same issue. If the site takes longer than 2.5 seconds to load, it will not be parsed.

Simon

sjdirect commented 5 years ago

I'm able to reproduce this and have a fix that I'm still testing. Bear with me; hopefully I can polish it up tomorrow.

sjdirect commented 5 years ago

I believe I fixed the issue. I even added some integration tests to make sure Zalahat is crawled correctly. The latest nuget package (2.0.46) should be up to date. Please verify whether it works for you.

sjdirect commented 5 years ago

Assuming the silence means it's working as expected.