nikhil-thomas / web-crawler

An application to generate a site map of a given domain
MIT License

crawling a website results in a panic #1

Open sthaha opened 5 years ago

sthaha commented 5 years ago

Steps to reproduce

  1. Build configurable-crawler with the race flag enabled
  2. Run the crawler as below:

     configurable-crawler -l 0 -t 100ms http://dev.thaha.xyz

Expected result

Prints the sitemap without crashing.

Actual result

Panics at the end of execution:

INFO[0000] root   : http://dev.thaha.xyz
INFO[0000] skip   : mailto:sun.ilth.aha@gmail.com
INFO[0000] skip   : https://github.com/sthaha
INFO[0000] add    : http://dev.thaha.xyz/
INFO[0000] add    : http://dev.thaha.xyz/about/
INFO[0000] add    : http://dev.thaha.xyz/shell/tools/xargs/2017/10/12/fun-with-xargs.html
INFO[0000] add    : http://dev.thaha.xyz/git/notes/2017/02/01/git-notes-02.html
INFO[0000] add    : http://dev.thaha.xyz/git/notes/2016/12/06/git-aliases.html
INFO[0000] add    : http://dev.thaha.xyz/ruby/exception-handling/2016/11/25/TIL-why-not-rescue-exception.html
INFO[0000] add    : http://dev.thaha.xyz/rails/activerecord/count/2016/11/25/TIL-activerecord-association-count.html
INFO[0000] add    : http://dev.thaha.xyz/feed.xml
INFO[0000] links  : 1 : queue : 0
INFO[0000] queue  : empty : start crawiling stop timeout : 100ms
INFO[0000] skip   : mailto:sun.ilth.aha@gmail.com
INFO[0000] skip   : https://github.com/sthaha
INFO[0000] links  : 2 : queue : 0
INFO[0000] queue  : empty : start crawiling stop timeout : 100ms
INFO[0000] queue  : empty : stop crawiling

::::: Site Map: http://dev.thaha.xyz ::::

http://dev.thaha.xyz
  http://dev.thaha.xyz/
  http://dev.thaha.xyz/about/
  http://dev.thaha.xyz/shell/tools/xargs/2017/10/12/fun-with-xargs.html
  http://dev.thaha.xyz/git/notes/2017/02/01/git-notes-02.html
  http://dev.thaha.xyz/git/notes/2016/12/06/git-aliases.html
  http://dev.thaha.xyz/ruby/exception-handling/2016/11/25/TIL-why-not-rescue-exception.html
  http://dev.thaha.xyz/rails/activerecord/count/2016/11/25/TIL-activerecord-association-count.html
  http://dev.thaha.xyz/feed.xml
INFO[0000] queue  : empty : stop crawiling
panic: close of closed channel

goroutine 85 [running]:
github.com/nikhil-thomas/web-crawler/internal/crawlers/concurrent.endOperationTimeout(0xc4200922a0, 0xc4200bc480)
        /home/sthaha/go/src/github.com/nikhil-thomas/web-crawler/internal/crawlers/concurrent/concurrent.go:176 +0x18b
created by github.com/nikhil-thomas/web-crawler/internal/crawlers/concurrent.makeSiteMap.func1
        /home/sthaha/go/src/github.com/nikhil-thomas/web-crawler/internal/crawlers/concurrent/concurrent.go:160 +0x480
nikhil-thomas commented 5 years ago

This happens because a duplicate timeout closes an already-closed channel.

This occurs when the queue of links to be processed is temporarily empty. The timeout was added to make sure no new links are missed due to network delay, which would otherwise result in a partial sitemap.
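
For illustration only (not the project's actual code), a minimal Go sketch of the failure mode: two "queue is empty" timeouts end up armed at the same time, and each tries to close the same channel when it fires, so the second close panics.

```go
package main

import "time"

func main() {
	done := make(chan struct{})

	// Two empty-queue timeouts are armed concurrently; each one
	// tries to close the same done channel when it fires.
	for i := 0; i < 2; i++ {
		go func() {
			<-time.After(100 * time.Millisecond)
			close(done) // the second close panics: "close of closed channel"
		}()
	}

	time.Sleep(300 * time.Millisecond)
}
```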

I believe a cancellable timeout using context.WithCancel() will resolve the issue.
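
A rough sketch of that idea, assuming a simplified shape for endOperationTimeout (the real signature in concurrent.go may differ): each timeout goroutine also watches a cancellable context, and arming a new timeout first cancels the previous one, so only the last timeout ever closes the channel.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// endOperationTimeout closes done when the timeout elapses, unless the
// context is cancelled first (e.g. because new links arrived in the queue).
func endOperationTimeout(ctx context.Context, timeout time.Duration, done chan struct{}) {
	select {
	case <-time.After(timeout):
		close(done)
	case <-ctx.Done():
		// Cancelled: a newer timeout has taken over, so do not close done here.
	}
}

func main() {
	done := make(chan struct{})

	// First timeout is abandoned because new links showed up.
	ctx1, cancel1 := context.WithCancel(context.Background())
	go endOperationTimeout(ctx1, 100*time.Millisecond, done)
	cancel1()

	// Second timeout runs to completion and closes done exactly once.
	ctx2, cancel2 := context.WithCancel(context.Background())
	defer cancel2()
	go endOperationTimeout(ctx2, 100*time.Millisecond, done)

	<-done
	fmt.Println("crawl finished without a double close")
}
```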

I shall implement the fix and push the changes in 2 days.

nikhil-thomas commented 5 years ago

Fixed in 7f5c937bfe8a77a665f0943dcbc713b2bcb03664: added a mechanism to reset the timeout if there is an active timeout.
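
For reference, a hedged sketch of what "reset the timeout if there is an active timeout" can look like in Go (not the literal code in that commit): a single time.Timer guards the empty-queue condition, and later empty-queue events reset it instead of arming a second timer, so the channel is only ever closed once.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	done := make(chan struct{})
	timeout := 100 * time.Millisecond

	// One timer guards the "queue stayed empty" condition; the goroutine
	// that closes done can therefore only ever run once.
	timer := time.NewTimer(timeout)
	go func() {
		<-timer.C
		close(done)
	}()

	// The queue became empty again while a timeout was already active:
	// reset the existing timer instead of arming a second one.
	if !timer.Stop() {
		// Timer already fired; drain the channel if the value is still pending.
		select {
		case <-timer.C:
		default:
		}
	}
	timer.Reset(timeout)

	<-done
	fmt.Println("sitemap printed once, no panic")
}
```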