imthaghost / goclone

Website Cloner - Utilizes powerful Go routines to clone websites to your computer within seconds.
https://goclone.io
MIT License
1.32k stars, 282 forks

Panic on DNS resolution issues #77

Open boyter opened 1 month ago

boyter commented 1 month ago

I am running a Pi-hole and thought I would try goclone against some websites. I encountered the following panic, which surfaces through a colly callback:

$ goclone https://searchcode.com/
Extracting -->  https://searchcode.com/
Css found --> /static/css/newstyles.css
Extracting -->  https://searchcode.com/static/css/newstyles.css
Js found --> //cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=searchcodecom
Extracting -->  https://cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=searchcodecom
panic: Get "https://cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=searchcodecom": dial tcp 0.0.0.0:443: connect: connection refused

goroutine 35 [running]:
github.com/imthaghost/goclone/pkg/crawler.Extractor({0x14000407c00, 0x55}, {0x140003a6000, 0x21})
    /Users/ghost/go/src/github.com/imthaghost/goclone/pkg/crawler/extractor.go:35 +0x24c
github.com/imthaghost/goclone/pkg/crawler.Collector.func2(0x140004aec60)
    /Users/ghost/go/src/github.com/imthaghost/goclone/pkg/crawler/collector.go:37 +0x120
github.com/gocolly/colly/v2.(*Collector).handleOnHTML.func1(0x0, 0x140004a1560)
    /Users/ghost/go/pkg/mod/github.com/gocolly/colly/v2@v2.1.0/colly.go:1074 +0x70
github.com/PuerkitoBio/goquery.(*Selection).Each(0x140004a1530, 0x14000073e30)
    /Users/ghost/go/pkg/mod/github.com/!puerkito!bio/goquery@v1.5.1/iteration.go:10 +0x50
github.com/gocolly/colly/v2.(*Collector).handleOnHTML(0x140003ac000, 0x140003c06c0)
    /Users/ghost/go/pkg/mod/github.com/gocolly/colly/v2@v2.1.0/colly.go:1064 +0x288
github.com/gocolly/colly/v2.(*Collector).fetch(0x140003ac000, {0x140003a4060, 0x17}, {0x10531f364, 0x3}, 0x1, {0x0, 0x0}, 0x0, 0x1400038c210, ...)
    /Users/ghost/go/pkg/mod/github.com/gocolly/colly/v2@v2.1.0/colly.go:676 +0x7a0
created by github.com/gocolly/colly/v2.(*Collector).scrape
    /Users/ghost/go/pkg/mod/github.com/gocolly/colly/v2@v2.1.0/colly.go:574 +0x43c

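This only occurs on websites that reference content the Pi-hole blocks. The blocked hostname resolves to 0.0.0.0, the connection is refused, and goclone panics as shown above. Portions of the site are still cloned, but an error like this seems like something that should be handled rather than crashing the run.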

Disabling the Pi-hole resolves the issue. I understand that running behind a Pi-hole is not the expected path, but I imagine other DNS configurations could block or fail to resolve hosts in the same way and produce the same panic.