jaeles-project / gospider

Gospider - Fast web spider written in Go
MIT License

Duplicate URLs #23

Open jaikishantulswani opened 4 years ago

jaikishantulswani commented 4 years ago

@j3ssie is there any way to avoid duplicate URLs? On some domains the crawl never ends and keeps repeating the same requests, which adds hours of crawl time:

[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/

A switch to filter/skip URLs that return a particular status code would also be useful.

StasonJatham commented 3 years ago

Hey buddy,

I figured I would post this in here too: https://github.com/jaeles-project/gospider/issues/21#issuecomment-953916547

They actually try to get rid of duplicates with their own "stringset" implementation. The funny thing is that they don't need that code at all, because colly already handles deduplication for them. The issue seems to be that the check only happens within each element type: a URL found in a form is compared against URLs found in other forms, but not against URLs found in a tags, so the same URL can slip through. Long story short: you can modify the crawler to use colly's built-in filter instead, and then it works.

I can't really share my code because I use gospider as a library rather than a command-line tool, so I took out cobra and start it from a config file.
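
To illustrate the idea, though, here is a minimal sketch (not my actual code, and example.com is just a placeholder) of a crawler that leans on colly's built-in deduplication instead of a separate stringset:

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// colly deduplicates by default: AllowURLRevisit is false on a
	// fresh Collector, so no separate stringset is needed.
	c := colly.NewCollector()

	// One callback for links found in both a tags and forms, so the
	// same URL is deduplicated no matter which element it came from.
	c.OnHTML("a[href], form[action]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		if link == "" {
			link = e.Attr("action")
		}
		// Visit() returns colly.ErrAlreadyVisited for a URL that was
		// already requested, so duplicates die here.
		_ = e.Request.Visit(link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("visiting:", r.URL)
	})

	_ = c.Visit("https://example.com/")
}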

I don't really care about the status code myself, but implementing that filter is easy:

// colly exposes the status code on the response object
response.StatusCode

You can then add an if statement before .Visit() is run and check for that status code.
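
For example, continuing with the collector c from the sketch above (skipCodes is hypothetical, not an existing gospider option, and fmt is assumed imported):

// Hypothetical set of status codes to skip; adjust to taste.
skipCodes := map[int]bool{403: true, 404: true}

c.OnResponse(func(r *colly.Response) {
	// Skip reporting responses with a filtered status code; the same
	// check could sit in front of e.Request.Visit() in an OnHTML
	// callback to stop following links from filtered pages.
	if skipCodes[r.StatusCode] {
		return
	}
	fmt.Printf("[url] - [code-%d] - %s\n", r.StatusCode, r.Request.URL)
})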