safesploitOrg / doogle

Doogle is a search engine and web crawler which can search indexed websites and images
https://search.safesploit.com/
MIT License

Bug: Crawling non-ASCII characters (URL) #4

Open safesploit opened 2 years ago

safesploit commented 2 years ago

When crawling the Japanese Wikipedia (ja.wikipedia.org/wiki/メインページ), the URL is indexed in its percent-encoded form rather than with the original characters: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

dehlirious commented 1 year ago

Hey, I've written this up and it works, but am I missing anything?

I tested it and it works fine. I also tried a URL containing a ` character (the only thing not covered by htmlspecialchars), and it didn't break anything.

I've also noticed that HTML tags are removed from URL titles: if the title says "<b>Hi", it results in "Hi". That is kind of an issue depending on the circumstance; I'd rather the title be processed with htmlspecialchars than have the tags removed. Anyway, at line 88 of crawl-manual, insert:

$url = htmlspecialchars(urldecode($url), ENT_QUOTES, "UTF-8");
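
For anyone who wants to try the idea outside the crawler, here is a minimal standalone sketch of the same decode-then-escape step. The function name readableUrl is hypothetical and not part of Doogle; the only real calls are PHP's built-in urldecode and htmlspecialchars, and it assumes the decoded bytes are valid UTF-8.

```php
<?php
// Hypothetical helper (not from the Doogle codebase): turns a percent-encoded
// URL into a readable string that is safe to echo into an HTML page.
function readableUrl(string $url): string
{
    // urldecode() converts %XX sequences back into raw bytes,
    // e.g. %E3%83%A1 becomes the UTF-8 character メ.
    $decoded = urldecode($url);

    // htmlspecialchars() with ENT_QUOTES escapes &, <, >, " and '
    // so the decoded text cannot inject markup.
    return htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
}

echo readableUrl('https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8'), PHP_EOL;
// prints: https://ja.wikipedia.org/wiki/メインページ
```

One thing that may be worth checking: urldecode also turns a literal + into a space, while rawurldecode leaves + alone, so rawurldecode might be the safer choice for URL paths. The decoded string is also best treated as display text only, since decoding sequences such as %2F or %3F can change what the URL points to if it is fetched again.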