safesploitOrg / doogle

Doogle is a search engine and web crawler which can search indexed websites and images
https://search.safesploit.com/
MIT License
32 stars 16 forks

Checkup ? Question (Sitemap Crawl Functionality | Crawling Description Question) #6

Closed RedWilly closed 1 year ago

RedWilly commented 1 year ago

Hey, I wanted to ask you some questions.

I would like to use your search engine to index a website from its sitemap.xml (index and crawl the whole content of the website). That way it would be easier to point the engine at the pages it needs to search, and much easier to find the content you are looking for.

I followed your README file, but each time Doogle crawls a website I find that it only saves the page title and the website description, e.g. for The Hacker News website. When I index the site and search for a keyword, the description is almost the same for every result, even though the URL and title are specific to each page.

E.g. when I search for "Malware", the result presented is:

title: Malware Strains Targeting Python and JavaScript Developers
description: The Hacker News is the most trusted and popular cybersecurity publication for information security professionals seeking breaking news, actionable insights
URL: https://thehackernews.com/2022/12/malware-strains-targeting-python-and.html

See, the description is the main website description instead of the blog page's own description.

I am not sure if I am missing something.

pedrolaxe commented 1 year ago

@RedWilly I'm having to make updates, but from what I found, this code was copied from another repo. Basically, what he did was create a more complete README and change the name of the project.

I would really like it to be from the original author to continue with the improvements. Original repo: https://github.com/phucvo0709/Clone-Google-Search-Engine

safesploit commented 1 year ago

@pedrolaxe

I initially developed Doogle from Reece Kenney's course on building a Google search engine clone. I suspect the Clone-Google-Search-Engine repo you provided was built using his course too, as I can see object-oriented PHP and PDO references similar to Doogle. No copying of repos occurred.

safesploit commented 1 year ago

@RedWilly I never thought about using sitemap.xml to crawl the website. Doogle crawls pages and inserts database entries for links and images using the insertLink($url, $title, $description, $keywords) and insertImage($url, $src, $alt, $title) functions respectively. As of v1.1.2-beta, all crawling functionality is contained within crawl.php and classes/DomDocumentParser.php.
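For anyone who wants to experiment with the sitemap idea, here is a minimal sketch of how the page URLs could be read from a sitemap.xml before handing them to a crawler. It is written in Python for illustration only and is not part of Doogle; the function name `sitemap_urls` and the sample document are hypothetical, and it assumes the sitemap follows the standard sitemaps.org 0.9 format.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org 0.9 protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document, in order.

    Hypothetical helper, not part of Doogle. In a sitemap index file the
    <loc> entries point at further sitemaps rather than at pages, so a
    real crawler would need to fetch and parse those as well.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Made-up sample sitemap used only to show the expected input shape.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""
```

Each returned URL could then be crawled individually, so that every page listed in the sitemap gets its own database entry instead of relying on links discovered from the start page.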

Regarding the link you provided, I am not able to replicate your issue (see image below). I am running Doogle v1.1.2-beta and PHP 8.1.

[Image: database sample]
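One way to check what a crawler can see on a given page is to pull out the title and meta tags directly. The sketch below is a Python illustration, not Doogle's PHP code; it assumes (this is not confirmed in the thread) that the stored description comes from the page's `<meta name="description">` tag. Under that assumption, a site that serves the same meta description on every article would produce identical descriptions in the index. The names `PageMetaParser` and `page_fields` are hypothetical.

```python
from html.parser import HTMLParser


class PageMetaParser(HTMLParser):
    """Collects the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


def page_fields(html):
    """Return the title, description and keywords found in an HTML string."""
    parser = PageMetaParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
    }


# Made-up page used only to show the expected input shape.
SAMPLE_HTML = """<html><head>
<title>Example post</title>
<meta name="description" content="A page-specific summary.">
<meta name="keywords" content="example, post">
</head><body><p>Hello</p></body></html>"""
```

Running `page_fields` on the HTML of two different articles from the same site and comparing the `description` values would show whether the site itself repeats one description across pages.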