gaby-de-wilde opened 1 month ago
Super interesting, thanks for taking the time to comment. Definitely going to look into it and try to implement something. Out of curiosity, how did you happen to find my blog?
On HN, before the topic was flagged -.-
https://news.ycombinator.com/item?id=41478785
I think you briefly made it to the front page :)
Ah nice, yeah, I know the rest of HN wasn't too happy with the post and the plug about looking for a job, haha.
You mentioned wanting to make a search engine. I've had this recurring thought about it. Not sure where else to put it, so I'll put it here (I guess).
I haven't looked at it in over a decade, but the p2p search engine YaCy is very old and it worked just fine. Something similar shouldn't be too hard to make. We are spoiled with tools now.
The sales pitch is simple: you download the crawler, point it at your own blog, index it, and build an index of the pages your blog links to. Then index the pages those pages link to, etc. You simply crank up the depth whenever you like.
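For what it's worth, a minimal sketch of that depth-limited crawl in Python (the start URL is a placeholder; a real crawler would also want politeness delays, robots.txt handling, and persistent storage):

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_depth):
    """Breadth-first crawl: depth 0 is your own blog, each extra
    level of depth follows one more hop of outbound links."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    index = {}  # url -> raw html, a stand-in for a real index
    while queue:
        url, depth = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to fetch
        index[url] = html
        if depth >= max_depth:
            continue  # crank max_depth up later to go deeper
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return index
```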
If you are a half-decent blogger, you have articles that link to most of the important websites that fit the subject of your blog.
You put a search box/page on your website that connects to your desktop client, and your visitors can search with options:

- articles on this blog,
- related pages you've linked to,
- a crawl depth of 1-5 to broaden the topical search (but with less-related results),
- other instances.
It scales so well because searching your own blog is the most important; linked pages are nice to have; deeper crawls are still useful but much less important; and searching other instances, the anticlimax if you like, is great but the least important.
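A rough sketch of how a search handler could branch on those scopes (the scope names and the depth-keyed index layout are my own invention here, and the peer objects are assumed to expose a `search()` method):

```python
from dataclasses import dataclass

@dataclass
class Query:
    terms: list[str]
    scope: str      # "blog", "linked", "deep", "instances" -- hypothetical names
    depth: int = 1  # only meaningful for the "deep" scope, 1-5

def search(query, index, peers):
    """Dispatch a query by scope. `index` maps crawl depth -> {url: text};
    `peers` is a list of other instances the query can be forwarded to."""
    if query.scope == "blog":
        candidates = index.get(0, {})        # depth 0: your own articles
    elif query.scope == "linked":
        candidates = index.get(1, {})        # depth 1: pages you link to
    elif query.scope == "deep":
        candidates = {}
        for d in range(query.depth + 1):     # broaden up to the chosen depth
            candidates.update(index.get(d, {}))
    else:  # "instances": fan out to other nodes, the p2p part
        results = []
        for peer in peers:
            results.extend(peer.search(query))  # assumed peer interface
        return results
    return [url for url, text in candidates.items()
            if all(t.lower() in text.lower() for t in query.terms)]
```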
The crappiest hardware can do 50,000 pages per day; if you run it slowly in the background at, say, 100 pages per day on average, that is still 36,500 every year.
More usual is to be excited about the newfound tool and run it for a few hours the first day. You are initially shocked by how useful it is. The next day you crawl a few more pages, until you get bored with it. You look again after a while and do one more good crawl. A few years later you have an oddly large index.
You might want to run it automatically when your RSS feed updates.
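For example, a small polling loop that re-crawls when the feed changes (the URLs are placeholders, and it reuses the `crawl()` sketch from above; hashing the raw feed is a crude change detector, since any edit to the feed file triggers it):

```python
import hashlib
import time
import urllib.request

FEED_URL = "https://example.com/feed.xml"  # placeholder: your blog's RSS feed
BLOG_URL = "https://example.com/"          # placeholder: the site to re-crawl

def feed_fingerprint(url):
    """Hash the raw feed so any new post changes the fingerprint."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def watch(interval_seconds=3600):
    last = None
    while True:
        try:
            current = feed_fingerprint(FEED_URL)
            if current != last:
                last = current
                crawl(BLOG_URL, max_depth=1)  # re-crawl on feed update
        except Exception:
            pass  # transient network errors: try again next tick
        time.sleep(interval_seconds)
```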
If you use it once in a while, it is easy to ban instances that are full of spam.
YaCy checks all results returned by other nodes by fetching the HTML and looking for the keywords on the page. This worked well. A very stale index may reflect poorly on the node, but it may also be full of material that is important to you.
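That check is simple enough to sketch (this is my reading of the behavior, not YaCy's actual code):

```python
import urllib.request

def verify_result(url, keywords):
    """Fetch the page a peer claims matches and confirm the
    keywords actually appear in its HTML; drop it otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace").lower()
    except Exception:
        return False  # unreachable pages fail verification
    return all(kw.lower() in html for kw in keywords)

def filter_peer_results(results, keywords):
    """Keep only peer-supplied URLs that still contain the query terms."""
    return [url for url in results if verify_result(url, keywords)]
```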
You would get crusty results at times, but this is a feature, not a bug. There is no man behind the curtain deciding what you may and may not look at.
If your client is not running, the search box/page on your blog only does p2p, but it is likely still able to search your domain. What is a lot of posts for a blog is not a lot for a crawler.

You can glue all kinds of products onto this. Besides a DB, YaCy keeps the full text of all crawled pages, but only the text. If users want a feature that can't be done for free, you can sell it to them. If someone has a website that is hard to index, they can customize their crawler themselves or pay to have it done.
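Keeping that full text around is cheap; a sketch using SQLite's FTS5 full-text index (assuming your SQLite build includes FTS5, as most do):

```python
import sqlite3

def open_index(path="crawl.db"):
    """One FTS5 table holds the extracted text of every crawled page."""
    db = sqlite3.connect(path)
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)")
    return db

def store_page(db, url, text):
    db.execute("INSERT INTO pages (url, body) VALUES (?, ?)", (url, text))
    db.commit()

def query_index(db, terms):
    """FTS5 MATCH gives ranked full-text search out of the box."""
    return db.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank",
        (" ".join(terms),),
    ).fetchall()
```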