solomonxie / sengine_101

Build a personal search engine from scratch for study
Apache License 2.0

Project Draft #1

Open solomonxie opened 2 years ago

solomonxie commented 2 years ago

Project Draft

Project vision: This search engine is NOT a general search engine like Google, but an in-depth domain search engine.

Features

TODO:

Inventory

Inventory:

Architecture | Tech Stack

Stack:

Antiscrape Strategy

It would take a lot of effort to implement a proxy strategy for scraping, so in this project we will use a single IP for scraping.

We need a domain-aware scraping scheduler to avoid sending too many requests to the same domain.
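A minimal sketch of such a scheduler, assuming one queue per domain and a fixed minimum delay between requests to the same domain (the delay value is a placeholder, not something decided in this draft):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse


class DomainAwareScheduler:
    """Hold one queue per domain and enforce a minimum delay between
    requests to the same domain, so a single-IP crawler stays polite."""

    def __init__(self, min_delay_seconds=10.0):
        self.min_delay = min_delay_seconds
        self.queues = defaultdict(deque)   # domain -> pending URLs
        self.last_fetch = {}               # domain -> last fetch time

    def enqueue(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        """Return a URL whose domain is ready to be hit again, or None."""
        now = time.monotonic()
        for domain, queue in self.queues.items():
            if queue and now - self.last_fetch.get(domain, 0.0) >= self.min_delay:
                self.last_fetch[domain] = now
                return queue.popleft()
        return None
```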

solomonxie commented 2 years ago

How to build a search engine from scratch

Steps:

  1. Crawling
  2. Parsing
  3. Indexing

Crawling

Seed:

The seed is the first link from which you start scraping the whole internet. It's easiest to start with Wikipedia. From the first link we pull all the info and links; from those links we pull further info and links, and so on and so forth.
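A minimal crawl loop along those lines, using requests and BeautifulSoup (the page limit and the plain-text extraction are my own simplifications):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seed_url, max_pages=100):
    """Breadth-first crawl: pull info and links from the seed,
    then from each discovered link, and so on."""
    frontier = [seed_url]
    seen = {seed_url}
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.pop(0)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)   # the "info"
        for a in soup.find_all("a", href=True):       # the links
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

pages = crawl("https://en.wikipedia.org/wiki/Search_engine", max_pages=5)
```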

Parsing

Indexing

solomonxie commented 2 years ago

Idea: Page2Vec: Matrix General Search Engine Algorithm

For each language, we draw an unbounded table sheet: a rectangular matrix with the same list of words as both its row headers and its column headers.

STEP ONE: Page to index. We first pick out all the words in the page, ignoring order, then iterate over each word: find that word's row, and add the link-id to this word-level pool.

Then we iterate to the next word: staying at the same row, we find the column of this word, which gives us a cell as an intersection pool, and we add the link-id to this cell pool. Then we read the next word and do the same until we have finished all words.

Then we take another page and do the same thing over and over again.

Now we have a whole table in which each cell has a pool of links.

STEP TWO: Query. When the user types the first keyword, we find that keyword's row and get the row-level pool. When the user types the second keyword, we stay at the same row and mark the intersection; when the user types the third keyword, we mark another intersection in the same row. When the user has finished typing keywords, we take all the keyword cells and compute the intersection of all the marked pools, which yields a list of link-ids.
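A minimal sketch of both steps, representing the "unlimited sheet" as a sparse dict keyed by (row word, column word) pairs; the sparse representation and the set-based pools are my assumptions:

```python
from collections import defaultdict

row_pools = defaultdict(set)    # row word -> word-level pool of link-ids
cell_pools = defaultdict(set)   # (row word, column word) -> cell pool

def index_page(link_id, words):
    """STEP ONE: add the page's link-id to every row pool and to
    every (row, column) intersection pool for its words."""
    unique = set(words)                      # order is ignored
    for row_word in unique:
        row_pools[row_word].add(link_id)
        for col_word in unique:
            cell_pools[(row_word, col_word)].add(link_id)

def search(keywords):
    """STEP TWO: intersect the marked pools along the first keyword's row."""
    row = keywords[0]
    result = set(row_pools.get(row, set()))
    for col in keywords[1:]:
        result &= cell_pools.get((row, col), set())
    return result

index_page(1, ["build", "a", "search", "engine"])
index_page(2, ["search", "party"])
print(search(["search", "engine"]))   # {1}
```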

That list can be huge, so we need pagination and a way to decide which links get displayed first. Now we need a PageRank strategy: the dimensional position of each word.

Each page must have a multidimensional position, where the number of dimensions is the same as the row/column count. User input actually specifies partial dimension info; what we need to do is find the nearest pages around those coordinates.

Imagine that the user gives x, y, z, but we have dimensions a through z. What we can do is first get all the points having the same x, y, z values, and yield the pages with higher general PageRank first.
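A rough sketch of that lookup, assuming each page stores its coordinates as a dict of per-dimension values plus a precomputed PageRank score (both structures are assumptions on my part):

```python
def rank_results(pages, query_coords):
    """Keep pages that match every user-given dimension exactly,
    then order them by general PageRank, highest first."""
    matches = [
        (info["pagerank"], link_id)
        for link_id, info in pages.items()
        if all(info["coords"].get(dim) == val
               for dim, val in query_coords.items())
    ]
    return [link_id for _, link_id in sorted(matches, reverse=True)]

pages = {
    "page-1": {"coords": {"x": 3, "y": 1, "z": 2, "w": 9}, "pagerank": 0.8},
    "page-2": {"coords": {"x": 3, "y": 1, "z": 2, "w": 4}, "pagerank": 0.5},
    "page-3": {"coords": {"x": 7, "y": 1, "z": 2, "w": 1}, "pagerank": 0.9},
}
print(rank_results(pages, {"x": 3, "y": 1, "z": 2}))   # ['page-1', 'page-2']
```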

Each time we parse a page, we should update its PageRank.

Btw, instead of adding a count of 1 for each intersection, we can introduce a concept of "word distance" to weight the count: a word just before the target word gets a distance of -1, the word just after gets 1, and the next word after that gets 2.
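A small sketch of those distances; the draft doesn't say how a distance turns into a weight, so the 1/|distance| weighting below is an assumption:

```python
def distance_weights(words, target_index):
    """Signed distance of every word from the target word: -1 for the
    word just before, 1 just after, 2 for the next one after, etc."""
    weights = {}
    for i, word in enumerate(words):
        distance = i - target_index
        if distance != 0:
            weights[word] = 1.0 / abs(distance)   # assumed weighting
    return weights

print(distance_weights(["fast", "search", "engine", "index"], 1))
# {'fast': 1.0, 'engine': 1.0, 'index': 0.5}
```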

With this indexing algorithm, theoretically we don't need to store any page content but can purely calculate on the fly. All we need to store is the index of URLs.

solomonxie commented 2 years ago

Study: NLP Algorithms on Search Engine

Word Embedding

REF: https://en.wikipedia.org/wiki/Word_embedding?oldformat=true

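As a toy illustration of the concept: each word becomes a dense vector, and semantic similarity becomes vector similarity (the 3-dimensional vectors below are made up; real embeddings are learned and typically have hundreds of dimensions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy hand-made embeddings, for illustration only.
king  = np.array([0.8, 0.6, 0.1])
queen = np.array([0.7, 0.7, 0.1])
apple = np.array([0.1, 0.2, 0.9])

print(cosine(king, queen))   # high: semantically related words
print(cosine(king, apple))   # low: unrelated words
```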

Word2Vec

REF: https://en.wikipedia.org/wiki/Word2vec?oldformat=true

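A minimal training sketch using gensim's Word2Vec implementation (the toy corpus and hyperparameters are placeholders):

```python
from gensim.models import Word2Vec

sentences = [
    ["build", "a", "search", "engine"],
    ["index", "the", "search", "results"],
    ["crawl", "pages", "for", "the", "engine"],
]

# Train skip-gram embeddings (sg=1) on the toy corpus.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["search"][:5])                    # a word's learned vector
print(model.wv.most_similar("search", topn=2))   # its nearest neighbors
```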

Hypernyms

REF: https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy?oldformat=true

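Hypernyms are easy to experiment with through NLTK's WordNet interface; a rough sketch:

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time corpus download
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]    # first sense of "dog"
print(dog.hypernyms())        # more general terms (e.g. canine)
print(dog.hyponyms()[:3])     # more specific terms
```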

solomonxie commented 2 years ago

Study: PageRank & Inverted Index for Search Engine

Inverted Index

REF: https://en.wikipedia.org/wiki/Inverted_index

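A minimal inverted index maps each word to the set of documents containing it; as a sketch:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "search engines index pages", 2: "pages link to pages"}
index = build_inverted_index(docs)
print(index["pages"])   # {1, 2}
```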

PageRank

REF: https://en.wikipedia.org/wiki/PageRank

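A compact power-iteration sketch of the PageRank recurrence PR(p) = (1-d)/N + d * Σ PR(q)/L(q), summed over pages q linking to p (dangling pages are ignored here, a simplification):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns page -> rank."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                # Each page shares its rank equally among its outlinks.
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```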

solomonxie commented 2 years ago

Study: SSR - Server Side Rendering

So we can't just fetch the target page itself, because these days most page content is loaded asynchronously via Ajax. And we also don't want to use a headless browser for this, because it's too slow and memory-consuming.

That's where we introduce Server-Side Rendering (SSR), which renders a given page along with all its Ajax requests, so we get a page with its full content.

There are some popular 3rd-party tools for SSR:

REF: https://youtu.be/BKZxZwUgL3Y?t=726
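A sketch of fetching a rendered page through such a tool, assuming a Rendertron-style prerender service running locally on port 3000 (the service choice, port, and endpoint path are all assumptions):

```python
import requests

RENDER_SERVICE = "http://localhost:3000/render/"   # assumed local endpoint

def fetch_rendered(url):
    """Ask the SSR service to load the page, run its Ajax requests,
    and hand back the fully rendered HTML."""
    resp = requests.get(RENDER_SERVICE + url, timeout=30)
    resp.raise_for_status()
    return resp.text

html = fetch_rendered("https://example.com")
print(html[:200])
```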