solomonxie opened 2 years ago
Steps:
Seed:
The first link from which we start to scrape the whole internet. Wikipedia is an easy place to start.
From the first link we pull all the info and links, and from the links we pull further info and links, so on and so forth.
For each language, we build an unbounded table: a rectangular matrix with the same list of words as both its row headers and its column headers.
STEP ONE: Page to index. We first pick out all the words in the page, in no particular order, then iterate over each word: find that word's row, and add the link-id to this word-level pool.
Then we iterate to the next word while staying at the same row: we find the column for this word, giving us a cell as an intersection pool, and add the link-id to this cell-pool. Then we read the next word and do the same, until we have finished all the words.
Then we take another page and repeat the same process over and over.
Now we have a whole table, in which each cell has a pool of links.
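The indexing step above can be sketched roughly as follows. The pool names and the word-set simplification are assumptions for illustration, not a definitive implementation:

```python
from collections import defaultdict

# Hypothetical sketch of STEP ONE: for one page (identified by link_id),
# add the link to every word's row-level pool, and to every
# (row word, column word) intersection cell-pool.
row_pool = defaultdict(set)    # word -> pool of link ids
cell_pool = defaultdict(set)   # (row word, column word) -> pool of link ids

def index_page(link_id, words):
    unique_words = set(words)  # "no order" -- we only track presence here
    for row_word in unique_words:
        row_pool[row_word].add(link_id)
        for col_word in unique_words:
            if col_word != row_word:
                cell_pool[(row_word, col_word)].add(link_id)

index_page("link-1", ["search", "engine", "index"])
index_page("link-2", ["search", "crawler"])
```

Repeating `index_page` for every scraped page fills the table cell by cell.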
When the user types the first keyword, we find that keyword's row and get the row-level pool. When the user types the second keyword, we stay at the same row and mark the intersection; a third keyword marks another intersection in the same row. When the user has finished typing keywords, we take all the marked cells and compute the intersection of all the marked pools, which yields a list of link-ids.
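The query step could look something like this sketch (the pool layout and toy data are assumptions, matching the row/cell pools described above):

```python
# Hypothetical sketch of the query step: the first keyword selects the row,
# each further keyword marks an intersection cell in that row, and the
# result is the intersection of all marked pools.
def query(keywords, row_pool, cell_pool):
    first, rest = keywords[0], keywords[1:]
    result = set(row_pool.get(first, set()))          # row-level pool
    for kw in rest:
        result &= cell_pool.get((first, kw), set())   # marked intersection cell
    return result

# Toy pools for illustration
row_pool = {"search": {"link-1", "link-2"}}
cell_pool = {("search", "engine"): {"link-1"}}
query(["search", "engine"], row_pool, cell_pool)      # -> {"link-1"}
```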
That list can be huge, so we need pagination, and a way to decide which links are displayed first. Now we need a PageRank strategy: the dimensional position of each word.
Each page then has a multidimensional position, with as many dimensions as there are rows/columns. User input actually specifies partial dimension info; what we need to do is find the nearest pages around those coordinates.
Imagine the user gives x, y, z, but we have dimensions a-z. What we can do is first get all points having the same x, y, z values, then yield the higher-PageRanked pages first.
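A minimal sketch of that partial-dimension lookup, assuming each page carries a coordinate per dimension and a precomputed rank score (the page records and field names here are illustrative):

```python
# Hypothetical sketch: the user's keywords fix only some dimensions
# (e.g. x, y, z out of a-z); we keep pages matching those values exactly
# and return the higher-ranked pages first.
def nearest_pages(pages, partial_coords, limit=10):
    matches = [
        p for p in pages
        if all(p["coords"].get(dim) == val for dim, val in partial_coords.items())
    ]
    matches.sort(key=lambda p: p["rank"], reverse=True)  # higher rank first
    return matches[:limit]

pages = [
    {"url": "a", "coords": {"x": 1, "y": 2, "z": 3, "w": 9}, "rank": 0.7},
    {"url": "b", "coords": {"x": 1, "y": 2, "z": 3, "w": 4}, "rank": 0.9},
    {"url": "c", "coords": {"x": 1, "y": 5, "z": 3, "w": 4}, "rank": 0.8},
]
nearest_pages(pages, {"x": 1, "y": 2, "z": 3})  # -> pages "b" then "a"
```

A true "nearest" search would also tolerate close-but-not-equal coordinates; exact matching is the simplest starting point.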
For each iteration of page parsing, we should update its PageRank.
By the way, instead of adding a count of 1 for each intersection, we can introduce a concept of "word distance" to weight the count: the word just before the target word gets a distance of -1, the word just after gets +1, and the next word after that gets +2.
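One way to turn those signed distances into weights is sketched below. The issue leaves the exact weighting open; the 1/|distance| formula here is an assumption:

```python
from collections import defaultdict

# Hypothetical sketch of distance-weighted counting: each neighbour of the
# target word gets a signed distance (-1 just before, +1 just after, +2
# next, ...), and we assume closer words contribute more weight, here
# 1 / |distance|.
def weighted_counts(words, target):
    weights = defaultdict(float)
    for pos, w in enumerate(words):
        if w != target:
            continue
        for i, neighbour in enumerate(words):
            distance = i - pos          # negative before, positive after
            if distance != 0:
                weights[neighbour] += 1.0 / abs(distance)
    return dict(weights)

weighted_counts(["fast", "search", "engine"], "search")
# "fast" (distance -1) and "engine" (distance +1) each get weight 1.0
```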
With this indexing algorithm, we theoretically don't need to store any page content; everything is calculated on the fly. All we need to store is the index of URLs.
REF: https://en.wikipedia.org/wiki/Word_embedding?oldformat=true
REF: https://en.wikipedia.org/wiki/Word2vec?oldformat=true
REF: https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy?oldformat=true
REF: https://en.wikipedia.org/wiki/Inverted_index
REF: https://en.wikipedia.org/wiki/PageRank
So we can't just fetch the target page itself, because these days most page content is loaded asynchronously via Ajax. We also don't want to use a headless browser for this, because it's too slow and memory-consuming.
That's where we introduce Server-Side Rendering (SSR), which renders a given page, including all its Ajax requests, to produce a page with full content.
There are some popular 3rd party tools for the SSR:
Gatsby: https://www.gatsbyjs.com/
Next.js: https://nextjs.org/docs/basic-features/pages
Project Draft
Project vision: This search engine is NOT a general search engine like Google, but an in-depth domain search engine.
Features
TODO:
Inventory
Inventory:
Architecture | Tech Stack
Stack:
Anti-scrape Strategy
It would take a lot of effort to implement a proxy strategy for scraping, so in this project we will use a single IP for scraping.
We need a domain-aware scraping scheduler, to avoid sending too many requests to the same domain.
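A domain-aware scheduler could be sketched as below: since everything goes out from a single IP, it enforces a minimum delay between requests to the same domain. The class name and delay value are assumptions for illustration:

```python
import time
from collections import deque
from urllib.parse import urlparse

# Hypothetical sketch: queue URLs, and only hand out a URL whose domain
# has not been hit within the last `min_delay` seconds.
class DomainAwareScheduler:
    def __init__(self, min_delay=5.0):
        self.min_delay = min_delay
        self.last_hit = {}        # domain -> timestamp of last request
        self.queue = deque()

    def add(self, url):
        self.queue.append(url)

    def next_url(self, now=None):
        """Return the first queued URL whose domain is not on cooldown."""
        now = time.monotonic() if now is None else now
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            domain = urlparse(url).netloc
            if now - self.last_hit.get(domain, float("-inf")) >= self.min_delay:
                self.last_hit[domain] = now
                return url
            self.queue.append(url)  # domain still cooling down; retry later
        return None
```

The `now` parameter makes the cooldown testable; in production you would call `next_url()` with no argument and let it read the monotonic clock.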