searchmysite / searchmysite.net

searchmysite.net is an open source search engine and search as a service
GNU Affero General Public License v3.0

Indexing: Add an incremental reindex (only indexing new items) #34

Closed m-i-l closed 2 years ago

m-i-l commented 3 years ago

As per the post "searchmysite.net: The delicate matter of the bill", one of the less desirable "features" of the searchmysite.net model is that it burns up a lot of money indexing sites on a regular basis even if no-one is actually using the system. It would therefore be good to try to reduce indexing costs.

One idea is to only reindex sites and/or pages which have been updated. It doesn't look like there is a reliable way of doing this though (for example, only around 45% of pages in the system currently return a Last-Modified header), so there may need to be some "good enough" only-if-probably-modified approach.

For the only-if-probably-modified approach, one idea may be to store the entire home page in Solr, and at the start of reindexing a site compare the last home page with the new home page: if they are different, proceed with reindexing that site, and if they are the same, do not reindex it. There are some issues with this. If the page has some auto-generated text which changes on each page load, e.g. a timestamp, it will always register as different even when the content hasn't really changed, and conversely there may be pages within the site which have been updated even if the home page hasn't changed at all. It might therefore be safest to have, e.g., a weekly only-if-probably-modified reindex and a monthly reindex-everything-regardless (i.e. the current) approach as a fail-safe.
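
In rough terms the check could look something like this (a sketch only: the stored home page HTML would come from Solr, and requests stands in for whatever HTTP client the indexer uses):

    import hashlib

    import requests  # assumption: any HTTP client would do

    def home_page_changed(home_url, previous_html):
        """Return True if the home page looks different from the stored copy.

        Note: auto-generated text (e.g. a timestamp) will make this return True
        even when nothing meaningful has changed, hence the periodic full
        reindex as a fail-safe.
        """
        current_html = requests.get(home_url, timeout=30).text
        previous_hash = hashlib.sha256(previous_html.encode("utf-8")).hexdigest()
        current_hash = hashlib.sha256(current_html.encode("utf-8")).hexdigest()
        return previous_hash != current_hash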

ScootRay commented 3 years ago

I wonder if polling their RSS feed would be a solution? I realize not all sites offer RSS feeds, but it seems the majority do, and for those that have one you can poll the RSS feed instead of crawling. It should also solve the issue of knowing what's been updated or not, since RSS takes care of that.
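
For example, something like this with feedparser (an untested sketch; last_indexed would be a time.struct_time of the last index):

    import feedparser

    def updated_entries(feed_url, last_indexed):
        """Return links from the feed that are newer than last_indexed."""
        feed = feedparser.parse(feed_url)
        new_links = []
        for entry in feed.entries:
            # feedparser exposes parsed dates as time.struct_time values
            updated = entry.get("updated_parsed") or entry.get("published_parsed")
            if updated and updated > last_indexed:
                new_links.append(entry.link)
        return new_links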

Ray

m-i-l commented 3 years ago

Yes, looking for changes in the RSS feed and/or sitemap sounds like a good idea.

I'm using Scrapy's generic CrawlSpider, and could continue using that for the less frequent "reindex everything regardless".

However, there is a SitemapSpider class, and it can be extended to only yield selected items to index via a user-defined sitemap_filter, so that could potentially compare the sitemap dates against the dates of previously indexed items.
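
A minimal sketch of that idea, assuming a previously_indexed lookup of url -> last indexed datetime is built before the crawl (names are illustrative, and real sitemap lastmod values may need more robust date parsing):

    from datetime import datetime
    from scrapy.spiders import SitemapSpider

    class IncrementalSitemapSpider(SitemapSpider):
        name = 'incremental_sitemap'
        sitemap_urls = ['https://www.michael-lewis.com/sitemap.xml']

        # hypothetical: url -> datetime of last index, looked up before the crawl
        previously_indexed = {}

        def sitemap_filter(self, entries):
            for entry in entries:
                lastmod = entry.get('lastmod')  # may be absent
                last_indexed = self.previously_indexed.get(entry['loc'])
                if not lastmod or not last_indexed:
                    yield entry  # unknown, so index it to be safe
                elif datetime.fromisoformat(lastmod) > last_indexed:
                    yield entry  # modified since the last index
                # otherwise skip: unchanged since the last index

        def parse(self, response):
            yield {'url': response.url}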

It would need a bit of a rethink of the whole approach to indexing though, so not trivial.

m-i-l commented 2 years ago

I've renamed this to reflect the focus on the slightly simpler aim of determining whether pages within a site should be reindexed, rather than the broader question of whether the site itself should be reindexed. I've also split off the "Crawl from sitemap and/or RSS" thread to a separate issue #54 .

Two options to improve the efficiency of the current spider:

All the info required for indexing a site should be passed into the CrawlerRunner at the start of indexing for a site, so that no further database or Solr lookups are required during indexing of a site. At the moment, all the config for indexing all sites is passed into the CrawlerRunner via common_config, and all the config for indexing a site is passed into the CrawlerRunner via site_config, but there is nothing passed in for page level config. Suggest a one-off lookup to get a list of urls with page_last_modified and/or etag values, and pass that into the CrawlerRunner via site_config or maybe even a new page_config.

The Solr query to get all pages on the michael-lewis.com domain with a page_last_modified set is /solr/content/select?q=*:*&fq=domain:michael-lewis.com&fq=page_last_modified:*
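
A sketch of the one-off lookup that could build a url -> page_last_modified dict from that query (the Solr URL and the url field name are assumptions):

    import requests

    SOLR_SELECT_URL = 'http://localhost:8983/solr/content/select'  # assumption

    def get_page_last_modified(domain):
        # Return {url: page_last_modified} for pages already indexed for the domain
        params = {
            'q': '*:*',
            'fq': ['domain:{}'.format(domain), 'page_last_modified:*'],
            'fl': 'url,page_last_modified',
            'rows': 10000,
            'wt': 'json',
        }
        response = requests.get(SOLR_SELECT_URL, params=params)
        docs = response.json()['response']['docs']
        return {doc['url']: doc['page_last_modified'] for doc in docs}

This could then be passed into the CrawlerRunner via site_config (or a new page_config) before the crawl starts.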

A check with the scrapy shell:

from scrapy import Request
req = Request('https://www.michael-lewis.com/posts/london-to-orkney-and-most-of-the-nc500-in-an-electric-car/', headers={"If-Modified-Since":"Fri, 25 Mar 2022 20:22:12 GMT"})
fetch(req)

Suggests a 304 response is returned with no content. Within the SearchMySiteSpider, adding the following:

            headers = {"If-Modified-Since":"Fri, 25 Mar 2022 20:22:12 GMT"}
            yield Request(url, headers=headers)

gets a:

[scrapy.core.engine] DEBUG: Crawled (304) <GET https://michael-lewis.com/> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <304 https://michael-lewis.com/>: HTTP status code is not handled or not allowed
[scrapy.core.engine] INFO: Closing spider (finished)

This indicates that the 304 response will be skipped, but the issue is that it means links on that page are not spidered either, so if the home page is not indexed then nothing else will be either. At present no internal links are recorded within the Solr entry so they can't be pulled from there. A workaround could be to not set the If-Modified-Since header where is_home=true. And of course if combined with #54 the impact would be minimised. The issue then would be that you would have to keep track of the skipped pages, so they wouldn't be deleted in the self.solr.delete(q='domain:{}'.format(spider.domain)) in close_spider, although that would mean there would still be the risk of useful but unreachable pages being deleted. You could start with all the pages in the current collection, but that would carry the risk of missing out on new pages, which is an important feature.
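
A sketch of that workaround when building requests (assuming page_last_modified holds the HTTP-date strings from the Solr lookup above, and home_page is the site's home page URL):

    from scrapy import Request

    def build_request(url, home_page, page_last_modified):
        """Build a Request, adding If-Modified-Since except for the home page."""
        # Never send If-Modified-Since for the home page, so a 304 there can't
        # stop the rest of the site from being spidered.
        last_modified = page_last_modified.get(url)  # from the Solr lookup
        if last_modified and url != home_page:
            return Request(url, headers={'If-Modified-Since': last_modified})
        return Request(url)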

All of which brings us back to the idea of doing a not completely robust "incremental" reindex such as that described here more often, and the current robust "full" reindex less often, which would be a bigger change.

BTW, Scrapy does have a HttpCacheMiddleware with RFC2616 policy which could potentially simplify this implementation. However, by default it would cache all the requests on the file system, which could lead to a significant increase in disk space requirements, and hence cost. You can implement your own storage backend, so in theory you might be able to use Solr, although I suspect the format that the data is stored in would be different.
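
For reference, enabling that would look roughly like this in the Scrapy settings (the Solr-backed storage class is hypothetical):

    # settings.py (sketch): enable the RFC2616 caching policy
    HTTPCACHE_ENABLED = True
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
    # The default storage writes every response to disk:
    HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    # A Solr-backed storage would be a custom class implementing
    # retrieve_response() and store_response(), e.g. (hypothetical):
    # HTTPCACHE_STORAGE = 'indexer.cache.SolrCacheStorage'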

sbw commented 2 years ago

I'm learning a lot from following your thoughts on this.

I like the idea of a Solr query before starting the crawl to gather all pages on the site with Last-Modified (or ETag) values. I'm thinking a scrapy custom storage backend would deal with only those values and would not store the page content at all. I didn't get the impression the custom backend imposes a data format, just a retrieve_response method whose result is recognizable as "this page is unchanged" (an empty body?).

You pointed out that'll make scrapy not crawl links in pages that respond 304. I think that's OK: The Solr query could return all pages, not just those with cache headers. As the scrapy crawl progresses, just remove each visited page from the list. Then invoke scrapy again on one of the pages that was not visited. Repeat until the list is empty.

Alas, for a site like mine that responds 304 to most pages, that'd mean you must invoke scrapy many times for each crawl. That'll introduce some overhead. (Unless you can submit a list of URLs to the initial scrapy invocation?)

I assume your code removes pages from the Solr index if they respond 404. Add code to leave the page unchanged in the Solr index on a 304 response.

Sorry to be talking so much without having absorbed all of the code. I may be off base as a result.

m-i-l commented 2 years ago

I'm currently letting Scrapy do the web crawling work, i.e. recursively requesting pages and extracting links, keeping track of what it has already requested and what it still has to do, handling deduplication, etc. Each crawl starts afresh, resulting in a relatively clean list of all the unique pages it has found (e.g. 404s aren't indexed). I then simply delete all the existing pages in the search index and insert the new ones to ensure the index is clean and up-to-date with no stale pages left behind. This keeps it fairly simple.

The risk with a new incremental index is that it could end up requiring a lot of custom web crawling code, e.g. to keep track of what was indexed this time, what was indexed last time, what needs to be added to the index, what needs to be removed from the index, etc., and even then with all the extra work it may result in a less clean index.

It could still be useful doing this extra work though, maybe not to improve the efficiency of the indexing as much, but more to check for added and updated pages more frequently than it does at the moment. For example, the more regular incremental index could work off its own list of pages to index rather than recursively spidering, to:

Then the less regular full reindex could:

A middle ground may be to investigate using the HttpCacheMiddleware with RFC2616 policy - that might be simpler to implement and good enough.

Seirdy commented 2 years ago

Some existing solutions:

Google supports WebSub, which is also a critical part of the IndieWeb. This way, updates can be pushed to Search My Site in real-time and no polling is needed.

Bing and Yandex use IndexNow. You don't have to participate in the IndexNow initiative but you could support the same API so people could re-use existing tools.

Finally, there's the option of polling Atom, RSS, and h-feeds. This should probably not be the default behavior, but something that authenticated users could opt in to.

m-i-l commented 2 years ago

@Seirdy, many thanks for your suggestions. I would like to investigate WebSub and IndexNow further, but it isn't a priority because many of the smaller static sites (which are the main audience) don't support them, and there isn't a pressing need for the latest up-to-the-minute content.

For now, my plan is to have two sorts of indexing: (i) the current periodic full reindex (spidering the whole site to ensure all content is indexed, that moved/deleted pages are removed etc.), and (ii) a new much more frequent incremental index (just requesting the home page, RSS, and sitemap and indexing new/updated links it finds on those). I've got as far as identifying the RSS and sitemaps (see also #54 ), but I don't think I'll get the chance to fully implement the incremental reindex for another month or two.

m-i-l commented 2 years ago

I've written and deployed the code to perform an incremental index, but haven't had the chance yet to write the code to trigger the incremental index from the scheduler, so it is there but not in use.

Both types of indexing use the same SearchMySiteSpider with the indexing type passed in via the site config, and it has been implemented with out-of-the-box scrapy extensions (with the exception of the actual feed parsing which uses feedparser), to keep things relatively simple. In summary:

Full reindex:

  1. Start on home page, and web feed (RSS/Atom) if present (see also #54 for info on how the crawling from a web feed has been implemented).
  2. Get all links from the start page(s).
  3. Follow these links to index pages and recursively look for new links on the domain (until the page limit or time limit is reached).
  4. On load into Solr, delete all existing pages on the domain, to clean up moved and deleted content.

Incremental reindex:

  1. As full reindex step 1.
  2. As full reindex step 2, except: only get new links, i.e. links from the start page(s) which aren't already in the index.
  3. Do not follow links to look for new links, i.e. only index the set of new links identified in step 2.
  4. As full reindex step 4, except: do not delete existing pages.
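
A simplified sketch of how the two behaviours can share one spider (the full_index flag and existing_urls set are illustrative names rather than the actual site config keys):

    from urllib.parse import urlparse

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class SearchMySiteSpiderSketch(scrapy.Spider):
        name = 'searchmysite_sketch'

        def __init__(self, start_url, full_index=True, existing_urls=None, **kwargs):
            super().__init__(**kwargs)
            self.start_urls = [start_url]
            self.full_index = full_index                 # True = full, False = incremental
            self.existing_urls = existing_urls or set()  # urls already in the index
            self.link_extractor = LinkExtractor(allow_domains=[urlparse(start_url).netloc])

        def parse(self, response):
            yield {'url': response.url}  # index this page
            for link in self.link_extractor.extract_links(response):
                if self.full_index:
                    # Full reindex: keep following links recursively
                    yield response.follow(link, callback=self.parse)
                elif link.url not in self.existing_urls:
                    # Incremental: only fetch new links, and don't spider beyond them
                    yield response.follow(link, callback=self.parse_new_page)

        def parse_new_page(self, response):
            yield {'url': response.url}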

When the scheduling is implemented, the plan is for:

This should make the list of pages which are in web feeds reasonably up-to-date, which will be useful for some of the functionality spun off from #71 .

m-i-l commented 2 years ago

This has now been fully implemented. The full reindexing frequency and incremental reindexing frequency are shown at https://searchmysite.net/admin/add/ . Once it has settled down and the impact of the more frequent indexing is clear, the hope is to get incremental reindexing for basic and free trial listings to every 7 days, and for full listings to every day.

It is implemented via new database fields which store both the last_index_completed time and the last_full_index_completed time, and the indexing type (full or incremental) is determined with a

CASE
    WHEN NOW() - d.last_full_index_completed > d.full_reindex_frequency THEN TRUE
    WHEN NOW() - d.last_index_completed > d.incremental_reindex_frequency THEN FALSE
END AS full_index

where full_index is TRUE for a full reindex and FALSE for an incremental reindex (if both would be triggered the full index is picked first).
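
For illustration, the scheduler's selection could look something like this (connection details and the tblDomains table name are assumptions rather than the actual schema):

    import psycopg2

    sql = """
        SELECT d.domain,
               CASE
                   WHEN NOW() - d.last_full_index_completed > d.full_reindex_frequency THEN TRUE
                   WHEN NOW() - d.last_index_completed > d.incremental_reindex_frequency THEN FALSE
               END AS full_index
          FROM tblDomains d
         WHERE NOW() - d.last_full_index_completed > d.full_reindex_frequency
            OR NOW() - d.last_index_completed > d.incremental_reindex_frequency;
    """
    connection = psycopg2.connect('dbname=searchmysitedb user=postgres')  # assumption
    with connection, connection.cursor() as cursor:
        cursor.execute(sql)
        for domain, full_index in cursor.fetchall():
            # full_index: TRUE = full reindex, FALSE = incremental reindex
            print(domain, 'full' if full_index else 'incremental')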

Note that the current indexing_page_limit may look like it is limiting the usefulness of the incremental reindex for basic listings, because most basic listings have hit the indexing_page_limit so will not perform an incremental reindex. Even if each incremental reindex were allowed to go slightly over the limit, that is likely just to add some random pages rather than newly added pages, because the pages already in the index probably won't be the most recent (the spider does not find links in a particular order, e.g. newest links first, and sometimes even detects links in a different order between runs, because there are many threads running concurrently).