sciencehistory / scihist_digicoll

Science History Institute Digital Collections

reduce logging bytes: by reducing crawls? #1306

Closed jrochkind closed 3 years ago

jrochkind commented 3 years ago

Our app is exceeding our papertrail plan's logging capacity (250MB), when just a few weeks ago it wasn't.

What's changed? Mainly just that it is logging a lot more events. I think this is probably from bots/crawlers, although we don't have the info to be sure at present.

Since papertrail archival files come in one-hour blocks, I started out just comparing two one-hour samples: one from when our log sizes were much smaller, one from more recently when we were exceeding papertrail capacity. I made sure to pick hours before papertrail's end-of-day limits kicked in and stopped logging.

Monday Aug 23 10am ET (smaller logs), and Monday Sep 6 10am ET (big logs).

8123 total requests vs 20393 total requests -- a lot more requests! What are they?

Top requests:

Aug 23:

0036: /catalog.html
0060: /hirefire/9d023e59-b030-4ccd-a9db-389556f74044/info
0101: /catalog/range_limit_panel/year_facet_isim
0101: /catalog/opensearch.xml
0143: /catalog/range_limit
0144: /
0247: /catalog/facet/subject_facet
0662: /admin/queues/overview.poll
4115: /catalog

Sep 6:

0059: /focus/water/facet
0060: /hirefire/9d023e59-b030-4ccd-a9db-389556f74044/info
0127: /catalog/facet/creator_facet
0138: /
0159: /collections/4x51hj94t
0340: /focus/women-and-science
0376: /catalog/range_limit_panel/year_facet_isim
0777: /focus/water
0905: /catalog/facet/subject_facet
2129: /catalog/range_limit
12716: /catalog
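
For reference, counts like these can be pulled from a papertrail archive file (they're gzipped TSVs, one hour per file) with a short script. A rough sketch, assuming Heroku-router-style lines containing path="..." -- not necessarily the exact script used for the numbers above:

```ruby
# Rough sketch: tally request paths in a one-hour papertrail archive file.
# Assumes gzipped archive input and Heroku router lines with path="...".
require "zlib"

counts = Hash.new(0)

Zlib::GzipReader.open(ARGV.first) do |gz|
  gz.each_line do |line|
    if (match = line.match(/path="([^"?]+)/))
      counts[match[1]] += 1 # count by path, ignoring any query string
    end
  end
end

counts.sort_by { |_path, count| count }.each do |path, count|
  puts format("%04d: %s", count, path)
end
```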

OK, the extra hits are mostly to /catalog endpoint, our search results.

What do the /catalog requests look like?

Do they have a text query? Facets? Drilling down into just the larger Sep 6 set: of the 12716 /catalog hits, only 240 have a q param. 12463 (almost all of them) have facet params but no q text query.
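
The q-vs-facet breakdown can come from the same archive file. Another rough sketch under the same assumptions (the actual analysis may well have been done differently):

```ruby
# Rough sketch: of the /catalog hits, how many have a text query (q param)
# vs. only facet limits (f[...] params)? Same path="..." assumption as above.
require "zlib"
require "cgi"

total = with_q = facet_no_q = 0

Zlib::GzipReader.open(ARGV.first) do |gz|
  gz.each_line do |line|
    match = line.match(%r{path="/catalog(\?[^"]*)?"})
    next unless match

    total += 1
    # CGI.parse unescapes keys, so f%5Bsubject_facet%5D... becomes f[subject_facet]...
    params = CGI.parse((match[1] || "").delete_prefix("?"))
    has_q     = params["q"].any? { |v| !v.to_s.empty? }
    has_facet = params.keys.any? { |key| key.start_with?("f[") }

    with_q     += 1 if has_q
    facet_no_q += 1 if has_facet && !has_q
  end
end

puts "total /catalog: #{total}  with q: #{with_q}  facet but no q: #{facet_no_q}"
```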

From this I'm guessing they are mostly crawler bots that are just crawling all our pages by following every possible facet combination in the search results -- not really a great use of the crawl.

We aren't currently logging user-agent, so we can't be sure they are crawlers, but this is my guess: crawler traffic just went up a lot in the past few weeks, and MOST of our traffic is now crawler traffic.

Suggested mitigations

  1. Log user agents. Unfortunately this will itself increase the bytes in our log files! Instead of logging the entire user-agent string -- for historical reasons these are REALLY BIG, like Mozilla/5.0 (Linux; Android 5.0; SM-G920A) AppleWebKit (KHTML, like Gecko) Chrome Mobile Safari (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) -- to save space we can use a tool to parse the user-agent and log a more compact representation like GoogleBot/Android-5/Safari (see the sketch after this list).

  2. Use robots.txt to suppress some of the especially useless and easily suppressable URLs, like the /catalog/facet ("more facets") links and /catalog/range_limit.

  3. Add "nofollow" to facet links, hoping this discourages bots from following them all. There is mixed info on whether it will, but it seems worth a try. If it doesn't work, we could try being even more aggressive in robots.txt, but I don't necessarily want to exclude all results pages from ever being indexed, just keep crawlers from following every possible combination of facet limits!
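
For item 1, here's a rough sketch of the kind of compact user-agent logging meant -- the device_detector gem and the exact output format are illustrative, not a decision:

```ruby
# Sketch: log a compact user-agent summary rather than the full string.
# The device_detector gem is one option for parsing UA strings, not necessarily
# what we'll actually use.
require "device_detector"

def compact_user_agent(ua)
  return "none" if ua.nil? || ua.strip.empty?

  detected = DeviceDetector.new(ua)
  if detected.bot?
    "bot:#{detected.bot_name}"                               # e.g. "bot:Googlebot"
  else
    [detected.name, detected.os_name, detected.os_full_version]
      .compact.join("/")                                     # e.g. "Chrome Mobile/Android/5.0"
  end
end

# In Rails this could feed request logging via something like lograge's
# custom_payload hook, roughly:
#
#   config.lograge.custom_payload do |controller|
#     { user_agent: compact_user_agent(controller.request.user_agent) }
#   end
```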

Things considered but not for now:

jrochkind commented 3 years ago

Trying to get nofollow on our facet nav links with a PR to blacklight: https://github.com/projectblacklight/blacklight/pull/2486
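
For reference, the end state we're after is facet value links on results pages carrying a nofollow hint, something like this (illustrative markup only, not Blacklight's exact output):

```html
<!-- illustrative only; actual Blacklight facet link markup differs -->
<a rel="nofollow" href="/catalog?f%5Bsubject_facet%5D%5B%5D=Chemistry">Chemistry</a>
```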

But hard to predict when/if we can get that merged and released to BL.

Our papertrail logging has already been paused for the day for excessive bytes, and I only deployed the user-agent logging this morning (which probably increased log size).

Tomorrow I"ll analyze logs with user agents... if we find that a huge portion of our /catalog requests really are bots, will consider trying to keep them out with robots.txt immediately, while waiting for nofollow solution. I prefer nofollow solution if possible, we dont' really want to ban all search results pages from ever being in (eg) Google results, we just want to let crawlers know it's pretty useless to crawl them all from the facet nav links!

jrochkind commented 3 years ago

It's hard to compare apples to apples here (especially because papertrail archived logs take ~4-6 hours to appear, and we just deployed some changes yesterday).

But looking at logs from 9/9 8AM ET (latest available), it looks to me like:

Now that we're logging user-agents, let's look at what proportion of /catalog result requests look like bots...

I think it probably does make sense to prohibit search results in robots.txt, while waiting for nofollow in BL. Maybe prohibit anything in /catalog with f%5B in it, i.e. any results with a facet limit? I think this is possible in robots.txt using its very simple pattern matching.
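
Something along these lines in robots.txt would express that (illustrative only, not necessarily the exact rules we'd deploy; the * wildcard is an extension honored by major crawlers like Googlebot rather than part of the original robots.txt standard):

```
# Illustrative robots.txt additions, not necessarily the exact deployed rules
User-agent: *
Disallow: /catalog/facet/        # "more facets" pages
Disallow: /catalog/range_limit   # range-limit partials
Disallow: /catalog?*f%5B         # any search result URL with a facet limit (f[...] param)
```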

eddierubeiz commented 3 years ago

Prohibiting search results in robots.txt sounds great to me.

jrochkind commented 3 years ago

A more restrictive robots.txt attempting to prohibit search results with facet limits was deployed approx 1pm September 9th -- noting that here for interpreting future logs and whether it had an effect. Spiders can take a while to pick up a new robots.txt -- or not: the one we deployed yesterday morning seemed to take effect pretty quickly, and we stopped getting those requests.

We'll see how things look over the weekend.

jrochkind commented 3 years ago

OK, after the robots.txt change to try to exclude crawlers from crawling every possible facet-limit combo, our traffic has indeed dropped drastically.

On Sep 12 for the hour beginning 10AM ET,

Total requests: 4743 (23% of the same hour on Sep 6).

Top requests:

0005: /focus/women-and-science
0009: /collections/wh246s128
0010: /
0013: /oai
0014: /robots.txt
0020: /collections/gt54kn818
0152: /catalog

With a very long tail of requests to individual work pages, as well as image downloads like /downloads/pk02cb25z/download_small -- I think we're fine with crawlers crawling those!

Of the /catalog requests, only 74 (~50%) have no 'q' but do have 'facet' limits. (Of those with a q, all but one are the Honeybadger Uptime checker searching for "chemistry").

Of the /catalog requests, 49 (~33%) are identified as bots -- but we may have other bot requests not successfully identified as bots; the Honeybadger uptime checker, for instance, isn't identified as one.

I do still see a few catalog-facet-limit requests sneaking through -- ones that have only a range limit and no other facet limit use a different URL query param, and aren't caught by our recent robots.txt exclusions. I'm going to add an exclusion for that one too, since I'm noticing it.

But generally, this is looking good, and I think it has achieved the objective -- we should not have a problem with papertrail rate limits. Papertrail is currently only showing about ~50MB a day of usage; our limit is 250MB. I think it still makes sense to keep the current papertrail plan ($33/month): the cost is reasonable, it gives us buffer to handle any spikes, and it's the cheapest plan with 14-day online retention.

jrochkind commented 3 years ago

Oh, and for that hour Sep 12 10am, 1811 of 4743 total requests (38%) were identified as bots. The actual number is probably higher than that (some bots are missed); it would require more analysis to say how much higher.

jrochkind commented 3 years ago

Yep, I'm going to call this Issue "done".