nanoc / nanoc

A powerful web publishing system
http://nanoc.app
MIT License

:external_links checker sends too many requests to the same hosts #1418

Open da2x opened 5 years ago

da2x commented 5 years ago

Steps to reproduce

  1. Create some items that link multiple times to different websites. (item1 links ten times to example.com, item2 links ten times to example.org, etc.)
  2. Run nanoc check elinks

Actual behavior

Nanoc sends requests in the order the links were discovered and ends up sending multiple requests to the same origin in a short amount of time. This can also happen when multiple items frequently link to the same website.

Results in HTTP 429 Too Many Requests rate-limiting responses.

Expected behavior

Nanoc should avoid sending multiple requests to the same origin in a row. Introducing a delay between requests is undesirable, but shuffling the collected links from all items in a way that spreads out requests to the same origins would be great. Not sure how best to do that, though.

denisdefreyne commented 5 years ago

I think shuffling links (with Array#shuffle) would be a good first step.

An HTTP 429 Too Many Requests response could include a Retry-After header, which Nanoc could use to figure out how long to wait before retrying the request.
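A rough sketch of both ideas, assuming the checker ends up with a flat list of URLs and uses plain Net::HTTP (this is not what the checker actually does internally, and it assumes Retry-After carries a value in seconds):

require "net/http"
require "uri"

# Check links in random order so consecutive requests rarely hit the
# same origin (Array#shuffle, as suggested above).
def check_all(urls)
  urls.shuffle.each { |url| check(url) }
end

# On a 429, wait for the number of seconds given in Retry-After
# (falling back to 1s) and retry a couple of times.
def check(url, retries_left = 2)
  res = Net::HTTP.get_response(URI(url))
  if res.code == "429" && retries_left.positive?
    sleep(res["Retry-After"].to_i.clamp(1, 60))
    check(url, retries_left - 1)
  else
    res
  end
end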

denisdefreyne commented 5 years ago

Alternatively, it might be interesting to let Nanoc delegate to @gjtorikian’s html-proofer. I don’t know how feature-complete its link checker is, but I’m all for reusing rather than rebuilding.

iay commented 5 years ago

I've been having some similar problems with my site; my approach so far has been either to link to sites that are less picky or to disable the check in the configuration. I'm honestly not sure whether, in the cases I've seen, spacing things out would make much difference; the whole process doesn't take all that long.

Longer term, I think it might be worth refactoring the link checker to allow a user-defined strategy to be injected. I don't recall whether the current code ignores duplicates, but that would be the sort of canonical example of what I have in mind.
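For instance, the strategy could just be a callable that reorders (and maybe dedupes) the URL list before any requests go out. Everything below is made up for illustration; it is not the current checker API:

# Hypothetical sketch of an injectable ordering strategy.
class ExternalLinksCheck
  DISCOVERY_ORDER = ->(urls) { urls }              # today's behaviour
  DEDUPED_SHUFFLE = ->(urls) { urls.uniq.shuffle } # ignore duplicates, spread origins

  def initialize(ordering: DISCOVERY_ORDER)
    @ordering = ordering
  end

  def run(urls)
    @ordering.call(urls).each { |url| check(url) }
  end

  private

  def check(url)
    # issue the actual HTTP request here
  end
end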

da2x commented 5 years ago

This strategy should work well. I've got no idea how much it would impact performance, though, and I don't volunteer to implement it. 😛

  1. Sort all links into arrays by origin (see the example below; a rough Ruby sketch follows it)
  2. Pop one link from the first array and check it
  3. Pop one link from the next array and check it
  4. Repeat step 3 until you reach the last array
  5. Go back to step 2
{
  'https://example.com': [
    '/document1',
    '/document2'
  ],
  'http://example.org': [
    '/document3',
  ],
  'https://example.net': [
    '/document4',
    '/document5',
    '/document6'
  ]
}

This should check the links in the following order (the spread improves with more links and origins):

https://example.com/document1
http://example.org/document3
https://example.net/document4
https://example.com/document2
https://example.net/document5
https://example.net/document6
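A minimal Ruby sketch of that round-robin (the method name is made up; it assumes a flat list of absolute URLs):

require "uri"

# Group links by origin, then repeatedly pop one link from each
# non-empty group until every group is empty.
def spread_by_origin(urls)
  groups = urls.group_by { |url| uri = URI(url); "#{uri.scheme}://#{uri.host}" }.values
  ordered = []
  until groups.all?(&:empty?)
    groups.each { |group| ordered << group.shift unless group.empty? }
  end
  ordered
end

Fed the six links from the example hash above, in the order shown there, this returns exactly the ordering listed.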

denisdefreyne commented 5 years ago

Or, with Ruby magic, #zip and #flatten:

[1, 11, 111].zip([2, 22, 222], [3, 33, 333], [4, 44, 444]).flatten
# => [1, 2, 3, 4, 11, 22, 33, 44, 111, 222, 333, 444]

Edit: Ugh, won’t work with arrays of variable length — ignore me.
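For what it's worth, a variant that tolerates variable lengths (a sketch, not from the thread) is to zip from the longest array and drop the nil padding afterwards; compact is safe here because the entries are URLs, never nil:

groups = [[1, 11, 111], [2, 22], [3], [4, 44, 444, 4444]]
longest, *rest = groups.sort_by { |g| -g.size }
longest.zip(*rest).flatten.compact
# => [4, 1, 2, 3, 44, 11, 22, 444, 111, 4444]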

gjtorikian commented 5 years ago

> I don’t know how feature-complete its link checker is, but I’m all for reusing rather than rebuilding.

For this particular problem I can say that it uniqs any repeated items, but treats things like # fragment endings as distinct.

> This strategy should work well.

Yup, internally, this is exactly how html-proofer works.

da2x commented 5 years ago

> Yup, internally, this is exactly how html-proofer works.

It caches test results? For how long? That is really interesting for ensuring good crawler behavior. Is it smart enough to set longer caching times for [2|3|4]xx responses and shorter ones for 5xx?
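(A sketch of that idea only; this is not a description of how html-proofer's cache actually behaves.)

# Hypothetical policy: keep 2xx/3xx/4xx results for 30 days,
# but recheck URLs that last returned a 5xx after one day.
def expired?(entry, now = Time.now)
  ttl = entry[:status] >= 500 ? 24 * 3600 : 30 * 24 * 3600
  now - entry[:checked_at] > ttl
end

cache = { "https://example.com/document1" => { status: 200, checked_at: Time.now } }
cache.reject { |_url, entry| expired?(entry) } # URLs that can be skipped this run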

gjtorikian commented 5 years ago

https://github.com/gjtorikian/html-proofer/#configuring-caching

Hopping off this thread for now -- please redirect html-proofer questions to that repo ✌️