openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0
130 stars 37 forks

Die if the upstream server is not reachable #142

Open ghost opened 3 years ago

ghost commented 3 years ago

@kevinmcmurtrie commented on Feb 15, 2021, 5:27 AM UTC:

Problem

Scraper was rapidly hitting http://aleph.gutenberg.org while it was refusing TCP/IP connections. This resembles a DoS attack and could result in clients being blacklisted.

https://farm.openzim.org/pipeline/806bba557688d39ae7189206/debug

Reproducing steps

N/A.

This issue was moved by kelson42 from openzim/zimfarm#607.

ghost commented 3 years ago

@kevinmcmurtrie commented on Feb 25, 2021, 7:31 AM UTC:

It happened again: https://farm.openzim.org/pipeline/97fbc40e1f235962f06ae206/debug

I disabled the gutenberg scraper on pixelmemory.

ghost commented 3 years ago

@kelson42 commented on Feb 25, 2021, 9:46 AM UTC:

Scraper was rapidly hitting http://aleph.gutenberg.org while it was refusing TCP/IP connections.

What makes you say that aleph.gutenberg.org refuses TCP/IP connections? What makes you believe this is a problem in the scraper? How should the scraper behave differently?

ghost commented 3 years ago

@kevinmcmurtrie commented on Feb 25, 2021, 5:04 PM UTC:

Scraper was rapidly hitting http://aleph.gutenberg.org while it was refusing TCP/IP connections.

What makes you say that aleph.gutenberg.org refuses TCP/IP connections?

Over 1 million of these: Connecting to aleph.gutenberg.org (aleph.gutenberg.org)|65.50.255.20|:80... failed: Connection refused.

What makes you believe this is a problem in the scraper? How should the scraper behave differently?

The scraper is retrying at an extremely high rate that is going to trigger automated denial-of-service detection. 1141028 connection attempts were made and refused in a short period of time. Retries need to be throttled to a sane rate.

kelson42 commented 3 years ago

@eshellman Has the configuration of aleph.gutenberg.org changed? Do you have an idea what could be done?

eshellman commented 3 years ago

The service was repaired on the 21st; these error logs seem to be from the 20th. @kevinmcmurtrie Could you verify that there is still a problem?

kelson42 commented 3 years ago

@eshellman Thx, I plan to close the ticket as the problem does not occur anymore.

kevinmcmurtrie commented 3 years ago

Was retry throttling added or are you just hoping that aleph.gutenberg.org doesn't go down any more? I'm concerned about triggering DoS detection at my own ISP so I'm not going to run it as long as it potentially has this issue.

kelson42 commented 3 years ago

The questions are:

kevinmcmurtrie commented 3 years ago

If it could wait 2 seconds between retries of socket errors, that would be enough for me to let it run again.

eshellman commented 3 years ago

That seems like a good idea.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 1 year ago

If slowing down is good enough, then it would be great to implement it.

eriol commented 1 year ago

If adding an external dependency is not a problem, the retrying¹ library is an easy way to accomplish it: it already has exponential backoff as one of the possible strategies.

¹ https://pypi.org/project/retrying/
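
For illustration, a minimal sketch of how the retrying decorator could wrap a download call (the fetch function, the exception filter and the timing values are assumptions, not the scraper's actual code):

    import requests
    from retrying import retry

    def _is_connection_error(exc):
        # only retry network-level failures such as refused connections,
        # not HTTP errors like a 404
        return isinstance(exc, requests.exceptions.ConnectionError)

    @retry(
        retry_on_exception=_is_connection_error,
        wait_exponential_multiplier=1000,  # wait 2^n * 1000 ms between attempts
        wait_exponential_max=10000,        # capped at 10 seconds
        stop_max_attempt_number=5,         # then give up instead of looping forever
    )
    def fetch(url):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.content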

rimu commented 1 year ago

zimscraperlib does the file download, in download.py (def save_large_file). It uses wget with parameters that make it retry 5 times, including when connections are refused:

    subprocess.run(
        [
            WGET_BINARY,
            "-t",
            "5",
            "--retry-connrefused",
            "--random-wait",
            "-O",
            str(fpath),
            "-c",
            url,
        ],
        check=True,
    )

If we remove --retry-connrefused then a refused connection will only be tried once, not 5 times. So the problem will remain, but it will be only 20% as big.

There is another parameter, --waitretry=5, which makes wget wait between retries of a failed download (linear backoff: 1 second after the first failure, increasing up to 5 seconds).
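
For illustration, a sketch of the change being discussed (not the actual zimscraperlib code; WGET_BINARY, fpath and url are as in the snippet above):

    import subprocess

    # same wget invocation, but a refused connection is no longer retried
    # (--retry-connrefused removed) and wget waits between retries of other
    # failures (--waitretry=5: linear backoff, up to 5 seconds)
    subprocess.run(
        [
            WGET_BINARY,
            "-t",
            "5",
            "--waitretry=5",
            "--random-wait",
            "-O",
            str(fpath),
            "-c",
            url,
        ],
        check=True,
    )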

rimu commented 1 year ago

Also, we could insert some delays in gutenberg/download.py. For example:

                if not download_file(url, zpath):
                    logger.error("ZIP file download failed: {}".format(zpath))
                    continue

becomes

                if not download_file(url, zpath):
                    logger.error("ZIP file download failed: {}".format(zpath))
                    time.sleep(1)  # requires "import time" at the top of the module
                    continue

rimu commented 1 year ago

I've made a PR on python-scraperlib, someone please take a look :)

rgaudin commented 1 year ago

I think retrying is probably the way to go here. That's what we do on other scrapers. We use backoff, but retrying seems like a better choice.

benoit74 commented 1 year ago

Could you explain why retrying seems better than backoff (just to share understanding)? Did someone save the debug logs mentioned in this PR? I would like to have a look into it; I don't get why it could have performed 1 million requests in a short timeframe.

kevinmcmurtrie commented 1 year ago

Could you explain why retrying seems better than backoff (just to share understanding)? Did someone save the debug logs mentioned in this PR? I would like to have a look into it; I don't get why it could have performed 1 million requests in a short timeframe.

It looks like the logs expire.

The scraper had the content list but the host with the content http://aleph.gutenberg.org/ was immediately refusing connections.

Connecting to aleph.gutenberg.org (aleph.gutenberg.org)|65.50.255.20|:80... failed: Connection refused.

The scraper was making article count * retries requests at full speed on a low-latency connection. It wasn't delaying retries or giving up after many failures. That's how it quickly had 1 million refused connections. It might have been more if I hadn't killed it.

It's important to me that Kiwix scrapers never look like a DoS attack or bad bot. When AT&T (my host), Akamai (edge cache), Cloudflare (edge cache), and various IDS see activity like this, it causes my network to end up in private blocklists. These private blocklists are a pain in the ass to debug and resolve. Stuff just stops working.

It's OK if the scraper retries a lot then quits or retries slowly. It just can't retry at full-throttle until it has gone through the whole content list.

Project Gutenberg had a similar issue recently where they accidentally set the wrong IPv6 DNS address. Luckily it didn't route to anything so it was a very slow failure.

benoit74 commented 1 year ago

This is very weird because it does not match my understanding of the wget behavior. Either the documentation is not up to date with the code, something weird is happening, or I just don't get it right (and my favorite option is the third one).

Totally aligned with your requirements, even if I cannot ensure that it will always be feasible / work as expected. For the iFixit scraper I implemented the following behavior for every API call (but not everywhere; there are a few web crawling calls, and this caused you an issue as well, as far as I remember):

Could be done probably quite easily for Gutenberg scraper as well.

I will try to reproduce the current behavior locally to get a better understanding and ensure that my changes are really enhancing the situation.

rgaudin commented 1 year ago

@benoit74 I believe the problem is not in the calls themselves but in the fact that they are treated independently, blindly.

We are using a single target host and we have more than 130K resources for English alone.

If we only look at a request's response from its own resource's perspective (just to retry it), then we'll still hit the same failing server once per resource, times the retry policy... It's more of an architecture issue than a request-call one.

Code should understand that the run is compromised by the server's current status and halt itself.
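
As an illustration of that idea (a sketch only, not the scraper's actual code; all names below are made up), the download workers could share a lock-protected health tracker that aborts the whole run after too many consecutive connection failures:

    import threading

    class ServerHealth:
        """Shared between download threads; signals an abort when the
        upstream server looks down (too many consecutive failures)."""

        def __init__(self, max_consecutive_failures=20):
            self._lock = threading.Lock()
            self._failures = 0
            self._max = max_consecutive_failures
            self.abort = threading.Event()

        def record_success(self):
            with self._lock:
                self._failures = 0

        def record_failure(self):
            with self._lock:
                self._failures += 1
                if self._failures >= self._max:
                    # the server is most likely down: tell every worker to stop
                    self.abort.set()

Each worker would check abort.is_set() before starting a new download and report the outcome of every attempt; the main process can then clean up temporary files and exit with an error instead of hammering the server for the rest of the content list.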

rimu commented 1 year ago

@rgaudin Yes, good point!

I have been experimenting with tracking overall success and failure of multiple download attempts in a dictionary. The downloads happen in concurrent threads so it is necessary to share a variable between them.

Here is my code: https://github.com/rimu/openzim-gutenberg/commit/fae7d9aa1e4fceec43469baf8440ff3730986192.

What do you think?

eriol commented 1 year ago

@rimu just a quick note: although using CPython we know that the GIL will serialize access to the dict (so it will not be corrupted) if we update the same key from 2 thread the result is undefined if we don't synchronize the threads.
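
For illustration, a minimal sketch of that synchronization (the counts dict and its keys are assumptions, mirroring the idea in the linked commit):

    import threading

    counts = {"success": 0, "failure": 0}
    counts_lock = threading.Lock()

    def record(outcome):
        # counts[outcome] += 1 is a read-modify-write: without the lock,
        # two threads can interleave and one increment gets lost
        with counts_lock:
            counts[outcome] += 1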

kevinmcmurtrie commented 1 year ago

Really, really, really, really, really, really, shouldn't be using Python for multi-threaded work. I've seen companies try to use multi-threaded Python at large scales and it's a never-ending disaster because the language itself doesn't support it and external libraries can only pretend to support it.

Go or Java would be a far better choice if you want fairly high-level features and an architecture that was always designed for elegant multi-threading. You'd no longer need Redis or DNS caches. You'd no longer have problems with cloud cache latency, deadlocks, races, GIL locks, spinloops, etc. I'd even pitch in coding effort if it's Java.

rimu commented 1 year ago

@eriol yes, I wondered if that would be a problem. In this case I don't think it matters if the counts of success or failure are a little wrong sometimes; we're not flying to the moon - the only thing that can happen is a few extra requests before the script terminates. But I can easily add some locking if necessary.

@kevinmcmurtrie I haven't added any threading to the architecture, it was already there when I arrived (and working well, presumably). All that is new is the sharing of a new variable between threads.

rgaudin commented 1 year ago

@kevinmcmurtrie, your input is important and duly noted; it's not the first time you're sharing this with us. While it's a highly important point for our process, changing stacks is not a light decision, and this input would be one among a multitude of other criteria and constraints. So I'm moving this out of the scope of this particular ticket to the general openZIM strategy.

On @rimu's code, I think the approach we discussed here would be:

The key is to keep shared state (using a lock) so that we can gracefully shut the process down, removing the temp files we've created.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.