ioos / ioosngdac

IOOS National Glider Data Assembly Center (V2)
https://ioos.github.io/ioosngdac/
8 stars 18 forks source link

ERDDAP availability issues #168

Closed kerfoot closed 1 year ago

kerfoot commented 1 year ago

I've received a few (6 or so) cron error emails over the last 24 hours for the script I've been running hourly to fetch the status page and use it as a proxy for server availability.

I looked in my log files and found the following html response codes:

408 429

429 is Too Many Requests, so I'm not super concerned about that one. 408 is Request Timeout, which is a bit more concerning. I suggest that we just keep an eye on these errors and address solutions if they persist of become more numerous.

Not sure if this is related, but the Minor Load Datasets events are taking > 1 hour to complete. See the #load-datasets-events-time-distributions on the ERDDAP status page.

I don't think the discussion @kerfoot and @benjwadams had with Bob Simons regarding the subsetVariables loading is causing this long load time. I loaded (on dev) both the original data set:

https://dev-gliders.ioos.us/erddap/tabledap/maracoos_01-20200820T1459.subset

and the "fixed" data set:

https://dev-gliders.ioos.us/erddap/tabledap/maracoos_01_v2-20200820T1459.subset

and the load times for both were pretty much identical.

Nevertheless, there seems to be an issue with the data sets contained on the production public erddap that is resulting in significantly long load times, especially when restarting the server. We need to continue to try and diagnose the issue so that restarts do not prevent data set access for hours.

kerfoot commented 1 year ago

I just received an email from a DAC ERDDAP user saying that their request was rejected due to a 429 error (Too many requests).

This is becoming an issue that requires immediate response.

benjwadams commented 1 year ago

Newer ERDDAPs ships with a feature that is enabled by default that prevents too many requests to help with DDoSing. I will check logs, but I suspect that this is due to that. We are also reverse proxying requests, so ERDDAP might be seeing the same IP for the reverse proxies. I'll check logs, but I suspect that this might be related.

benjwadams commented 1 year ago

There is currently very heavy I/O load on the main server, which will slow processing of requests. I'm looking into why there is such tremendous I/O on the server, in addition to the 429 and other HTTP errors encountered.

benjwadams commented 1 year ago

408:

72014816 FAILURE. TIME=120069ms (>10s!)

*** sendErrorCode 408 for request #72014816: /erddap/tabledap/sg249-20220712T1922.csv?time,latitude,longitude,profile_id,wmo_id java.util.concurrent.TimeoutException: Timeout waiting for your other requests to process. Please make just one request at a time.

benjwadams commented 1 year ago

The previous log.txt archival got clobbered because there are so many logs being written right now. However, 429 was returning against a Rutgers IP of 165.230.166.171, indicating too many requests were being made:

nslookup 165.230.166.171
171.166.230.165.in-addr.arpa    canonical name = 171.166.230.165.reverse.marine.rutgers.edu.
171.166.230.165.reverse.marine.rutgers.edu      name = arctic-new.marine.rutgers.edu.

Authoritative answers can be found from:

This rules out the possibility that reverse proxy IPs were the cause of this issue.

benjwadams commented 1 year ago

Found a relevant excerpt for the 429 with a Rutgers IP:

{{{{#72042611 2022-09-21T22:00:16+00:00 (notLoggedIn) 54.85.102.46 GET /erddap/tabledap/ng412-20200714T1210.dods?s.profile%5fid
{{{{#72042612 2022-09-21T22:00:16+00:00 (notLoggedIn) 165.230.166.171 GET /erddap/search/advanced.csv?page=1&itemsPerPage=1000000.0&protocol=tabledap&cdm_data_type=(ANY)&institution=(ANY)&ioos_category=(ANY)&keywords=(ANY)&long_name=(ANY)&standard_name=(ANY)&variableName=(ANY)&minLon=(ANY)&maxLon=(ANY)&minLat=(ANY)&maxLat=(ANY)
OutputStreamFromHttpResponse charEncoding=, encoding=identity, ng412-20200714T1210_b761_edc3_a0eb, .dods
ng412-20200714T1210.handleViaFixedOrSubsetVariables got the data from the distinctSubset file.  time=0ms
}}}}#72042611 54.85.102.46 SUCCESS. TIME=0ms

{{{{#72042613 2022-09-21T22:00:17+00:00 (notLoggedIn) 104.211.42.252 GET /erddap/tabledap/ce_382-20180907T0000-delayed.html
OutputStreamFromHttpResponse charEncoding=UTF-8, encoding=gzip, ce_382-20180907T0000-delayed_5ace_8896_cba2, .html
}}}}#72042613 104.211.42.252 SUCCESS. TIME=51ms

{{{{#72042614 2022-09-21T22:00:17+00:00 (notLoggedIn) 165.230.166.171 GET /erddap/search/advanced.csv?page=1&itemsPerPage=1000000.0&protocol=tabledap&cdm_data_type=(ANY)&institution=(ANY)&ioos_category=(ANY)&keywords=(ANY)&long_name=(ANY)&standard_name=(ANY)&variableName=(ANY)&minLon=(ANY)&maxLon=(ANY)&minLat=(ANY)&maxLat=(ANY)
{{{{#72042615 2022-09-21T22:00:17+00:00 (notLoggedIn) 165.230.166.171 GET /erddap/search/advanced.csv?page=1&itemsPerPage=1000000.0&protocol=tabledap&cdm_data_type=(ANY)&institution=(ANY)&ioos_category=(ANY)&keywords=(ANY)&long_name=(ANY)&standard_name=(ANY)&variableName=(ANY)&minLon=-90.0&maxLon=-15.5&minLat=0.0&maxLat=48.0&minTime=1654041600.0&maxTime=1669766400.0
*** lowSendError for request #72042612: isCommitted=false fullMessage=
Error {
    code=429;
    message="Too Many Requests: Please make just one request at a time.";
}

*** lowSendError for request #72042615: isCommitted=false fullMessage=
Error {
    code=429;
    message="Too Many Requests: Please make just one request at a time.";
}

@BobSimons, is this part of DoS protection that certain IPs can be allowlisted and this will go away? @kerfoot, are the scripts Rutgers is waiting for the responses or running asynchronously? The latter may cause this to happen under conditions of heavy load, which the server is currently experiencing.

benjwadams commented 1 year ago

Yeah, here's the section in setup.xml I was thinking of:

<ipAddressMaxRequests></ipAddressMaxRequests>  <!-- current default=7 -->                                                                                                                                                                                                       
<ipAddressMaxRequestsActive></ipAddressMaxRequestsActive>  <!-- current default=2 -->                                                                                                                                                                                           
<ipAddressUnlimited></ipAdressUnlimited>

Any way to change this and reload the config without requiring a restart?

BobSimons commented 1 year ago

All three of the ipAddress... tags belong in datasets.xml, not setup.xml. Changes to their values are applied every time ERDDAP reads datasets.xml -- you don't need to restart ERDDAP.

kerfoot commented 1 year ago

Update:

@benjwadams can fill in the details, but today (2022-09-22) the system was shut down and the AWS instance was upgraded. This will result in a change in operating costs but, so far, it seems worth it. @benjwadams : what is the name/class of the new instance?

ERDDAP (v2.18, needs to be upgraded) was restarted. We're still not sure why (potentially the cleaning up of dataset subsetVariables), but ERDDAP restarted in ~5 minutes. This is a significant decrease in previous start up times which were > 1 hr.

Thanks to @benjwadams for implementing the changes and @BobSimons for the suggestions.

I'm going to close this issue and will continue to closely monitor things...