
Globus outages #1740

Closed by carolyncole 2 months ago

carolyncole commented 6 months ago

Commands as of 04-30-2024

Run during the outage when the endpoint is not responsive: 

curl -vk --resolve f0ad1.36fe.data.globus.org:443:127.0.0.1 https://f0ad1.36fe.data.globus.org/api/info
curl -vk --resolve f0ad1.36fe.data.globus.org:443:44.197.15.236 https://f0ad1.36fe.data.globus.org/api/info
sudo --user=www-data curl --unix-socket /run/gcs_manager.sock http://f0ad1.36fe.data.globus.org/api/info
ls -Zlah /run/gcs_manager.sock
systemctl -l --no-pager status gcs_manager.socket
systemctl -l --no-pager status gcs_manager.service
systemctl -l --no-pager status apache2.service
systemctl -l --no-pager status gcs_manager_assistant.service
systemctl -l --no-pager status globus-gridftp-server.service
sudo netstat -tpn | grep -i grid
sudo free -m
sudo dmesg -T
sar -A -f /var/log/sysstat/sa$(date '+%d')
sar -A -f /var/log/sysstat/sa$(date -d "yesterday" '+%d')
sar -A -f /var/log/sysstat/sa15
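
If it helps, the output of those diagnostics can be captured into a single file to attach to the Globus ticket with something like the following (a sketch; the filename and the subset of commands shown are arbitrary):

{
  curl -vk --resolve f0ad1.36fe.data.globus.org:443:127.0.0.1 https://f0ad1.36fe.data.globus.org/api/info
  systemctl -l --no-pager status gcs_manager.socket gcs_manager.service apache2.service globus-gridftp-server.service
  sudo netstat -tpn | grep -i grid
  sudo free -m
  sudo dmesg -T
} > globus-outage-$(date '+%F-%H%M').txt 2>&1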

Run after the outage when the endpoint is back up (sudo /sbin/reboot brings the system back up): 

journalctl --no-pager --since="1 day ago" --unit gcs_manager.socket
journalctl --no-pager --since="1 day ago" --unit gcs_manager.service
journalctl --no-pager --since="1 day ago" --unit apache2.service
journalctl --no-pager --since="1 day ago" --unit gcs_manager_assistant.service
journalctl --no-pager --since="1 day ago" --unit globus-gridftp-server.service 

Send the current logs from /var/log/apache2/ (other_vhosts_access, access.log, error.log) and from /var/log/globus-connect-server/gcs-manager (gcs.log, gridftp.log). Note: send all of the logs for the day of the incident.
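
One way to bundle those logs for the day of the incident (a sketch; adjust the paths or filenames if they differ on the host):

sudo tar czf globus-logs-$(date '+%F').tar.gz \
  /var/log/apache2/other_vhosts_access.log \
  /var/log/apache2/access.log \
  /var/log/apache2/error.log \
  /var/log/globus-connect-server/gcs-manager/gcs.log \
  /var/log/globus-connect-server/gcs-manager/gridftp.log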

Historical commands before 04-30-2024

When the error occurs again, run the following commands and send the output to Globus:

curl -vk --resolve f0ad1.36fe.data.globus.org:443:127.0.0.1 https://f0ad1.36fe.data.globus.org/api/info
curl -vk --resolve f0ad1.36fe.data.globus.org:443:44.197.15.236 https://f0ad1.36fe.data.globus.org/api/info
systemctl -l --no-pager status gcs_manager.service
systemctl -l --no-pager status apache2.service
journalctl --no-pager --unit gcs_manager.service
journalctl --no-pager --unit apache2.service

They also want the logs

-We'll also want to get the gcs.log file and GridFTP log file that covers the event.

-We'll further want to get the Apache logs covering the event - we explain where to find these here:

https://docs.globus.org/globus-connect-server/v5.4/troubleshooting-guide/#globus-logging-locations

I turned on debug logging according to their instructions.


Full Email response


Ticket #383270: Our Globus connect server continues to hang  

http://support.globus.org/hc/requests/383270

Daniel Powers, Mar 5, 2024, 14:29 CST:
Hi Carolyn,

I see.

That could point to an issue with httpd, the GCS Manager service, or even networking.

We'll want to get more information to try to narrow down the scope. We'll first want to enable debug logging, as discussed in our doc here:

https://docs.globus.org/globus-connect-server/v5.4/troubleshooting-guide/#obtaining-debug-log-events

We'll want to reproduce the behavior - if you have a means to reproduce - or wait for it to reoccur. While the behavior is occurring, we'll want to get the output of the following commands run on the pdc-globus-prod-postcuration system hosting the endpoint:

curl -vk --resolve f0ad1.36fe.data.globus.org:443:127.0.0.1 https://f0ad1.36fe.data.globus.org/api/info
curl -vk --resolve f0ad1.36fe.data.globus.org:443:44.197.15.236 https://f0ad1.36fe.data.globus.org/api/info
systemctl -l --no-pager status gcs_manager.service
systemctl -l --no-pager status apache2.service
journalctl --no-pager --unit gcs_manager.service
journalctl --no-pager --unit apache2.service

-We'll also want to get the gcs.log file and GridFTP log file that covers the event.

-We'll further want to get the Apache logs covering the event - we explain where to find these here:

https://docs.globus.org/globus-connect-server/v5.4/troubleshooting-guide/#globus-logging-locations

Please let us know what you find.

-Regards

Dan Powers
hectorcorrea commented 6 months ago

See also https://github.com/pulibrary/princeton_ansible/issues/4719

carolyncole commented 4 months ago

Running today...

journalctl --no-pager --since="2024-4-28" --unit gcs_manager.service
journalctl --no-pager --since="2024-4-28" --unit apache2.service
journalctl --no-pager --since="2024-4-28" --unit gcs_manager_assistant.service
journalctl --no-pager --since="2024-4-28" --unit globus-gridftp-server.service
carolyncole commented 4 months ago

We are thinking of adding a second EC2 instance for our Globus endpoint so that we can have failover. It should also give us more stability. Look into load balancing with AWS and/or load balancing on Globus for the endpoint. This should be looked into in June or July 2024. For now we will continue to work with Globus support.

carolyncole commented 2 months ago

We just got some feedback from the developers


Do you expect the majority of traffic to be via the HTTPS interface?  That is what I see -- only a few tasks from the Globus service.  Additionally, since the collection allows public/anonymous access, there are bursts of traffic from web crawlers.

From the logs, it looks like there may be a delay in cleaning up a local transfer process after the transfer is complete, or perhaps when it is cancelled by the client. These processes build up (especially with the load from crawlers), and eventually apache cannot start new ones and you see the hanging behavior. This is unexpected and probably a bug in our S3 connector -- I'll try to reproduce and figure out a solution.

To resolve this in the short term without a reboot, you should be able to shut down apache2 and globus-gridftp-server, ensure that additional globus-gridftp-server processes are killed (it is safe to killall -9 if any remain at that point), wait ~2 minutes for network queues to time out, and then restart both services.    The gcs_manager* processes aren't involved at this point.

You may also want to add a [robots.txt](https://search.gov/indexing/robotstxt.html) file to the root of your collection to try to limit the crawlers -- this of course relies on the crawlers being well-behaved, but I see in your logs that they are at least checking for that file.

Mike
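
Put together, the short-term recovery sequence Mike describes might look roughly like this (a sketch, assuming the same systemd unit names used elsewhere in this issue):

sudo systemctl stop apache2 globus-gridftp-server
sudo killall -9 globus-gridftp-server   # safe at this point if any processes remain
sleep 120                               # give network queues ~2 minutes to time out
sudo systemctl start globus-gridftp-server apache2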
hectorcorrea commented 2 months ago

Ran these commands just now to restart it:

sudo systemctl restart apache2
sudo systemctl restart globus-gridftp-server

(didn't collect any stats since I was on my way out)

bess commented 2 months ago

We believe this has been addressed by:

  1. Globus believes they have a bug in the AWS connector and is addressing it
  2. In the meantime @kayiwa has put monit in place to restart after crashes
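
For reference, a minimal sketch of what such monit checks might look like (the actual configuration @kayiwa deployed may differ; the pid-file path, ports, and config location here are assumptions):

sudo tee /etc/monit/conf.d/globus >/dev/null <<'EOF'
check process apache2 with pidfile /var/run/apache2/apache2.pid
  start program = "/bin/systemctl start apache2"
  stop program  = "/bin/systemctl stop apache2"
  if failed port 443 then restart

check process globus-gridftp-server matching "globus-gridftp-server"
  start program = "/bin/systemctl start globus-gridftp-server"
  stop program  = "/bin/systemctl stop globus-gridftp-server"
  if failed port 2811 then restart
EOF
sudo monit reload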