carolyncole closed this issue 5 months ago
Running today...
journalctl --no-pager --since="2024-4-28" --unit gcs_manager.service
journalctl --no-pager --since="2024-4-28" --unit apache2.service
journalctl --no-pager --since="2024-4-28" --unit gcs_manager_assistant.service
journalctl --no-pager --since="2024-4-28" --unit globus-gridftp-server.service
We are thinking of adding a second EC2 instance for our Globus endpoint so that we can have failover. It should also give us more stability. We should look into load balancing with AWS and/or load balancing on Globus for the endpoint. This should be looked into in June or July 2024. For now we will continue to work with Globus support.
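If we do try the AWS load-balancer route, the AWS side would look roughly like the sketch below (a Network Load Balancer in front of the current instance plus a second one). Every name and ID here is a placeholder, and whether a Globus endpoint can sit behind an NLB at all is exactly what we need to confirm with Globus support first:

```bash
# Placeholder IDs and names throughout; this is only the AWS half of the idea.
aws elbv2 create-target-group \
    --name globus-endpoint-tg \
    --protocol TCP --port 443 \
    --vpc-id vpc-0123456789abcdef0 \
    --target-type instance

aws elbv2 register-targets \
    --target-group-arn <target-group-arn> \
    --targets Id=i-currentendpoint Id=i-secondendpoint

aws elbv2 create-load-balancer \
    --name globus-endpoint-nlb \
    --type network \
    --subnets subnet-aaaa1111 subnet-bbbb2222

aws elbv2 create-listener \
    --load-balancer-arn <nlb-arn> \
    --protocol TCP --port 443 \
    --default-actions Type=forward,TargetGroupArn=<target-group-arn>
```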
We just got some feedback from the developers:
Do you expect the majority of traffic to be via the HTTPS interface? That is what I see -- only a few tasks from the Globus service. Additionally, since the collection allows public/anonymous access, there are bursts of traffic from web crawlers.
From the logs, it looks like there may be a delay in cleaning up a local transfer process after the transfer is complete, or perhaps when it is cancelled by the client. These processes build up (especially with the load from crawlers), and eventually apache cannot start new ones and you see the hanging behavior. This is unexpected and probably a bug in our S3 connector -- I'll try to reproduce and figure out a solution.
To resolve this in the short term without a reboot, you should be able to shut down apache2 and globus-gridftp-server, ensure that additional globus-gridftp-server processes are killed (it is safe to killall -9 if any remain at that point), wait ~2 minutes for network queues to time out, and then restart both services. The gcs_manager* processes aren't involved at this point.
You may also want to add a [robots.txt](https://search.gov/indexing/robotstxt.html) file to the root of your collection to try to limit the crawlers -- this of course relies on the crawlers being well-behaved, but I see in your logs that they are at least checking for that file.
Mike
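A minimal sketch of that short-term recovery procedure as a script, so it is on hand next time (service and process names are the ones from this thread; the 2-minute wait is the one Mike suggests):

```bash
#!/bin/bash
# Short-term recovery without a reboot, per the guidance above.

# 1. Stop the web front end and the GridFTP server.
sudo systemctl stop apache2
sudo systemctl stop globus-gridftp-server

# 2. Kill any globus-gridftp-server processes that outlived the stop
#    (Mike notes killall -9 is safe at this point).
sudo killall -9 globus-gridftp-server 2>/dev/null || true

# 3. Wait ~2 minutes for network queues to time out.
sleep 120

# 4. Bring both services back.
sudo systemctl start globus-gridftp-server
sudo systemctl start apache2
```

And for the robots.txt suggestion, something like this dropped at the collection root could be a starting point (the path is a placeholder, and blocking everything assumes we are fine with no crawling of the collection at all):

```bash
# Placeholder path -- replace with the actual root of the collection.
sudo tee /path/to/collection/root/robots.txt > /dev/null <<'EOF'
User-agent: *
Disallow: /
EOF
```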
Ran these commands just now to restart it:
sudo systemctl restart apache2
sudo systemctl restart globus-gridftp-server
(didn't collect any stats since I was on my way out)
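For the next occurrence, a few quick checks could be captured before restarting, just to have numbers to attach to the ticket (standard tooling only, nothing Globus-specific assumed):

```bash
# How many gridftp processes have piled up, and how long each has been running.
ps -eo pid,etime,cmd | grep '[g]lobus-gridftp-server'

# Apache state and overall socket counts at the time of the hang.
systemctl status apache2 --no-pager
ss -s
```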
We believe this has been addressed by:
Commands as of 04-30-2024

Run during the outage when the endpoint is not responsive:

Run after the outage when the endpoint is back up (sudo /sbin/reboot brings the system back up):
Send current logs [other_vhosts_access, access.log, error.log] from /var/log/apache2/ and [gcs.log, gridftp.log] from /var/log/globus-connect-server/gcs-manager. Note: Send all of the logs for the day of the incident.

Historical commands before 04-30-2024
When the error occurs again, run the following commands and send the output to Globus.
They also want the logs.
I turned on debug logging according to their instructions.
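For reference, a rough sketch of bundling the logs listed above for Globus support (the date variable and the exact filenames under those directories are assumptions to adjust before running):

```bash
# Assumed incident date -- change to the day of the outage.
DAY=2024-04-30

# Paths are the ones listed above in this issue.
sudo tar czf "globus-incident-${DAY}.tar.gz" \
    /var/log/apache2/other_vhosts_access.log* \
    /var/log/apache2/access.log* \
    /var/log/apache2/error.log* \
    /var/log/globus-connect-server/gcs-manager/gcs.log* \
    /var/log/globus-connect-server/gcs-manager/gridftp.log*
```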
Full Email response