microsimulation / ijm

A central place for general issues, documents, scripts and resources for the IJM
https://microsimulation.org/ijm/

Website down #160

Closed by pbronka 1 year ago

pbronka commented 1 year ago

Hi @BlueReZZ

The journal's website has been down since Friday with the 504 Bad Gateway error. I don't know how to check the status of the server, and I don't think I have access to the server hosting the website. Could you help us investigate this?

Thank you, Patryk

pbronka commented 1 year ago

I found the EC2 instance in the AWS dashboard. The problem seems to be on the AWS side, as both status checks are failing. The instance is also scheduled to terminate in 13 days due to degraded hardware.

I tried rebooting the instance from the AWS console, but it didn't help. This link (https://aws.amazon.com/premiumsupport/knowledge-center/ec2-windows-system-status-check-fail/) suggests that stopping and starting the instance would move it to a different server and possibly fix both issues, but I'm not sure whether that would cause problems elsewhere, e.g. because the IP address changes. Do you know if it would be OK to stop and start this EC2 instance?

Edit: I tried stopping the instance and was able to start it again, but the website still doesn't respond.
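
For anyone checking the site from outside the server, a minimal Python sketch along these lines can confirm whether the site answers at all and whether the failure looks like a gateway-type error. The names `classify_status` and `check_site` are illustrative and not part of the IJM repository, and only the standard library is used.

```python
import urllib.error
import urllib.request


def classify_status(code: int) -> str:
    """Map an HTTP status code to a coarse health label."""
    if 200 <= code < 400:
        return "up"
    if code in (502, 503, 504):
        # Typically returned by a proxy or load balancer when the
        # application behind it is not answering.
        return "gateway error"
    return "error"


def check_site(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> str:
    """Request `url` once and report a coarse health label.

    `opener` exists only so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return classify_status(err.code)
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return "unreachable"
```

Calling `check_site("https://microsimulation.org/ijm/")` returns one of the labels above. A "gateway error" result means something is still answering in front of the application, whereas "unreachable" points at the network or the machine itself.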

BlueReZZ commented 1 year ago

Hi Patryk,

I'll speak to @gnott about this when he's online and will take a look myself to see if there's anything obvious.

Paul

BlueReZZ commented 1 year ago

Hi @pbronka,

We spent time this afternoon looking into the problem and found that it goes deeper than we thought. We tried to look at the logs on the server itself but could not connect. Thinking that a new deployment might help, we attempted one, but the deployment could not connect to the server either. This suggests we may need to recreate the infrastructure completely, which is not something @thewilkybarkid and I have done before. Helpfully, @erkannt is back in the office tomorrow and may be able to help with this.

At the moment there's not much more we can do, so we will revisit this tomorrow with @erkannt's assistance.

Paul

pbronka commented 1 year ago

Thank you very much for looking into this, please let me know if we can be useful in any way.

BlueReZZ commented 1 year ago

Hi @pbronka,

The website is back up and running as of 11:10am UTC this morning.

The problem seemed to be with the application on the server, but we were not able to connect to the server to ascertain the exact cause. Restarting the machines assigned a new IP address, as you'd mentioned, so our initial attempts at redeploying failed because they still targeted the old address. We didn't expect this, since the infrastructure code suggests that the IP address is static. Once we found the correct IP address and added it to the GitHub secrets that the deployment Action uses to find the right server, the deployment was able to upload a clean version of the website, and this has solved the problem.
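
The stale-address failure described above can be checked for before redeploying. The following is a rough Python sketch, not the project's actual tooling: it compares the address the deployment is configured with against the instance's current public address (both supplied by hand, for example copied from the GitHub secret and from the AWS console), and then tests whether that address accepts connections. The function names are hypothetical, and the default port 22 assumes the deployment connects over SSH, which the thread does not confirm.

```python
import ipaddress
import socket


def same_address(configured: str, current: str) -> bool:
    """Return True if both strings denote the same IP address."""
    return ipaddress.ip_address(configured.strip()) == ipaddress.ip_address(current.strip())


def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(configured: str, current: str, port: int = 22) -> str:
    """Report whether the deployment target looks stale or unreachable.

    Port 22 is an assumption (SSH-based deployment); pass another
    port if the deployment connects differently.
    """
    if not same_address(configured, current):
        return f"stale: deployment targets {configured}, instance is at {current}"
    if not port_open(current, port):
        return f"unreachable: {current}:{port} does not accept connections"
    return "ok"
```

A "stale" result means the stored address needs updating before a deployment can succeed. An "unreachable" result means the address matches but the machine is not accepting connections on that port.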

Thanks to @erkannt and @thewilkybarkid for their help, the website appears to be working normally.

We suspect the cause may have been the degraded machine that you were emailed about. The remediation for that (rebooting the machine) didn't bring the website back up as expected, so it was the redeployment that fixed it, once we were actually able to deploy.

Paul

pbronka commented 1 year ago

Thank you very much. If you outlined the steps you followed to fix this, do you think we could handle it ourselves if it happened again in the future?

BlueReZZ commented 1 year ago

Yes, I thought I should write up the steps in the existing infrastructure documentation. It should not be too difficult to do with the credentials you have.

Paul

