norment / tsd_issues

Repo to track issues with TSD as tickets

/cluster/projects/p697 mount hangs on p697-appn-norment01 #64

Closed ofrei closed 3 years ago

ofrei commented 3 years ago

It seems that the /cluster/projects/p697 mount hangs on p697-appn-norment01. I can use p697-submit to access the data on the cluster. I can ssh to p697-appn-norment01, and I can access tsd/p697/data/durable/. However, I cannot access /cluster/projects/p697 from p697-appn-norment01.
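For anyone who wants to confirm the symptom without freezing their own shell, here is a minimal Python sketch that probes a path with a timeout. The function name `mount_responds` and the 5-second default are illustrative assumptions, not anything provided by TSD; the probe simply runs `ls` in a child process and gives up if it does not finish in time.

```python
import subprocess


def mount_responds(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if `ls path` exits successfully within timeout_s seconds.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    # Run the probe in a child process so a hanging mount blocks the child,
    # not this script.
    proc = subprocess.Popen(
        ["ls", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best-effort cleanup. A process stuck on a hung NFS mount may sit in
        # uninterruptible sleep, so we deliberately do not wait for it here.
        proc.kill()
        return False


if __name__ == "__main__":
    for candidate in ("/cluster/projects/p697", "/tmp"):
        state = "responds" if mount_responds(candidate) else "hangs or is missing"
        print(f"{candidate}: {state}")
```

A `False` result covers both a hang and a path that does not exist, so it is only a first hint, not a diagnosis.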

https://rt.uio.no/SelfService/Display.html?id=4249055

Sandeek commented 3 years ago

Issue resolved, and the incident is logged in https://docs.google.com/forms/d/e/1FAIpQLSfyQtSd3intuKkb5O4hmmPq5UzX6EhuCk95ovNfHULc7DIBKg/viewform

ofrei commented 3 years ago

@Sandeek It seems we have the same issue on p697-appn-norment01. Could you double-check and fix it as before? If it happens again, would it be possible to investigate further? I think Sabry had some insights: either this was related to a lack of space in the /tmp folder, or to something else...

idaElken commented 3 years ago

I have the same issue on p697-appn-norment01.

ofrei commented 3 years ago

@idaElken as a workaround you could use p697-submit or p697-submit2 machines - they work fine for me as of now

idaElken commented 3 years ago

@ofrei Thanks - I'll try that!

Sandeek commented 3 years ago

Hi all,

I will check with Bart to find out the reason.

Best

Sandeep Karthikeyan
Data Engineer
CoE NORMENT, K.G. Jebsen Centre for Psychosis Research
Institute of Clinical Medicine, University of Oslo
Division of Mental Health and Addiction, Oslo University Hospital
www.med.uio.no/norment/english/
Office: Ullevål Hospital, Building 48
Tel: +47 41390032


From: idaElken notifications@github.com Sent: 02 February 2021 15:53 To: norment/tsd_issues Cc: Sandeep Karthikeyan; Mention Subject: Re: [norment/tsd_issues] /cluster/projects/p697 mount hangs on p697-appn-norment01 (#64)

Sandeek commented 3 years ago

Hi all,

The cluster is remounted and available again. Unfortunately, I was not able to figure out the reason for the issue :(

Best

E-Claire commented 3 years ago

This issue is happening for me again.

Sandeek commented 3 years ago

This is really strange. I am working with TSD on this.

Best

From: E-Claire notifications@github.com Sent: 08 February 2021 11:19 To: norment/tsd_issues Cc: Sandeep Karthikeyan; Mention Subject: Re: [norment/tsd_issues] /cluster/projects/p697 mount hangs on p697-appn-norment01 (#64)

Sandeek commented 3 years ago

Dear Claire,

Can you specify your p697 username, and since when you have been facing this issue?

E-Claire commented 3 years ago

My username is p697-elizabethc. I had been working on the appn node and then it suddenly just stopped working. So I tried reconnecting and could access tsd but not the cluster, which is when I replied here.

E-Claire commented 3 years ago

I just tried logging into appn and accessing the cluster now (from the appn node) and I am able to - so super weird that the issue seems really intermittent

Sandeek commented 3 years ago

Hi Claire,

There was a three-minute outage on the p697 cluster mount, hence the problem. Right now, it is not hanging.

Best

From: E-Claire notifications@github.com Sent: 08 February 2021 12:28 To: norment/tsd_issues Cc: Sandeep Karthikeyan; Mention Subject: Re: [norment/tsd_issues] /cluster/projects/p697 mount hangs on p697-appn-norment01 (#64)

E-Claire commented 3 years ago

Okay, thanks for looking into this, Sandeep!!

In case this happens again - is there a certain amount of time that you would recommend waiting before reporting, to see if the outage will just fix itself?
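One way to tell a short outage from a persistent hang is to keep probing for a grace period before reporting. The sketch below is only an illustration: the name `recovers_within`, the 180-second default (borrowed from the three-minute outage mentioned above) and the 15-second interval are assumptions, not a recommendation from TSD. The `check` argument is any function that returns True when the mount is reachable.

```python
import time
from typing import Callable


def recovers_within(
    check: Callable[[], bool],
    grace_s: float = 180.0,    # assumed grace period, not an official value
    interval_s: float = 15.0,  # assumed pause between probes
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it succeeds or `grace_s` seconds have passed.

    Returns True if the check succeeded within the grace period and False if
    it was still failing once the deadline had passed.
    """
    deadline = clock() + grace_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

If this returns False, the outage has outlasted the grace period and is probably worth reporting. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.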

Sandeek commented 3 years ago

This issue has been persisting for some time now, and TSD doesn't have much of a clue about why it occurs so frequently. You can let me know if and when it occurs.

E-Claire commented 3 years ago

Okay, thanks

idaElken commented 3 years ago

Unsure whether this is related, but I now cannot get past login on either p697-appn-norment01.tsd.usit.no or p697-submit.tsd.usit.no.

After apparently logging in successfully, it hangs:

[Screenshot: Screen Shot 2021-02-09 at 09 38 29]

Any advice welcome :-)

idaElken commented 3 years ago

It also hangs when trying to access p697-appn-norment01.tsd.usit.no from VMware and PuTTY.

But apparently this is a known issue (sorry for posting): https://www.uio.no/english/services/it/research/sensitive-data/log/nfs-hangs-on-submit-hosts.html

/Ida

Sandeek commented 3 years ago

Hi Ida,

Can you try now?

Best

From: Ida Sønderby notifications@github.com Sent: 09 February 2021 09:52 To: norment/tsd_issues Cc: Sandeep Karthikeyan; Mention Subject: Re: [norment/tsd_issues] /cluster/projects/p697 mount hangs on p697-appn-norment01 (#64)

ofrei commented 3 years ago

This is resolved (works for me, and it is also reported in the operation log). @Sandeek please add it to https://docs.google.com/forms/d/e/1FAIpQLSfyQtSd3intuKkb5O4hmmPq5UzX6EhuCk95ovNfHULc7DIBKg/viewform and close this ticket.

E-Claire commented 3 years ago

I have the same problem again: p697-appn hangs when I try to access the cluster. However, I am able to access the cluster through p697-submit.

idaElken commented 3 years ago

Same for me - p697-appn got slower and slower throughout the morning, until it crashed. Now using p697-submit.