usegalaxy-no / galaxyadmin

A repository for managing the work of the usegalaxy.no GalaxyAdmin team

Stabilize usegalaxy <-> NeLS dataset transfers #9

Closed kjellp closed 2 years ago

kjellp commented 3 years ago

Users are sometimes detected as logged out of their Galaxy session when being redirected back to Galaxy from NeLS. When this happens, initiation of the transfer fails, nothing happens in the Galaxy interface, and the user has to try again.

kjellp commented 3 years ago

The issue has been explored a bit by Kjell with help from Kidane:

1) It can be reproduced in prod, using Chrome and Firefox, in both incognito and normal modes.
2) It only happens with Send/Get datasets, not with Import/Export histories, even though both use the same NeLS endpoints before redirecting back to usegalaxy.no.
3) First main difference between the old NeLS Galaxy and the new usegalaxy.no setup in relation to NeLS: the old setup shared the same *.bioinfo.no domain, whereas the new setup has two different domains for the two services.
4) Second main difference: usegalaxy.no Galaxy pages are served by nginx directly, whereas the old servers used Apache.
5) The randomness of the errors seems to be linked to how often Galaxy correctly re-initiates the user's galaxy-session cookie in the browser when NeLS posts the redirect URL back to usegalaxy.no, i.e. the cookie is sometimes lost when the user is redirected to another domain and returns.

Some additional exploration notes and possible solutions to explore further:

The history import/export callback uses "nga", while the Get/Send dataset tools use Galaxy-internal code that depends on Galaxy cookies to re-establish the user's session. The session is sometimes lost, forcing the user to log in again.

example url of history export:

example url of send files to nels:

hypothesis

bruggerk commented 3 years ago

However, when doing history exports to NeLS from one of the old Galaxies, the route is uib-galaxy -> nels -> usegalaxy -> uib-galaxy, and I have never seen that break. I rather think things break during job registration/creation without the Galaxy frames redirection thingy, as the NGA does not do that through Galaxy but through bioblend. /Kim

On Wed, 2020-12-02 at 02:25 -0800, Kjell Petersen wrote:

The issue has been explored a bit by Kjell with help from Kidane:

1) It can be reproduced in prod, using Chrome and Firefox, in both incognito and normal modes.
2) It only happens with Send/Get datasets, not with Import/Export histories, even though both use the same NeLS endpoints before redirecting back to usegalaxy.no.
3) First main difference between the old NeLS Galaxy and the new usegalaxy.no setup in relation to NeLS: the old setup shared the same *.bioinfo.no domain, whereas the new setup has two different domains for the two services.
4) Second main difference: usegalaxy.no Galaxy pages are served by nginx directly, whereas the old servers used Apache.
5) The randomness of the errors seems to be linked to how often Galaxy correctly re-initiates the user's galaxy-session cookie in the browser when NeLS posts the redirect URL back to usegalaxy.no, i.e. the cookie is sometimes lost when the user is redirected to another domain and returns.

Some additional exploration notes and possible solutions to explore further:

The history import/export callback uses "nga", while the Get/Send dataset tools use Galaxy-internal code that depends on Galaxy cookies to re-establish the user's session. The session is sometimes lost, forcing the user to log in again.

example url of history export:

https://nels.bioinfo.no/welcome.xhtml?appCallbackUrl=https://usegalaxy.no/nga/export/eb1d162a18/9864fad2cc/

example url of send files to nels:

https://nels.bioinfo.no/welcome.xhtml?appCallbackUrl=https%3A//usegalaxy.no/root%3Ftool_id%3Dnels_exporter_hidden&appName=Galaxy&GALAXY_URL=https%3A//usegalaxy.no/tool_runner%3Ftool_id%3Dnels_export
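To make the difference between the two example URLs easier to see, here is a small sketch that decodes the `appCallbackUrl` parameter with Python's standard library. The helper names are made up for illustration; only the two URLs come from this thread.

```python
from urllib.parse import parse_qs, urlsplit

# The two example URLs quoted above.
HISTORY_EXPORT_URL = (
    "https://nels.bioinfo.no/welcome.xhtml"
    "?appCallbackUrl=https://usegalaxy.no/nga/export/eb1d162a18/9864fad2cc/"
)
SEND_FILES_URL = (
    "https://nels.bioinfo.no/welcome.xhtml"
    "?appCallbackUrl=https%3A//usegalaxy.no/root%3Ftool_id%3Dnels_exporter_hidden"
    "&appName=Galaxy"
    "&GALAXY_URL=https%3A//usegalaxy.no/tool_runner%3Ftool_id%3Dnels_export"
)


def callback_target(nels_url: str) -> str:
    """Return the decoded appCallbackUrl that NeLS is asked to post back to."""
    query = parse_qs(urlsplit(nels_url).query)
    return query["appCallbackUrl"][0]


def callback_path(nels_url: str) -> str:
    """Return only the path of the callback, i.e. which service handles it."""
    return urlsplit(callback_target(nels_url)).path


for label, url in [("history export", HISTORY_EXPORT_URL), ("send files", SEND_FILES_URL)]:
    print(f"{label}: {callback_target(url)} (path {callback_path(url)})")
```

The history export calls back to a path under `/nga/`, while the send-files flow calls back to Galaxy's own `/root`, which is the path that relies on the browser session cookie.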

hypothesis

the cookies set by Galaxy after a successful login seem to be lost sometimes when going to NeLS and coming back (with a POST)

idea-1: nginx's proxy and reverse-proxy settings could be dropping cookies because of the user's transit to another domain and back, i.e. usegalaxy.no (GET) -> nels.bioinfo.no (GET with callBackUrl) -> usegalaxy.no (POST to the callback URL). Explore these settings and make sure nginx is not dropping cookies.

idea-2: if it is due to the "different domain" issue, and assuming idea-1 doesn't fix it, try opening the cookie policy on nginx to allow all at first, then restrict it to the specific domain later.
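To illustrate the "different domain" hypothesis, here is a simplified model of how browsers apply the SameSite cookie attribute to the last hop of the flow above. It is a sketch of the general browser rule only, not Galaxy or nginx code; it ignores browser-specific exceptions, and the hostname `galaxy.bioinfo.no` is a hypothetical stand-in for an old Galaxy under *.bioinfo.no.

```python
# Simplified model of browser SameSite handling (illustration only).
SAFE_METHODS = {"GET", "HEAD", "OPTIONS", "TRACE"}


def site_of(host: str) -> str:
    """Approximate the registrable domain as the last two labels.

    Real browsers consult the Public Suffix List; this shortcut is only
    good enough for the hostnames discussed in this thread.
    """
    return ".".join(host.lower().split(".")[-2:])


def cookie_is_sent(samesite: str, method: str, from_host: str, to_host: str) -> bool:
    """Would a top-level navigation from from_host to to_host carry the cookie?"""
    if site_of(from_host) == site_of(to_host):
        return True  # same-site requests always carry the cookie
    policy = samesite.lower()
    if policy == "none":
        return True  # sent cross-site (browsers also require Secure)
    if policy == "lax":
        return method.upper() in SAFE_METHODS  # cross-site GET yes, POST no
    if policy == "strict":
        return False
    raise ValueError(f"unknown SameSite value: {samesite!r}")


# Final hop of the flow: NeLS POSTs back to the callback URL.
new_setup = cookie_is_sent("Lax", "POST", "nels.bioinfo.no", "usegalaxy.no")
# Hypothetical hostname standing in for an old Galaxy under *.bioinfo.no.
old_setup = cookie_is_sent("Lax", "POST", "nels.bioinfo.no", "galaxy.bioinfo.no")
print(new_setup, old_setup)
```

Under this model a Lax cookie would be withheld on the cross-domain POST but sent when both services share bioinfo.no. Whether that is what actually happens here is exactly what the hypothesis needs to test.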


bruggerk commented 3 years ago

Had a dig around, and it looks like a general problem, not one isolated to the NeLS transfer tools.

I used: Get Data -> UCSC Archaea table browser

Has anyone checked with the community if this is only experienced by us or by others as well?

/kim

kjellp commented 3 years ago

@torfinnnome and I did some testing Wed-Friday, turning the number of uwsgi threads down from 4 to 1. Then the error disappeared, and it did not reappear when the number of threads was reset to 4 on Friday. (But I assume things were changed in parallel by @bruggerk, so those changes could be the real fix, or an updated patch from the playbook being re-run.)

I tried to re-create the error @bruggerk mentioned (Get Data -> UCSC Archaea table browser), but the redirection worked nicely there today (will try a bit more extensively). The import job gets registered correctly (but does not start, most likely a different issue). Have not heard of others reporting similar trouble, and couldn't find any mention when googling a bit.

bruggerk commented 3 years ago

On Mon, 2020-12-07 at 00:25 -0800, Kjell Petersen wrote:

@torfinnnome and I did some testing Wed-Friday, turning the number of uwsgi threads down from 4 to 1. Then the error disappeared, and it did not reappear when the number of threads was reset to 4 on Friday.

By limiting the number of threads to 1 we lose all parallelism --> longer load times.

I tried to re-create the error @bruggerk mentioned (Get Data -> UCSC Archaea table browser), but the redirection worked nicely there today (will try a bit more extensively). Have not heard of others reporting similar trouble, and couldn't find any mention when googling a bit.

Was this using the test or prod server? I used the test one. Just to clarify: is this a problem on both the test and prod servers, or just the test one?


bruggerk commented 3 years ago

Just tried the Archaea import on prod and it locks me out. /Kim

kjellp commented 3 years ago

All tests are done using test.usegalaxy.no (but the error was verified on both test and prod 5 days ago).

I just tried "Get Data -> Main UCSC", and got logged out 2 out of ~8 times... So it's still there. It is not possible to test "Send/Get to/from NeLS", as the NeLS test stack currently gives 401 - Unauthorized on all files and folders after login.

bruggerk commented 3 years ago

It was reset after a redeployment of the sites this morning, but works with prod (both history im/export and normal im/export). Plus it is back running with 4 threads. /Kim


kjellp commented 3 years ago

It was reset after a redeployment of the sites this morning, but works with prod (both history im/export and normal im/export). Plus it is back running with 4 threads. /Kim

Sorry, I don't follow now; what was reset? The test.usegalaxy.no server (or test NeLS, or NGA)? I have 7 hanging import jobs on test.usegalaxy.no and cannot see any file/folder content in test NeLS due to 401 (could be a user-specific error).

I noticed test.usegalaxy.no was connected to prod NeLS for a short period, so I assume you mean that transfers were working as expected then?

K.


bruggerk commented 3 years ago

Yes, test.usegalaxy.no was redeployed; prior to that, test.usegalaxy.no used the prod-NeLS setup for testing both types of imports/exports, as test-NeLS is broken. What kind of imports are we talking about, and on what server? I assume general NeLS transfers, as I cannot see anything hanging in any of the NGA instances.


kjellp commented 3 years ago

Yes, test.usegalaxy.no was redeployed; prior to that, test.usegalaxy.no used the prod-NeLS setup for testing both types of imports/exports, as test-NeLS is broken. What kind of imports are we talking about, and on what server? I assume general NeLS transfers, as I cannot see anything hanging in any of the NGA instances.

They are Get Data -> UCSC (Main or Archaea) import jobs that have been successfully initiated (successfully redirecting back to Galaxy) but not started after being queued. On test.usegalaxy.no.

bruggerk commented 3 years ago

After loads of digging and debugging, I think this might be the source of the problem:

image

Not sure how to fix it, but believe it is around here: /srv/galaxy/server/lib/galaxy/web/framework/webapp.py

Will keep digging

bruggerk commented 3 years ago

Some serious hacking of the standard Python libraries reveals that this was not the case: setting SameSite = Lax for the cookie does not change anything, apart from removing a warning or two.

Actually, it is now never working.
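For reference, a minimal sketch of what setting SameSite on a session cookie looks like with Python's standard `http.cookies` module (the `samesite` attribute needs Python 3.8 or newer). This is not the code in webapp.py; the cookie name `galaxysession`, the value and the helper name are assumptions for illustration.

```python
from http.cookies import SimpleCookie


def build_session_cookie(value: str, samesite: str = "Lax") -> str:
    """Return a Set-Cookie header value with an explicit SameSite attribute."""
    cookie = SimpleCookie()
    cookie["galaxysession"] = value  # cookie name is an assumption
    morsel = cookie["galaxysession"]
    morsel["path"] = "/"
    morsel["httponly"] = True
    morsel["secure"] = True  # browsers require Secure when SameSite=None
    morsel["samesite"] = samesite
    return morsel.OutputString()


print("Set-Cookie:", build_session_cookie("abc123"))
print("Set-Cookie:", build_session_cookie("abc123", samesite="None"))
```

Printing the header for both `Lax` and `None` makes it easy to compare against what the browser's developer tools show for the real response.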

bruggerk commented 3 years ago

I have debugged and tracked my way through the failing transfers, and the problem originates in uwsgi and not in the Galaxy code. Currently it fails in 1 out of 6 imports regardless of the number of processes/threads used (within a reasonable sampling space).

I suggest talking to the community regarding this unless someone wants to do the uwsgi bit
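As a rough guide to how many test imports are needed before a "fix" can be trusted, here is a small calculation based on the 1 in 6 failure rate above. It assumes each import fails independently with the same probability, which may not hold for this bug; the function names are made up for illustration.

```python
import math


def prob_no_failures(failure_rate: float, attempts: int) -> float:
    """Chance of seeing zero failures by luck alone in `attempts` tries."""
    return (1.0 - failure_rate) ** attempts


def attempts_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of clean attempts after which an unfixed bug would
    have shown up at least once with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


rate = 1 / 6  # failure rate reported in this comment
print(f"P(no failure in 8 tries) = {prob_no_failures(rate, 8):.3f}")
print(f"clean tries needed for 95% confidence: {attempts_needed(rate)}")
```

With a 1 in 6 failure rate, a run of 8 clean imports still happens by chance roughly a quarter of the time, so a handful of successful attempts does not show that a change fixed anything.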

kjellp commented 3 years ago

We discussed in the GalaxyAdmin group posting a request in the Gitter channel, after testing whether the bug can be reproduced on usegalaxy.eu/org as well. Have not been able to test or post yet.

kjellp commented 3 years ago

Not able to reproduce on usegalaxy.no (prod) now. But it is present every time on test.usegalaxy.no, consistently across every remote data service (NeLS, UCSC Main, others). Post-back URLs are being redirected to the login page instead of starting an import job in the active session.

kjetilkl commented 2 years ago

I'm closing this issue since this functionality is now covered by the NeLS remote storage plugin and the old tools have been deprecated