fix: :ambulance: Fix connection to the db failing when db_reset=True

albireox commented 3 months ago

Attempt at fixing #32. The problem seems to be related to the ContextVar for _state and the fact that the previous change removed the fastapi dependency.

havok2063 commented 3 months ago

This does seem to fix it. I've tested a few valis routes, the cone search and db info search, and they worked. I also tested the zora search submission, which works. However the target page from zora doesn't work, and throws the same error. This may be related to the concurrent queries from the Target page. I'd like to preserve that if possible since it speeds up page loading.

As for the ContextVar thing, we may not need this anymore. I was originally following this FastAPI guide on how to implement peewee with async. Since then, they've deprecated support for peewee since it doesn't play well with async, and suggest using sqlalchemy. https://fastapi.tiangolo.com/how-to/sql-databases-peewee/?h=peewee

havok2063 commented 3 months ago

Another question is should we just make the default be db_reset: false? This would also fix the concurrent issue, but would, I think, essentially create a global db connection, which I'm not sure is what we want. FastAPI recommends creating a new connection per request, thus wrapping it up into a Dependency, but maybe we don't care about that?

albireox commented 3 months ago

Ok, I'll have a look at that issue. I think right now if db_reset=True the code should be almost identical to before the original change, but I must have missed something.

I think by default db_reset should always be True, see https://github.com/sdss/valis/blob/3468f2e95ea1fa1b935140551c10f40e178e9e47/python/valis/settings.py#L42 We can discuss whether we want to change that. I think it's safer to create a new connection for each request; the only problem is that creating a new connection triggers a new reflection of the catalogdb tables, which causes a non-negligible overhead. That could make quick queries to take several times what they should due to the reflection, which anyway is not needed. But I think there are ways to prevent that (or we can add options to sdssdb).

albireox commented 3 months ago

OK, can you give this one more try after pulling? Note that I have also update sdssdb to 0.12.3. This is not strictly necessary for this issue but 0.12.2 would cause reflection of catalogdb in pipelines to be extremely slow because of all the missing tables.

I cannot say I fully understand the issue yet, but there seemed to be two additional problems:

The reset_db_state depend does need to be an async function or the ContextVar doesn't seem to work properly.
When db_reset=True, if Zora requests multiple routes concurrently when the DB has not yet been initialised there would be a race condition error because multiple instances would try to initialise the DB at the same time. So I added an asyncio.Lock to prevent multiple connection attempts at the same time. This works fine with uvicorn in development mode but not with gunicorn; I suspect it's a limitation of the UvicornWorker for gunicorn. Following this I added a timeout to the wsgi configuration.

With these changes the Zora target page works for me in all cases (uvicorn with and without db_reset, and the same with gunicorn in production).

But this is looking like a serious hack at some point. In a different issue we should consider alternatives, but I'll write a couple ideas here for now:

We can remove the use of peewee if it doesn't play well with FastAPI and async and replace it with SQLAlchemy 2 (or maybe SQLModel if we want to stay in the ecosystem). This may be less of an issue now that catalogdb is quite stable and maybe we don't need reflection.
If we continue using peewee we should consider a persistent DB connection (like db_reset=False does). I think that would be ok since peewee reconnects if the connection has been lost. We could also have the DB connect to the server during the app initialisation so that at that point in which a request is processed there are no more issues with multiple connections because there is a single instance of the connection.

albireox commented 3 months ago

The tests are failing right now but I think that will be fixed by merging #36 first.

albireox commented 3 months ago

Sounds good, let's merge this for now and look at long-term solutions.

sdss / valis

fix: :ambulance: Fix connection to the db failing when db_reset=True #34