snowex-hackweek / jupyterhub

jupyterhub configuration for snowex hackweek 2021
https://snowex.hackweek.io
MIT License

Database configuration and max connections #17

Closed scottyhq closed 3 years ago

scottyhq commented 3 years ago

It seems like we can currently only have a limited number of simultaneous connections (not sure exactly how many or where this configuration lives). cc @micahjohnson150 @jomey @lsetiawan if you want to dig in.

The EC2 config is here https://github.com/snowex-hackweek/jupyterhub/blob/main/terraform/eks/ec2_postgres.tf and the actual database setup is probably documented over in https://snowexsql.readthedocs.io/en/latest/

```python
from snowexsql.db import get_db

db_name = 'snow:hackweek@52.32.183.144/snowex'
engine, session = get_db(db_name)
engine.table_names()
```

```pytb
OperationalError: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections
(Background on this error at: http://sqlalche.me/e/13/e3q8)
```
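For context (PostgreSQL defaults, not something confirmed in this thread): the FATAL message appears once all non-reserved slots are taken, because PostgreSQL holds back `superuser_reserved_connections` (default 3) of the `max_connections` slots for superusers. A minimal sketch of the arithmetic:

```python
# Sketch of the PostgreSQL connection-slot arithmetic, assuming the
# stock defaults: max_connections = 100, superuser_reserved_connections = 3.

def available_slots(max_connections: int, reserved: int = 3) -> int:
    """Slots usable by ordinary (non-superuser) roles before the
    'remaining connection slots are reserved' error is raised."""
    return max_connections - reserved

# With the default max_connections = 100, all hackweek users share
# only 97 slots:
print(available_slots(100))  # 97
```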
<details>
<summary>Full Traceback</summary>

```pytb
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
   2335         try:
-> 2336             return fn()
   2337         except dialect.dbapi.Error as e:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in connect(self)
    363         if not self._use_threadlocal:
--> 364             return _ConnectionFairy._checkout(self)
    365

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in _checkout(cls, pool, threadconns, fairy)
    777         if not fairy:
--> 778             fairy = _ConnectionRecord.checkout(pool)
    779

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in checkout(cls, pool)
    494     def checkout(cls, pool):
--> 495         rec = pool._do_get()
    496         try:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
    139             with util.safe_reraise():
--> 140                 self._dec_overflow()
    141         else:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
     67         if not self.warn_only:
---> 68             compat.raise_(
     69                 exc_value,

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    181     try:
--> 182         raise exception
    183     finally:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
    136             try:
--> 137                 return self._create_connection()
    138             except:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in _create_connection(self)
    308
--> 309         return _ConnectionRecord(self)
    310

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in __init__(self, pool, connect)
    439         if connect:
--> 440             self.__connect(first_connect_check=True)
    441         self.finalize_callback = deque()

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
    660             with util.safe_reraise():
--> 661                 pool.logger.debug("Error on connect(): %s", e)
    662         else:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
     67         if not self.warn_only:
---> 68             compat.raise_(
     69                 exc_value,

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    181     try:
--> 182         raise exception
    183     finally:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
    655             self.starttime = time.time()
--> 656             connection = pool._invoke_creator(self)
    657             pool.logger.debug("Created new connection %r", connection)

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py in connect(connection_record)
    113                         return connection
--> 114             return dialect.connect(*cargs, **cparams)
    115

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/default.py in connect(self, *cargs, **cparams)
    507         # inherits the docstring from interfaces.Dialect.connect
--> 508         return self.dbapi.connect(*cargs, **cparams)
    509

/srv/conda/envs/notebook/lib/python3.8/site-packages/psycopg2/__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)
    121     dsn = _ext.make_dsn(dsn, **kwargs)
--> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
    123     if cursor_factory is not None:

OperationalError: FATAL:  remaining connection slots are reserved for non-replication superuser connections

The above exception was the direct cause of the following exception:

OperationalError                          Traceback (most recent call last)
in
      1 # Output the list of tables in the database
----> 2 engine.table_names()

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in table_names(self, schema, connection)
   2314         """
   2315
-> 2316         with self._optional_conn_ctx_manager(connection) as conn:
   2317             return self.dialect.get_table_names(conn, schema)

/srv/conda/envs/notebook/lib/python3.8/contextlib.py in __enter__(self)
    111         del self.args, self.kwds, self.func
    112         try:
--> 113             return next(self.gen)
    114         except StopIteration:
    115             raise RuntimeError("generator didn't yield") from None

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _optional_conn_ctx_manager(self, connection)
   2084     def _optional_conn_ctx_manager(self, connection=None):
   2085         if connection is None:
-> 2086             with self._contextual_connect() as conn:
   2087                 yield conn

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _contextual_connect(self, close_with_result, **kwargs)
   2300         return self._connection_cls(
   2301             self,
-> 2302             self._wrap_pool_connect(self.pool.connect, None),
   2303             close_with_result=close_with_result,

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
   2337         except dialect.dbapi.Error as e:
   2338             if connection is None:
-> 2339                 Connection._handle_dbapi_exception_noconnection(
   2340                     e, dialect, self

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception_noconnection(cls, e, dialect, engine)
   1581             util.raise_(newraise, with_traceback=exc_info[2], from_=e)
   1582         elif should_wrap:
-> 1583             util.raise_(
   1584                 sqlalchemy_exception, with_traceback=exc_info[2], from_=e

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    181     try:
--> 182         raise exception
    183     finally:

[... pool checkout frames identical to the ones in the first traceback ...]

OperationalError: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections
(Background on this error at: http://sqlalche.me/e/13/e3q8)
```
</details>
jomey commented 3 years ago

A starting point is the `rolconnlimit` column in the `pg_roles` view: https://www.postgresql.org/docs/current/view-pg-roles.html

Check what value is currently set there.
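A sketch of how to read what that query returns (the `snow` role name is taken from the connection string above; the helper below is illustrative, not from the thread). The query would be something like `SELECT rolname, rolconnlimit FROM pg_roles WHERE rolname = 'snow';`:

```python
# Interpreting pg_roles.rolconnlimit: PostgreSQL stores -1 to mean
# "no per-role connection limit" -- the server-wide max_connections
# setting is then the only cap.

def describe_rolconnlimit(limit: int) -> str:
    if limit == -1:
        return "unlimited (server-wide max_connections still applies)"
    return f"at most {limit} concurrent connections"

print(describe_rolconnlimit(-1))
```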

scottyhq commented 3 years ago

Not sure how running JupyterHub on k8s affects things here. Note that each node has a unique internal IP (`hostname -I | awk '{print $1}'` --> 172.AA.AAA.AA) and each pod has a unique internal IP (172.BB.BBB.BB), but everyone seems to have the same external IP (`curl https://ipinfo.io/ip` --> 44.CCC.CC.CCC) regardless of which node their server is running on.

micahjohnson150 commented 3 years ago

I think what @jomey has pointed out here is what is going on. Something I didn't think of.

micahjohnson150 commented 3 years ago

Well, I checked what the user had for max connections; it was unlimited.

As in the title of this issue @scottyhq made, the postgres config had a `max_connections` of.... 100 people. I just changed it to 500.

jomey commented 3 years ago

I think that connections != people.

It's each notebook that establishes a connection, so if you have 70 people working on 4 parallel notebooks each, then you will have 280 connections unless they shut down and restart their kernels.
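The arithmetic above as a sketch (the numbers are illustrative, not measured):

```python
# Each open notebook kernel holds its own DB connection, so connections
# scale with the number of kernels, not the number of people.
users = 70
notebooks_per_user = 4
connections = users * notebooks_per_user
print(connections)  # 280 -- nearly 3x a max_connections of 100
```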

jomey commented 3 years ago

I am not certain on this one and I doubt it exists, but maybe there is an 'idle_timeout' at which point SQLAlchemy shuts down the connection and re-connects if it is used again.
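SQLAlchemy does ship something close to this: the `pool_recycle` engine option (and, on SQLAlchemy 1.2+, `pool_pre_ping=True`) discards pooled connections older than a given age instead of reusing them, e.g. `create_engine(url, pool_recycle=900, pool_pre_ping=True)`. A self-contained sketch of the recycle idea (`RecyclingSlot` is a made-up name for illustration, not SQLAlchemy API):

```python
import time

class RecyclingSlot:
    """Toy model of pool_recycle: a cached connection past its recycle
    window is dropped and replaced on the next checkout."""

    def __init__(self, connect, recycle_seconds=900, clock=time.monotonic):
        self._connect = connect          # factory returning a fresh connection
        self._recycle = recycle_seconds
        self._clock = clock
        self._conn = None
        self._created = 0.0

    def checkout(self):
        stale = (self._conn is not None
                 and self._clock() - self._created > self._recycle)
        if self._conn is None or stale:
            self._conn = self._connect()  # open (or re-open) the connection
            self._created = self._clock()
        return self._conn
```

With a 900-second window, a kernel left idle over lunch would get a fresh connection on its next query instead of holding a stale slot open indefinitely.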

micahjohnson150 commented 3 years ago

Yep, per our convo, I just set `tcp_keepalives_idle` to 900 seconds (15 minutes).
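For the record, a sketch of how that setting can be applied without a restart (assumes superuser access on the EC2 Postgres instance; note that `tcp_keepalives_idle` tunes TCP keepalive probes for detecting dead peers rather than closing idle sessions by itself):

```sql
ALTER SYSTEM SET tcp_keepalives_idle = 900;  -- seconds
SELECT pg_reload_conf();                     -- no server restart needed
SHOW tcp_keepalives_idle;                    -- verify the new value
```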

jomey commented 3 years ago

I am closing this issue in favor of continuing with the logged SnowExSql issue