Open erikerlandson opened 2 years ago
cc @MichaelTiemannOSC @caldeirav
Also need to check if we have to add any openshift nodes to accommodate extra resources.
I have a query that supposedly returns 9 rows:
qres = engine.execute("""
select count (*)
from sec_dera.sub as S join sec_dera.num as N on S.cik=1144800 and S.fy >=DATE('2018-01-01') and S.adsh=N.adsh and N.ddate>=DATE('2018-01-01') and abs(date_diff('day', fy, ddate)) > 366
""")
qres.fetchall()
[(9,)]
However, when I try to look at the nine rows:
qres = engine.execute("""
select S.fy, N.ddate, tag
from sec_dera.sub as S join sec_dera.num as N on S.cik=1144800 and S.fy >=DATE('2018-01-01') and S.adsh=N.adsh and N.ddate>=DATE('2018-01-01') and abs(date_diff('day', fy, ddate)) > 366
""")
qres.fetchall()
I get this:
---------------------------------------------------------------------------
TrinoQueryError Traceback (most recent call last)
<ipython-input-51-f3d48ff6bbc4> in <module>
3 from sec_dera.sub as S join sec_dera.num as N on S.cik=1144800 and S.fy >=DATE('2018-01-01') and S.adsh=N.adsh and N.ddate>=DATE('2018-01-01') and abs(date_diff('day', fy, ddate)) > 366
4 """)
----> 5 l = qres.fetchall()
6 len(l)
/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/engine/result.py in fetchall(self)
1215 return l
1216 except BaseException as e:
-> 1217 self.connection._handle_dbapi_exception(
1218 e, None, None, self.cursor, self.context
1219 )
/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
1466 util.raise_from_cause(sqlalchemy_exception, exc_info)
1467 else:
-> 1468 util.reraise(*exc_info)
1469
1470 finally:
/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
127 if value.__traceback__ is not tb:
128 raise value.with_traceback(tb)
--> 129 raise value
130
131 def u(s):
/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/engine/result.py in fetchall(self)
1211
1212 try:
-> 1213 l = self.process_rows(self._fetchall_impl())
1214 self._soft_close()
1215 return l
/opt/app-root/lib64/python3.8/site-packages/sqlalchemy/engine/result.py in _fetchall_impl(self)
1161 def _fetchall_impl(self):
1162 try:
-> 1163 return self.cursor.fetchall()
1164 except AttributeError:
1165 return self._non_result([])
/opt/app-root/lib64/python3.8/site-packages/trino/dbapi.py in fetchall(self)
455
456 def fetchall(self) -> List[List[Any]]:
--> 457 return list(self.genall())
458
459 def cancel(self):
/opt/app-root/lib64/python3.8/site-packages/trino/client.py in __iter__(self)
441 # Subsequent fetches from GET requests until next_uri is empty.
442 while not self._query.finished:
--> 443 rows = self._query.fetch()
444 for row in rows:
445 self._rownumber += 1
/opt/app-root/lib64/python3.8/site-packages/trino/client.py in fetch(self)
518 """Continue fetching data for the current query_id"""
519 response = self._request.get(self._request.next_uri)
--> 520 status = self._request.process(response)
521 if status.columns:
522 self._columns = status.columns
/opt/app-root/lib64/python3.8/site-packages/trino/client.py in process(self, http_response)
388 logger.debug("HTTP %s: %s", http_response.status_code, response)
389 if "error" in response:
--> 390 raise self._process_error(response["error"], response.get("id"))
391
392 if constants.HEADER_CLEAR_SESSION in http_response.headers:
TrinoQueryError: TrinoQueryError(type=INSUFFICIENT_RESOURCES, name=EXCEEDED_LOCAL_MEMORY_LIMIT, message="Query exceeded per-node user memory limit of 819.20MB [Allocated: 812.94MB, Delta: 6.47MB, Top Consumers: {HashBuilderOperator=812.94MB}]", query_id=20211219_122314_00099_bgxv3)
Am I missing something obvious?
@MichaelTiemannOSC It looks like you are bumping up against this:
query.max-memory-per-node
Type: data size
Default value: (JVM max memory * 0.1)
This is the max amount of user memory a query can use on a worker. User memory is allocated during execution for things that are directly attributable to, or controllable by, a user query. For example, memory used by the hash tables built during execution, memory used during sorting, etc. When the user memory allocation of a query on any worker hits this limit, it is killed.
https://trino.io/docs/current/admin/properties-memory-management.html#query-max-memory-per-node
This is actually big progress - the new memory configuration is causing your query to fail with a proper error message instead of just crashing the worker nodes. (The 819.20MB limit in the error is 10% of an 8192MB heap, which matches the default above.)
I have noticed that the field COREG in SEC_DERA.NUM is more important than I first thought. When COREG is NULL, the fact is reported by the consolidated entity, which means one fact of its kind per report, which is what I generally expect. However, when COREG is not null, it means somebody else is reporting a fact about that company, and there can be many sources citing that fact. I'm still trying to figure out exactly what it means, but I'm now looking very differently at my joins, knowing that COREG IS NULL gives one hit while otherwise it can give 7,000 or 941,000 hits, depending. Fun!
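For illustration, a minimal sketch of the earlier query with the join keyed on adsh and a coreg IS NULL filter added, per the description above - where the predicates are placed is my assumption about what you want, and this is not by itself a fix for the memory error:
qres = engine.execute("""
select S.fy, N.ddate, N.tag
from sec_dera.sub as S
join sec_dera.num as N
  on S.adsh = N.adsh
where S.cik = 1144800
  and S.fy >= DATE('2018-01-01')
  and N.ddate >= DATE('2018-01-01')
  and abs(date_diff('day', S.fy, N.ddate)) > 366
  and N.coreg is null  -- keep only facts reported by the consolidated entity
""")
qres.fetchall()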
This is a good example of the duality of 1) providing sufficient resources and 2) developing a deep enough understanding of the data and query behavior to design queries that are as performant as possible.
Regarding (1), I think it would be reasonable to move to 16GB of RAM, and also maybe double query.max-memory-per-node and query.max-total-memory-per-node from their default values.
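For reference, a hedged sketch of what that could look like in each node's etc/config.properties - the property names are taken from the Trino docs quoted above; the first value simply doubles the 819.20MB limit seen in the error, and the second is a placeholder to be chosen relative to its own default:
# etc/config.properties on coordinator and workers (illustrative values only)
query.max-memory-per-node=1.6GB
query.max-total-memory-per-node=3.2GB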
I'm not asking to raise limits willy-nilly. I totally concur that we need to better understand the data and query side as well.
That said, I also don't have much experience with database provisioning, nor what constitutes reasonable per-node maxima. Happy to watch and learn on that front.
Per Thomas - this, as well as Jupyter notebook memory sizing, needs to be increased.
We have some evidence that queries are failing due to memory limitations on the Trino workers. The proposal is to double the memory from its current setting of 4GB to 8GB, and to investigate whether increasing the coordinator memory is also relevant.
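To frame the change, a sketch of where a 4GB to 8GB bump would typically land, assuming (and this is an assumption) that the 4GB refers to the worker JVM heap set in etc/jvm.config; the matching OpenShift pod memory limits would need to grow alongside it:
# etc/jvm.config on each Trino worker (illustrative)
-server
-Xmx8G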