
Maximum size of job attributes (increase from 64K?) #14702


jmarshall commented 1 month ago

We recently encountered a batch submission that eventually failed after numerous errors like the one below, yet nonetheless left behind a newly created batch containing zero jobs.

```
[…]
  File "/usr/local/lib/python3.10/site-packages/hailtop/utils/utils.py", line 792, in retry_transient_errors
    return await retry_transient_errors_with_debug_string('', 0, f, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/hailtop/utils/utils.py", line 834, in retry_transient_errors_with_debug_string
    st = ''.join(traceback.format_stack())
. The most recent error was <class 'hailtop.httpx.ClientResponseError'> 500, message='Internal Server Error', url=URL('http://batch.hail/api/v1alpha/batches/485962/updates/1/jobs/create') body='500 Internal Server Error\n\nServer got itself in trouble'.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/hailtop/utils/utils.py", line 809, in retry_transient_errors_with_debug_string
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/hailtop/aiocloud/common/session.py", line 117, in _request_with_valid_authn
    return await self._http_session.request(method, url, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/hailtop/httpx.py", line 148, in request_and_raise_for_status
    raise ClientResponseError(
hailtop.httpx.ClientResponseError: 500, message='Internal Server Error', url=URL('http://batch.hail/api/v1alpha/batches/485962/updates/1/jobs/create') body='500 Internal Server Error\n\nServer got itself in trouble'
2024-09-25 01:54:55,288 - hailtop.utils 835 - WARNING - A transient error occured. We will automatically retry. We have thus far seen 50 transient errors (next delay: 60.0s).
```

The corresponding server-side error was

```
pymysql.err.DataError: (1406, "Data too long for column 'value' at row 106")
```

coming from the `INSERT INTO job_attributes …` query in `insert_jobs_into_db()`.

We write a list of the samples being processed as a job attribute, and it turned out that for at least some of the jobs in this batch that list had grown to more than 64 KiB of text.
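
For concreteness, the pattern looks roughly like this (the attribute name, sample count, and pipeline below are illustrative rather than our actual code; the byte count is the point):

```python
# Illustrative sketch only: attribute names and sample counts are made up.
# hailtop.batch lets callers attach arbitrary string key/value attributes
# to a job; the value built here comfortably exceeds MySQL's TEXT limit.
import hailtop.batch as hb

samples = [f'SAMPLE_{i:06d}' for i in range(5000)]
value = ','.join(samples)
print(len(value.encode('utf-8')))  # 69,999 bytes > 65,535 (the TEXT ceiling)

# ServiceBackend() assumes hailctl has already been configured with a
# billing project and remote tmpdir.
b = hb.Batch(name='example-pipeline', backend=hb.ServiceBackend())
j = b.new_job(name='merge', attributes={'samples': value})
j.command('true')
b.run()  # submission hits the server-side 1406 error, surfaced to the
         # client as the 500s shown above
```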

The `job_attributes.value` database field is of type TEXT, which limits each individual attribute value to 2^16 − 1 = 65,535 bytes (just under 64 KiB).
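
The limit is easy to demonstrate in isolation against a scratch MySQL database (the table name here is made up, not Batch's actual schema):

```sql
-- TEXT stores at most 2^16 - 1 = 65,535 bytes per value; one byte more
-- raises the same error 1406 seen in the Batch server logs (under
-- strict SQL mode, MySQL's default).
CREATE TABLE scratch_attributes (`value` TEXT);
INSERT INTO scratch_attributes VALUES (REPEAT('x', 65535));  -- OK
INSERT INTO scratch_attributes VALUES (REPEAT('x', 65536));
-- ERROR 1406 (22001): Data too long for column 'value' at row 1
```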

While writing a long list of sample IDs as an attribute may or may not be a great idea :smile:, it is fair to say that 64K is not a large maximum for user-supplied data here in the 21st century!
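
In the meantime we can of course guard against this on the submitting side. A minimal sketch (the helper below is hypothetical, not part of hailtop) that caps an attribute value under the TEXT limit and records a digest so the full list can be identified elsewhere:

```python
# Hypothetical client-side helper, not part of hailtop: truncate oversized
# attribute values and append a digest of the full value.
import hashlib

MYSQL_TEXT_MAX = 65_535  # bytes: the TEXT column ceiling

def capped_attribute(value: str, limit: int = MYSQL_TEXT_MAX) -> str:
    raw = value.encode('utf-8')
    if len(raw) <= limit:
        return value
    digest = hashlib.sha256(raw).hexdigest()[:16]
    # Leave generous room for the suffix; errors='ignore' drops any
    # multi-byte character split by the truncation.
    head = raw[: limit - 64].decode('utf-8', errors='ignore')
    return f'{head} [truncated; sha256:{digest}]'
```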

It may be worth adding a database migration to change the `job_attributes.value` column type (and perhaps also that of `job_group_attributes.value`) from TEXT to MEDIUMTEXT, which would raise the limit to 16 MiB (at, it appears, the cost of one extra length-prefix byte per row).
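
The DDL itself would be a one-liner per column; something along these lines (illustrative only, and omitting the NOT NULL/charset details that would need to match the existing column definitions):

```sql
-- MEDIUMTEXT stores up to 2^24 - 1 = 16,777,215 bytes per value, using a
-- 3-byte length prefix instead of TEXT's 2 bytes: the one extra byte per
-- row mentioned above.
ALTER TABLE job_attributes MODIFY `value` MEDIUMTEXT;
ALTER TABLE job_group_attributes MODIFY `value` MEDIUMTEXT;
```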

cjllanwarne commented 1 month ago

Hi @jmarshall, the team talked about this issue in our standup today. We had some concerns about the appropriateness of using this table as long-term storage for larger metadata, and about the likely developer effort and system downtime needed to perform the migration. So we don't currently plan to prioritize this in the immediate future, but do let us know if you have any concerns about that, or if it ends up being impossible for you to work around, and we might be able to reconsider (or perhaps come up with alternative solutions). Thanks!