sentry-io[bot] closed this issue 1 year ago
There is at least one mitxonline content file larger than the 10 MB Elasticsearch limit: document_4554_census80.csv (15 MB), for course run course-v1:MITxT+14.310x+2T2023. Most of the content extracted by tika seems useless. A subset:
```
workedm\tweeksm\twhitem\tblackm\thispm\tothracem\tsex1st\tsex2nd\tageq2nd\tageq3rd\tnumberkids\n\t0\t0\t1\t0\t0\t0\t0\t1\t30\t\t2\n\t1\t52\t1\t0\t0\t0\t0\t1\t9\t\t2\n\t1\t30\t1\t0\t0\t0\t1\t0\t22\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t1\t12\t\t2\n\t0\t0\t0\t1\t0\t0\t0\t1\t14\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t0\t50\t48\t3\n\t1\t22\t1\t0\t0\t0\t1\t1\t27\t\t2\n\t1\t26\t1\t0\t0\t0\t1\t1\t46\t22\t3\n\t1\t40\t1\t0\t0\t0\t0\t1\t7\t\t2\n\t0\t0\t1\t0\t0\t0\t1\t0\t25\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t1\t3\t\t2\n\t1\t52\t1\t0\t0\t0\t0\t0\t42\t\t2\n\t0\t0\t1\t0\t0\t0\t1\t0\t8\t\t2\n\t1\t52\t1\t0\t0\t0\t1\t0\t25\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t1\t22\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t0\t11\t6\t3\n\t1\t52\t1\t0\t0\t0\t1\t0\t21\t\t2\n\t1\t10\t1\t0\t0\t0\t1
```
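For context, this is roughly how the extraction happens (a minimal sketch using the tika Python bindings; the pipeline's actual code may differ):

```python
# Minimal sketch of the extraction step using the tika Python bindings
# (an assumption; the actual pipeline code may differ). The `tika`
# package starts a local Tika server automatically, which requires Java.
from tika import parser

parsed = parser.from_file("document_4554_census80.csv")
content = parsed.get("content") or ""

# The whole extracted string is sent to Elasticsearch, so it is the
# encoded size of this value that runs into the 10 MB request limit.
print(f"extracted {len(content.encode('utf-8'))} bytes")
print(content[:200])
```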
Options:
- Increase http.max_content_length
- Log a warning and truncate the content so it is under the limit
@mbertrand IMO those are all valid solutions, but there is a catch with http.max_content_length. In this particular example, increasing http.max_content_length doesn't really get us anything, since the content seems to be nonsense (is this possibly a binary or proprietary file type rather than actually a CSV?).
But just because this file wouldn't benefit doesn't mean there aren't, or won't be, other files that would.
HOWEVER, and this is a bummer: AWS decides what this value will be for us, and we have no option to change it besides changing the instance type. If we were still rolling our own ES/OpenSearch, this would be straightforward to adjust.
https://docs.aws.amazon.com/opensearch-service/latest/developerguide/limits.html#network-limits
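For anyone curious, you can confirm the effective value via the cluster settings API. A sketch, where `<domain-endpoint>` is a placeholder and any auth an AWS-managed domain needs (e.g. request signing) is omitted:

```python
# Sketch: read the effective http.max_content_length from the cluster
# settings API. <domain-endpoint> is a placeholder; a managed AWS domain
# would also require authentication, which is omitted here for brevity.
import requests

resp = requests.get(
    "https://<domain-endpoint>/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
)
print(resp.json()["defaults"]["http.max_content_length"])
```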
😞
Thanks Mike! @pdpinch should I truncate any content that goes over the limit instead? This particular one looks like a tab-delimited file of census/demographics numbers, which is useless for search, but I'm not sure there's a reliable way to detect something like that. I found a Python module that might do it, but it hasn't been updated in 4 years: https://github.com/casics/nostril
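If we did try it, I'd guess the usage would look something like this (untested sketch; `nonsense()` is the function shown in the project's README, and it only scores runs of letters):

```python
# Untested sketch of gibberish detection with nostril, based on the
# nonsense() function shown in the project's README. The library is
# unmaintained, so treat the API details here as assumptions.
from nostril import nonsense

def looks_like_gibberish(text, sample_size=1000):
    # nostril scores sequences of letters, so strip everything else;
    # a numeric table like the census file would have almost none left.
    letters = "".join(ch for ch in text if ch.isalpha())[:sample_size]
    if len(letters) < 6:  # nostril needs at least a few characters
        return True
    try:
        return nonsense(letters)
    except ValueError:  # raised for input it can't score
        return True
```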
Detecting content over 10 MB is straightforward, though.
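Something like this is what I had in mind (a sketch; `MAX_CONTENT_BYTES` is a made-up constant, and in practice we'd leave headroom for the rest of the request body):

```python
import logging

log = logging.getLogger(__name__)

# Hypothetical constant: the 10 MB request cap, minus some headroom for
# the rest of the indexing request body around the content field.
MAX_CONTENT_BYTES = 10 * 1024 * 1024 - 100 * 1024

def truncate_for_index(name, content, max_bytes=MAX_CONTENT_BYTES):
    """Log a warning and cut content so it fits under the limit."""
    encoded = content.encode("utf-8")
    if len(encoded) <= max_bytes:
        return content
    log.warning(
        "Content for %s is %d bytes; truncating to %d",
        name, len(encoded), max_bytes,
    )
    # errors="ignore" drops any multi-byte character split by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```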
Wow, 10MB is such a low ceiling.
I like this option:
> Log a warning and truncate the content so it is under the limit
I didn't understand what is and is not reliable in your last comment @mbertrand. Let me know if my preference doesn't make sense.
@pdpinch I was referring to a way of determining whether the content of a file contains text that would be useful for search purposes: actual words and sentences instead of just numbers or random gibberish. But that seems like a separate issue anyway.
Your preference makes sense to me. The only thing about logging is that, since all the edx content files are processed on each task run, the warning will be logged on every run as well.
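We could at least keep it from repeating within a single run, though that wouldn't help across runs (a sketch; `warn_once` is a hypothetical helper, not existing code):

```python
import logging

log = logging.getLogger(__name__)
_warned = set()  # process-local, so this resets on every new task run

def warn_once(key, message, *args):
    # Hypothetical helper: emit each distinct warning only once per
    # process. It does nothing about the same warning recurring on the
    # *next* task run, which is the concern above.
    if key not in _warned:
        _warned.add(key)
        log.warning(message, *args)
```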
Sentry Issue: OPEN-8NA