sentry-io[bot] closed this issue 1 year ago
There is at least one mitxonline content file larger than the 10 MB Elasticsearch limit: document_4554_census80.csv (15 MB), for course run course-v1:MITxT+14.310x+2T2023. Most of the content extracted by tika seems useless. A subset:
```
workedm\tweeksm\twhitem\tblackm\thispm\tothracem\tsex1st\tsex2nd\tageq2nd\tageq3rd\tnumberkids\n\t0\t0\t1\t0\t0\t0\t0\t1\t30\t\t2\n\t1\t52\t1\t0\t0\t0\t0\t1\t9\t\t2\n\t1\t30\t1\t0\t0\t0\t1\t0\t22\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t1\t12\t\t2\n\t0\t0\t0\t1\t0\t0\t0\t1\t14\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t0\t50\t48\t3\n\t1\t22\t1\t0\t0\t0\t1\t1\t27\t\t2\n\t1\t26\t1\t0\t0\t0\t1\t1\t46\t22\t3\n\t1\t40\t1\t0\t0\t0\t0\t1\t7\t\t2\n\t0\t0\t1\t0\t0\t0\t1\t0\t25\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t1\t3\t\t2\n\t1\t52\t1\t0\t0\t0\t0\t0\t42\t\t2\n\t0\t0\t1\t0\t0\t0\t1\t0\t8\t\t2\n\t1\t52\t1\t0\t0\t0\t1\t0\t25\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t1\t22\t\t2\n\t0\t0\t1\t0\t0\t0\t0\t0\t11\t6\t3\n\t1\t52\t1\t0\t0\t0\t1\t0\t21\t\t2\n\t1\t10\t1\t0\t0\t0\t1
```
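For context, this is roughly how the extraction happens (a minimal sketch using the tika Python bindings; the pipeline's actual code may differ):

```python
# Minimal sketch of the extraction step using the tika Python bindings
# (an assumption; the actual pipeline code may differ). The `tika`
# package starts a local Tika server automatically, which requires Java.
from tika import parser

parsed = parser.from_file("document_4554_census80.csv")
content = parsed.get("content") or ""

# The whole extracted string is sent to Elasticsearch, so it is the
# encoded size of this value that runs into the 10 MB request limit.
print(f"extracted {len(content.encode('utf-8'))} bytes")
print(content[:200])
```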
Options:
- Increase http.max_content_length
- Log a warning and truncate the content so it is under the limit
@mbertrand IMO those are all valid solutions, but there is a catch with http.max_content_length. In this particular example, increasing http.max_content_length doesn't really get us anything, since the content seems to be nonsense (is this possibly a binary or proprietary file type rather than actually a CSV?).
But just because this file wouldn't benefit doesn't mean there aren't, or won't be, other files that would.
HOWEVER, and this is a bummer: AWS decides what this value will be for us, and we have no option to change it besides changing the instance type. If we were still rolling our own ES/OpenSearch, this would be straightforward to adjust.
https://docs.aws.amazon.com/opensearch-service/latest/developerguide/limits.html#network-limits
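For anyone curious, you can confirm the effective value via the cluster settings API. A sketch, where `<domain-endpoint>` is a placeholder and any auth an AWS-managed domain needs (e.g. request signing) is omitted:

```python
# Sketch: read the effective http.max_content_length from the cluster
# settings API. <domain-endpoint> is a placeholder; a managed AWS domain
# would also require authentication, which is omitted here for brevity.
import requests

resp = requests.get(
    "https://<domain-endpoint>/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
)
print(resp.json()["defaults"]["http.max_content_length"])
```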
😞
Thanks Mike! @pdpinch should I truncate any content that goes over the limit instead? This particular one looks like a tab-delimited file of census/demographics numbers, which is useless for search, but I'm not sure there's a reliable way to detect something like that. I found a Python module that might do it, but it hasn't been updated in 4 years: https://github.com/casics/nostril
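If we did try it, I'd guess the usage would look something like this (untested sketch; `nonsense()` is the function shown in the project's README, and it only scores runs of letters):

```python
# Untested sketch of gibberish detection with nostril, based on the
# nonsense() function shown in the project's README. The library is
# unmaintained, so treat the API details here as assumptions.
from nostril import nonsense

def looks_like_gibberish(text, sample_size=1000):
    # nostril scores sequences of letters, so strip everything else;
    # a numeric table like the census file would have almost none left.
    letters = "".join(ch for ch in text if ch.isalpha())[:sample_size]
    if len(letters) < 6:  # nostril needs at least a few characters
        return True
    try:
        return nonsense(letters)
    except ValueError:  # raised for input it can't score
        return True
```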
Detecting content over 10 MB is straightforward, though.
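Something like this is what I had in mind (a sketch; `MAX_CONTENT_BYTES` is a made-up constant, and in practice we'd leave headroom for the rest of the request body):

```python
import logging

log = logging.getLogger(__name__)

# Hypothetical constant: the 10 MB request cap, minus some headroom for
# the rest of the indexing request body around the content field.
MAX_CONTENT_BYTES = 10 * 1024 * 1024 - 100 * 1024

def truncate_for_index(name, content, max_bytes=MAX_CONTENT_BYTES):
    """Log a warning and cut content so it fits under the limit."""
    encoded = content.encode("utf-8")
    if len(encoded) <= max_bytes:
        return content
    log.warning(
        "Content for %s is %d bytes; truncating to %d",
        name, len(encoded), max_bytes,
    )
    # errors="ignore" drops any multi-byte character split by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```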
Wow, 10MB is such a low ceiling.
I like this option:
> Log a warning and truncate the content so it is under the limit
I didn't understand what is and is not reliable in your last comment @mbertrand. Let me know if my preference doesn't make sense.
@pdpinch I was referring to a way of determining whether the content of a file contains text that would be useful for search purposes: actual words and sentences instead of just numbers or random gibberish. But that seems like a separate issue anyway.
Your preference makes sense to me. The only thing about logging is that, since all the edx content files are processed on each task run, the warning will be logged on every run as well.
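We could at least keep it from repeating within a single run, though that wouldn't help across runs (a sketch; `warn_once` is a hypothetical helper, not existing code):

```python
import logging

log = logging.getLogger(__name__)
_warned = set()  # process-local, so this resets on every new task run

def warn_once(key, message, *args):
    # Hypothetical helper: emit each distinct warning only once per
    # process. It does nothing about the same warning recurring on the
    # *next* task run, which is the concern above.
    if key not in _warned:
        _warned.add(key)
        log.warning(message, *args)
```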
Sentry Issue: OPEN-8NA