open-cogsci / omm-server

Open Monkey Mind server
Apache License 2.0

'Out of sort memory' error when downloading data #92

Closed: smathot closed this issue 5 months ago

smathot commented 3 years ago

@nclaidiere experienced the following error when trying to download data:

[screenshot of the 'Out of sort memory' error]

I haven't seen this myself with v0.6.0, but I also don't have tremendously large datasets yet (a few thousand rows). @nclaidiere Can you give us a rough estimate of the number of trials that you're trying to download here? And I'm guessing this doesn't happen with small datasets, right?

dschreij commented 3 years ago

I haven't seen this before myself either. It may be solved by simply setting a parameter in the server settings, or the system that omm-server currently runs on may have too little RAM, in which case adding more would solve the issue.

nclaidiere commented 3 years ago

I think it may be linked to the multiple connections problem... but we will need to run a bunch of studies to see if that comes up again. Let's wait and see!

dschreij commented 3 years ago

Alright. Keep me posted. And if it does occur again, please provide the steps which may help us reproduce the problem.

dschreij commented 3 years ago

Has this error popped up more frequently? If not, could it have been spurious or hardware dependent?

nclaidiere commented 3 years ago

We haven't had another long go at it, so we don't know. We are waiting for the next release; I'll let you know if it appears again.

nclaidiere commented 2 years ago

Hi Guys!

So this error is back! Pascal was just prevented from pulling data for an experiment with a few hundred rows (so RAM is not the problem). From what I read, it seems to appear when MySQL sorts tables that contain JSON:

https://github.com/mozilla/addons/issues/8068

Our investigations so far:

- The query to reproduce it is simply `SELECT * FROM scanners_results ORDER BY scanners_results.created DESC LIMIT 10;`
- The problem is that `created` is not indexed (we thought it was). That forces MySQL to do a filesort for this query.
- Because some rows are quite large (they contain raw JSON), MySQL needs a lot of space to do the sorting.
- We're hoping an index would fix it. We're going to try to reproduce locally.
- It's still possible it was caused by the upgrade, but it could just be bad luck, caused by one big row added to the table recently.
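
For illustration, the usual first check and fix for this pattern look roughly like the following (using the table from the linked issue purely as an example; the index name is made up):

```sql
-- Check whether the query currently needs a filesort:
-- "Using filesort" in the Extra column means the sort cannot use an index.
EXPLAIN SELECT * FROM scanners_results
ORDER BY scanners_results.created DESC LIMIT 10;

-- Add a plain secondary index on the sort column, so the ten newest rows
-- can be read directly from the index instead of sorting the whole table.
CREATE INDEX idx_scanners_results_created ON scanners_results (created);
```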

See also the full thread on:

https://stackoverflow.com/questions/29575835/error-1038-out-of-sort-memory-consider-increasing-sort-buffer-size

The problem with simply increasing the sort buffer is clearly that the issue will reappear with larger experiments.

What are your thoughts?

Cheers, Nico (& Pascal)

smathot commented 2 years ago

Thanks for this thorough investigation :+1:

dschreij commented 2 years ago

Thanks for looking into this. I need to dive into this a bit further. I think we indeed sort by a timestamp such as created_at, so it might be possible to make this an index, which could resolve the sorting problem. It may also be problematic if the index has to be unique, because there is no such guarantee for a timestamp: if two participants add a data point at exactly the same time, the database would give an error. Even though the chance of that is quite slim, it should be taken into account.

dschreij commented 2 years ago

Also, if I work on this, it will be hard for me to reproduce the problem and determine whether my changes fixed it. @smathot have you been able to reproduce this issue?

smathot commented 2 years ago

@dschreij I haven't reproduced it myself, but I assume it's just a matter of seeding the database with enough data until the sort error occurs.

nclaidiere commented 2 years ago

I just asked Pascal whether the error is still occurring (I know he restarts the experiment regularly); if it is, I could download the DB and send it to you (I might need some directions to do that...). What do you think?

dschreij commented 2 years ago

I seeded the database with 1000 participants, all of whom had completed a study with 15 jobs. This results in 15,000 rows of data in Excel, which my laptop had no trouble exporting. I don't know how much data I have to seed to reproduce the problem, but since you say the error already occurred with around 250 records, I think we have to look at other causes or factors than MySQL.

Can you tell me a bit more about the hardware on which you are running the system? For instance, how many GB of RAM (memory) does it have, and how old is the machine?

[screenshot]

dschreij commented 2 years ago

Another difference is that Sebastiaan and I are running on Linux and macOS, while you, @nclaidiere, are probably using Windows. Additionally, @smathot are you using a locally installed MySQL or the Docker container? I am using the former at the moment, so it may be worth trying the same with a Docker container again.

dschreij commented 2 years ago

Further reading suggests that, rather than increasing the sort_buffer_size parameter of MySQL, I should try to optimize the query and only perform ORDER BY and GROUP BY operations on columns that are indexed (as @nclaidiere already pointed out in his investigation above). The query I use now performs a lot of ORDER BY and GROUP BY operations on columns that are not indexed, so there is definitely room for improvement in this regard. I needed to do this to efficiently aggregate the data collected for jobs, but I may be able to improve on it by using subqueries. The only problem is that I cannot tell whether my changes have any effect, because I cannot reproduce this issue myself.

https://github.com/open-cogsci/omm-server/blob/master/app/Models/Study.js#L314-L332
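
For illustration, the general idea is to keep the heavy JSON columns out of the grouping and sorting, and to group and order only on key columns in a subquery. A rough sketch with simplified columns (not the actual query from Study.js; `n_results` is a made-up aggregate, and the key columns are assumed to be indexed as primary/foreign keys):

```sql
-- Sketch: aggregate per job in a subquery that only touches key columns,
-- then join the result back, so the outer query never sorts or groups on
-- the large JSON payloads themselves.
SELECT jobs.id, jobs.position, agg.n_results
FROM jobs
JOIN (
    SELECT job_id, COUNT(*) AS n_results
    FROM job_states
    GROUP BY job_id
) AS agg ON agg.job_id = jobs.id
WHERE jobs.study_id = 10
ORDER BY jobs.position;
```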

dschreij commented 2 years ago

After doing some more reading, I found that this is a reported bug in MySQL :)

https://bugs.mysql.com/bug.php?id=103225

The culprit is indeed the JSON column (as stated earlier), and even sorting on indexed columns did not remedy this issue in some cases. The bug should be fixed from MySQL 8.0.28 onwards. Can you check whether redownloading the mysql container (forcing a download of the new version) fixes this issue? You can discard your current mysql container by clicking the trash can icon next to it in Docker Desktop. If you execute `docker-compose up -d` again, the new mysql image should automatically be pulled in.

dschreij commented 2 years ago

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-28.html

This version has not been released yet, so the above steps are not useful for now. Better to keep an eye on the release notes and pull in the new MySQL once 8.0.28 is released.

smathot commented 2 years ago

Good to hear that at least the cause is clear! The bug report also says that MySQL 8.0.17 did not suffer from this issue. Since we don't know when 8.0.28 will be released, and we shouldn't let @nclaidiere et al. wait unnecessarily, could they downgrade the MySQL container to 8.0.17 in the meantime?

dschreij commented 2 years ago

Yes, you can indicate the specific version you'd like to have in the docker-compose.yml file: https://support.circleci.com/hc/en-us/articles/115015742147-Pinning-a-Docker-Image-to-a-Specific-Version

If you then delete and pull in MySQL again, you should have that version.

smathot commented 2 years ago

Ok, good. Can you update the docker-compose.yml, test it, and then give clear instructions to everyone on how to pull the correct versions? That way we can sort this out right now. :smile_cat:

nclaidiere commented 2 years ago

Fantastic, that sounds like a very good solution. Happy new year to you!

dschreij commented 2 years ago

> Ok, good. Can you update the docker-compose.yml, test it, and then give clear instructions to everyone on how to pull the correct versions? That way we can sort this out right now. 😸

I can test it, but because I cannot reproduce the issue myself, the real testing has to be done by @nclaidiere.

I tried downgrading MySQL from 8.0.27 to 8.0.16, but this does not work out of the box. Apparently the MySQL redo log format is tied to a specific version, so when I launch MySQL with the older version, I get this message after startup:

[InnoDB] Unknown redo log format (4). Please follow the instructions at http://dev.mysql.com/doc/refman/8.0/en/upgrading-downgrading.html.

and the process shuts down.

According to https://stackoverflow.com/a/50401030/16169022, this can be remedied by deleting the log files, but the data needs to be backed up beforehand.

The MySQL devs themselves say downgrading MySQL versions is not supported: https://dev.mysql.com/doc/refman/8.0/en/downgrading.html

So this is not an easy thing to do, and it carries a minor risk of data loss. What do you recommend? It may be worth waiting for MySQL 8.0.28 after all. I expect this version to be released fairly soon, as there have been at most three months between previous releases:

[screenshot of recent MySQL 8.0 release dates]

nclaidiere commented 2 years ago

OK, I can't do it myself for now; the lab is shut down because of COVID for at least two weeks. So this would have to be Pascal, and I prefer not to mess too much with Pascal's system, so I suggest we wait for a bit and cross our fingers!!!! Meanwhile, extracting the data through other means would be great.

dschreij commented 2 years ago

The new MySQL version 8.0.28 has recently been released. You should be able to update simply by running `docker-compose pull mysql` in the folder that contains the docker-compose.yml. Let me know whether this fixes the issue or the hunt goes on.
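
For reference, once the container has been recreated, one way to verify that the upgrade actually took effect is to ask the server for its version:

```sql
-- Should report 8.0.28 (or later) once the new image is running
SELECT VERSION();
```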

nclaidiere commented 2 years ago

I think this can be closed now.

smathot commented 2 years ago

This error (or something that resembles it) still persists with MySQL 8.0.29, although it seems to require more data now. I left an automated test run running overnight and in the morning it had crashed. The direct reason for the crash seems to be that the current_job_idx endpoint also crashes with an out-of-sort-memory error.

My tentative diagnosis is the following: this test experiment was 'infinite', which means that there were only two jobs (for two conditions) which were continuously reset at the end of each trial. This is a typical scenario for a default experiment that is mainly there to give monkeys something to do.

The test run crashed after about 250 trials, each of which stored quite a bit of data (all variables were logged), resulting in about 500 KB of data stored across two blobs in these two job_statuses. I suspect that the size of the individual blobs is what causes the issue, not so much the number of jobs.
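
For anyone who wants to check this hypothesis against a live database, the blob sizes can be inspected directly; a minimal sketch, assuming the result data lives in the `data` column of `job_states` as in the query quoted later in this thread:

```sql
-- List the largest stored result blobs (in bytes) to see whether a few
-- very large JSON fields, rather than the number of jobs, track the crash.
SELECT job_id, participant_id, LENGTH(data) AS blob_bytes
FROM job_states
ORDER BY blob_bytes DESC
LIMIT 10;
```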

Relevant screenshots:

Screenshot from 2022-06-24 09-56-16

So the crash is no longer limited to downloading the data, but also affects other endpoints. Downloading the data is also affected though:

Screenshot from 2022-06-24 09-51-32

This is with OMM 0.7.4 and MySQL 8.0.29:

Screenshot from 2022-06-24 09-52-16

dschreij commented 2 years ago

Can you provide me with an SQL file with a DB export?

smathot commented 2 years ago

> Can you provide me with an SQL file with a DB export?

I cleared the database, but you can reproduce it with the match-to-sample task from the examples. To save time, you can change the touch_response so that it times out very quickly, and change the entry point so that it automatically announces the same participant over and over again. In the screenshot below you see how you can do that, notably by having _s around the participant name. (For technical reasons.) This way you can do an automated test run, which will trigger this issue after a few hundred trials.

Screenshot from 2022-07-04 14-02-58

dschreij commented 2 years ago

Thanks, I'll try it out soon, but I will first need to set up the omm-client on my machine. It seems that ordering by a non-indexed column often causes the out-of-sort-memory error. I may be able to solve this by also indexing the job position column, or by creating a composite index on the (id, position) columns of the jobs table.
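
For illustration, either variant could look roughly like this (the index names are made up, and in practice this would presumably be added through a database migration rather than run by hand):

```sql
-- Option 1: a plain secondary index on the job position column
CREATE INDEX idx_jobs_position ON jobs (position);

-- Option 2: a composite index on (id, position), as suggested above
CREATE INDEX idx_jobs_id_position ON jobs (id, position);
```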

dschreij commented 2 years ago

To start with, I drastically improved the query that is used to export the collected data. After some research into this issue, I found that GROUP BY or ORDER BY operations on non-indexed columns are usually the culprit, so I slimmed those down by making smarter use of subqueries. As far as I can tell, the output data file is identical (e.g. same columns, all data present), but I'd like you (or maybe only @smathot at this point) to confirm that I am right on this.

The next step is to stress test the system with this new query by creating massive amounts of data as suggested by Sebastiaan above. If one of you already sees a chance to test it out on a big dataset you have at hand, please feel free to do so. Please do note that I haven't released these changes yet and they are only on the develop branch of this project.

smathot commented 2 years ago

:+1: Thanks, that sounds good. Did you run any tests with the procedure I described above? I want to make sure that this passes before we do more extensive testing in cubicles at Rousset.

dschreij commented 2 years ago

> The next step is to stress test the system with this new query by creating massive amounts of data as suggested by Sebastiaan above.

Nope ;)

I assumed there was a lot to be gained from improving the query and adding some indices, so I did that first. I am fairly sure this was the cause of the problem after doing some research into this error, but I still need to set up an OpenSesame installation to do the automated stress test.

smathot commented 2 years ago

> I am fairly sure this was the cause of the problem after doing some research into this error, but I still need to set up an OpenSesame installation to do the automated stress test.

Please test first! You don't need to do an extensive stress test (we will do that), but it's important to test the basics in the context in which it will be used, i.e. with the OMM client.

dschreij commented 2 years ago

> I cleared the database, but you can reproduce it with the match-to-sample task from the examples. To save time, you can change the touch_response so that it times out very quickly, and change the entry point so that it automatically announces the same participant over and over again. In the screenshot below you see how you can do that, notably by having _s around the participant name. (For technical reasons.) This way you can do an automated test run, which will trigger this issue after a few hundred trials.

I tried setting this up, but the match-to-sample task in the examples folder is completely different from your screenshot above? In any case, the example experiment doesn't even include an announce item. I will tinker a bit to try to get it to work, or maybe you have a better suggestion for an alternative experiment.

smathot commented 2 years ago

The match-to-sample experiment should be uploaded to the server, but not opened in OpenSesame. Rather, in OpenSesame you need to create an entry point with the OMM extension, which is what you see in the screenshot above.

dschreij commented 2 years ago

Right, thanks! (duh...)

(the earlier posted error no longer occurs when the study is uploaded, so I removed this in an edit)

The study appears to run now, but it doesn't go really fast. Looks like one trial every 5 seconds or so. I changed the timeout of the touch response to 50 ms. Am I doing something else incorrectly?

It's going to take a while to reach the necessary number of trials this way, I think :)

smathot commented 2 years ago

> The study appears to run now, but it doesn't go really fast. Looks like one trial every 5 seconds or so. I changed the timeout of the touch response to 50 ms. Am I doing something else incorrectly?

The touch response should indeed have a short timeout, and you can also reduce the duration of other items (if any), such as sketchpads. What remains is of course the delay due to the communication with the server, but for me this is very small, so I think it ran at about 3 trials per second.

dschreij commented 2 years ago

Hmm, OK. I changed all timeouts to 1 ms, but communicating with the server still takes long, about 2 seconds per request on average:

[screenshot of request timings]

so I think I am going to abort this test run to be able to look into the other bugs, and leave this running overnight or something.

smathot commented 2 years ago

Are you using a remote server? I tested it with two computers on the same local network.

dschreij commented 2 years ago

No, I have everything running on my (actually very new) PC, but on Windows. MySQL runs in Docker, and omm-server runs locally (with `npm run dev`), as does the client.

dschreij commented 2 years ago

I let it run for a while, and the error still occurs:

code: 'ER_OUT_OF_SORTMEMORY',
  errno: 1038,
  sqlState: 'HY001',
  sqlMessage: 'Out of sort memory, consider increasing server sort buffer size',
  sql: 'select `jobs`.*, `job_states`.`job_id` as `pivot_job_id`, `job_states`.`participant_id` as `pivot_participant_id`, `job_states`.`data` as `pivot_data`, `job_states`.`status_id` as `pivot_status_id`, `job_states`.`status_id` as `pivot_status_id` from `jobs` inner join `job_states` on `jobs`.`id` = `job_states`.`job_id` where `study_id` = 10 and `job_states`.`status_id` in (1, 2) and `job_states`.`participant_id` = 101 order by `pivot_status_id` desc, `position` asc limit 1',
  status: 500

Quite a bummer. I'm running out of ideas here and will need more time to investigate again.

smathot commented 2 years ago

Ok, I think this is a fundamental design issue due to how multiple results for a single job are stored in a single large JSON field. This is bound to break at some point for experiments that run infinitely, thus resulting in infinitely growing JSON fields. If we optimize things (as you have probably done now), it will just take longer to break, but won't really fix anything.

Let's consider this a limitation, and work around it by implementing infinite experiments through very large job tables. For that purpose though, #115 should be improved, because right now uploading large job tables (of, say, more than 10,000 rows) is slow to the point of being unusable.

dschreij commented 2 years ago

> Ok, I think this is a fundamental design issue due to how multiple results for a single job are stored in a single large JSON field. This is bound to break at some point for experiments that run infinitely, thus resulting in infinitely growing JSON fields. If we optimize things (as you have probably done now), it will just take longer to break, but won't really fix anything.

Is it though? I am not completely convinced of this yet. What I have read about the out-of-sort-memory error is that it is often due to sorting on or grouping by columns that are not indexed. In that case, MySQL falls back to an inefficient sorting algorithm that requires a lot of memory.

I also did some more searching on this problem and found that you can increase the sort buffer size with a config setting (see https://stackoverflow.com/questions/29575835/error-1038-out-of-sort-memory-consider-increasing-sort-buffer-size), but I don't know what a sensible value would be. Or, as you also say, we may just hit the ceiling later and be confronted with this problem again for bigger data sizes (as is aptly pointed out in that comment).
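
For reference, the buffer can also be inspected and adjusted at runtime while experimenting, before committing a value to the config; the 4 MB value below is purely illustrative:

```sql
-- Show the current value; the MySQL 8.0 default is 262144 bytes (256 KB)
SHOW VARIABLES LIKE 'sort_buffer_size';

-- Raise it for the current session only (value in bytes, here 4 MB)
SET SESSION sort_buffer_size = 4194304;

-- Or raise it server-wide for new connections, until the next restart
SET GLOBAL sort_buffer_size = 4194304;
```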

dschreij commented 2 years ago

Also, see https://stackoverflow.com/a/65128692. Since we use Docker, we can easily play around with the sort buffer size by adding it to the docker-compose.yml, e.g.:

db:
    image: mysql:8
    command: --sort_buffer_size=512K

smathot commented 2 years ago

Before we invest any more effort in this, consider the following: given that an infinite experiment may, after a while, accumulate tens of thousands of result sets for a single job, and that all of these have to be stored in a single JSON field, will this ever work? Even if the sort error can be solved, at some point we're going to run into other limits, right? I'm not 100% sure about this, because I don't know how MySQL behaves when fields become extremely large.

nclaidiere commented 2 years ago

Hi, just to set some reasonable limits: 10,000 repetitions of the same job is definitely within reach (a long but normal experiment), and 100,000 too (this is typically a background task that comes up on a regular basis). I think 1M is a reasonable limit for the system (this would take several years of background tasks; a new experiment with a new name could then be started). I hope this helps.

smathot commented 5 months ago

Closing in favor of #155