Production freezing up on certain Resque tasks

rcchen / cs210-stockholm

Repository for CS210: Stockholm Syndrome (Compuverde)

Production freezing up on certain Resque tasks #75

Open rcchen opened 10 years ago

rcchen commented 10 years ago

Examples include the Olympic dataset, which is very concerning. Does not freeze up on local machines (tested on OS X 10.9.2)

rcchen commented 10 years ago

Was originally freezing up on OlympicSmall but a server reboot appears to have mitigated the issue.

trumanc commented 10 years ago

It sounds like this is no longer a problem after we reworked and condensed the datadoc representations in the database, so I'm closing this.

rcchen commented 10 years ago

Resque still has occasional issues that I don't totally understand yet, which is kind of awkward. It quit about 90% of the way through the Olympic data set.

rcchen commented 10 years ago

Found the error: ** [22:14:25 2014-04-28] 8769: Failed to start worker : #<Errno::ENOMEM: Cannot allocate memory - fork(2)>

trumanc commented 10 years ago

Would this be caused by the fact that we create one monolithic dataset object with all of the datadocs, and then do one save so it has to store everything in memory?

Is it possible to do more incremental saves by doing

    docs.each do |doc|
      doc.dataset = @dataset
      doc.save
    end

instead of

    docs.each do |doc|
      @dataset.datadocs << doc
    end
    @dataset.save

rcchen commented 10 years ago

I think that we should spawn off a separate worker job for every x rows in a data set (say like 1000 rows) so we don't do massive 16k row chunks.
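A minimal sketch of that chunking idea, assuming the rows can be addressed by offset and limit. The helper name `enqueue_in_chunks` and the job class mentioned in the comment are hypothetical. The block stands in for whatever actually queues the work, which with Resque would presumably be a `Resque.enqueue` call.

```ruby
# Hypothetical helper: split a data set of `total_rows` rows into
# fixed-size chunks and hand each (offset, limit) pair to the caller.
# In production the block might call something like
#   Resque.enqueue(ImportChunkJob, dataset_id, offset, limit)
# where ImportChunkJob is an assumed job class, not one from this repo.
CHUNK_SIZE = 1000

def enqueue_in_chunks(total_rows, chunk_size = CHUNK_SIZE)
  jobs = []
  (0...total_rows).step(chunk_size) do |offset|
    # The last chunk may be shorter than chunk_size.
    limit = [chunk_size, total_rows - offset].min
    yield(offset, limit) if block_given?
    jobs << [offset, limit]
  end
  jobs
end

# Usage: a 16k-row data set becomes 16 jobs of 1000 rows each.
enqueue_in_chunks(16_000) do |offset, limit|
  # enqueue the job for rows offset...(offset + limit) here
end
```

Each job then only has to hold its own slice of rows in memory, at the cost of needing some way to tell when the last chunk has finished.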

rcchen commented 10 years ago

That being said, I would not discount the possibility of a major memory leak on our system.

rcchen commented 10 years ago

OK this is pretty awkward.

    root@syndrome:~# free
                 total       used       free     shared    buffers     cached
    Mem:        503240     426012      77228          0      11420      38140
    -/+ buffers/cache:     376452     126788
    Swap:            0          0          0

I'm going to add swap space to see if that resolves memory issues.
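For reference, the `-/+ buffers/cache` row can be re-derived from the `Mem:` row. This is only a sketch of that arithmetic, assuming the values are kilobytes (the default unit for `free`). The method name `memory_summary` is made up for illustration.

```ruby
# Re-derive the "-/+ buffers/cache" row of `free` from the "Mem:" row.
# All values are assumed to be in kilobytes.
def memory_summary(total:, used:, free:, buffers:, cached:)
  # Memory actually held by applications: buffers and page cache are
  # reclaimable, so they are subtracted from "used".
  app_used = used - buffers - cached
  # Memory a new process could get: free plus the reclaimable parts.
  available = free + buffers + cached
  {
    app_used: app_used,
    available: available,
    app_used_pct: (100.0 * app_used / total).round(1)
  }
end

summary = memory_summary(total: 503_240, used: 426_012, free: 77_228,
                         buffers: 11_420, cached: 38_140)
puts summary
```

With no swap configured, whatever is not available in RAM cannot be allocated at all, which is consistent with the `fork(2)` failing with `ENOMEM` while the worker is large.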

rcchen commented 10 years ago

Was able to process 25k rows (9k + 16k) on a single worker, so I think the issues may be resolved now. The worker doesn't run automatically on production because I haven't figured out how to spawn and shut down worker threads programmatically yet, but at the very least it isn't that big of an issue anymore.
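One possible way to start and stop a worker from Ruby, sketched under assumptions: Resque workers run as separate processes, so this uses `Process.spawn` and a signal. The helper names `start_worker` and `stop_worker` are hypothetical, and the default command assumes the app starts workers with `bundle exec rake resque:work` and a `QUEUE` environment variable, which may not match this deployment.

```ruby
# Start a worker as a child process and return its pid.
# The default command and env are assumptions about how this app runs Resque.
def start_worker(command = ["bundle", "exec", "rake", "resque:work"],
                 env = { "QUEUE" => "*" })
  Process.spawn(env, *command)
end

# Signal the worker and wait for it to exit; returns its Process::Status.
# Resque documents QUIT as the graceful shutdown signal (finish the current
# job, then exit) and TERM as an immediate shutdown.
def stop_worker(pid, signal = "QUIT")
  Process.kill(signal, pid)
  _, status = Process.wait2(pid)
  status
end

# Usage (not run here, since it needs the app's Rakefile):
#   pid = start_worker
#   ... later ...
#   stop_worker(pid)
```

This only covers a worker started by the same parent process. A process supervisor would likely be a sturdier choice on production.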