1) maybe we need another update, or
2) restore to last good copy and then update, or
3) full rebuild.
Original comment by zack...@gmail.com
on 5 Sep 2014 at 5:59
I am going to try approach 2) first.
I am going to put up a notice saying:
"We are working on reported issues. Sorry for the inconvenience."
Original comment by natalia....@gmail.com
on 5 Sep 2014 at 6:53
There were 3 insertQC.pl processes running, probably from previous days. I killed
them, restored the db to the August 25 copy, and started the update.
Original comment by natalia....@gmail.com
on 5 Sep 2014 at 7:42
Yeah, that was probably the problem.
We should check whether another process is already running and die if so, to
prevent this.
Original comment by zack...@gmail.com
on 5 Sep 2014 at 7:51
How about making sequencingDevel_Everyday.sh look like this:
#!/bin/sh
# Count running insertQC.pl processes; the bracket in '[i]nsertQC'
# stops grep from matching its own command line, so the count is 0
# when no update is running.
p=`ps -ef | grep -c '[i]nsertQC\.pl'`
if [ "$p" -eq 0 ]
then
    ./insertQC.pl 4
fi
Original comment by natalia....@gmail.com
on 5 Sep 2014 at 8:37
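For reference, the same guard could also live inside insertQC.pl itself as an
advisory file lock, which is less fragile than grepping ps output. A minimal
Perl sketch, assuming a writable lock-file path (the path is hypothetical):

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Take a non-blocking exclusive lock; if another insertQC.pl already
# holds it, die immediately instead of starting a second update.
open my $lock, '>', '/var/run/insertQC.lock'
    or die "cannot open lock file: $!";
flock($lock, LOCK_EX | LOCK_NB)
    or die "another insertQC.pl is already running, exiting\n";

# ... rest of insertQC.pl; the lock is released when the process exits.

The lock is released automatically even if the script crashes, so a failed run
cannot leave a stale guard behind.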
Storage is still slow. It takes 3 seconds for ls /storage to return.
Original comment by natalia....@gmail.com
on 5 Sep 2014 at 8:42
Unfortunately this is somewhat expected, since a lot of disk operations are
happening on the head node right now.
Original comment by zack...@gmail.com
on 5 Sep 2014 at 9:17
I started the update at 12:40 and the process has been hanging since 18:22. I am
curious when storage is going to be back to normal. I disabled the scheduled
update since the current one is hanging.
Original comment by natalia....@gmail.com
on 6 Sep 2014 at 3:42
Is it completely hung or just very slow? There is a big difference between the
two.
You can ssh into hpc-laird and check the load. It looks as if Moiz is backing up
hundreds of gigs.
Original comment by zack...@gmail.com
on 6 Sep 2014 at 3:46
Issue 807 has been merged into this issue.
Original comment by zack...@gmail.com
on 6 Sep 2014 at 3:46
Also, have you checked the load on the other VM, the storage gateway? If there
are many stalled rsyncs happening there, then it is definitely part of the
problem.
Original comment by zack...@gmail.com
on 6 Sep 2014 at 3:49
Never mind, I checked and there were some rsyncs running; I killed them.
Original comment by zack...@gmail.com
on 6 Sep 2014 at 3:51
The update log had not moved for 2.5 hours, which was not good.
Once I cleaned up some rsyncs etc., it started moving again.
Original comment by zack...@gmail.com
on 6 Sep 2014 at 3:54
Looks like things sped up big time. Moiz moved over to hpc-uec for his backups.
Anyway, the update finished; please verify that things are working correctly.
Original comment by zack...@gmail.com
on 6 Sep 2014 at 7:53
Looks normal to me, so I am removing the notice.
Original comment by natalia....@gmail.com
on 6 Sep 2014 at 10:44
Beta is still messed up.
Original comment by zack...@gmail.com
on 9 Sep 2014 at 5:57
Started the rebuild.
Original comment by natalia....@gmail.com
on 9 Sep 2014 at 7:07
The rebuild failed at 1 AM with the message "DBD::mysql::st execute failed:
MySQL server has gone away at ./insertQC.pl line 694."
It had been running for 12 hours, which is slow for a parallel rebuild. I guess
somebody transferred their data again yesterday, which slowed down the update.
Original comment by natalia....@gmail.com
on 10 Sep 2014 at 6:36
The rebuild was failing for two days because of a db connection timeout. When
running a parallel rebuild, the update script establishes a db connection and
then calls itself recursively. The established connection remains idle until
all the threads are finished. MySQL's default limit for keeping an idle
connection alive (wait_timeout) is 8 hours. Our update was running for more
than 12 hours, which caused the mysql server to close the idle connection.
To prevent this in the future, it's better to close the connection before
spawning the threads and re-open it after all the threads are done.
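A minimal Perl sketch of that fix, assuming a DBI-based script that forks
workers (the DSN, credentials, and child invocation are assumptions, not
insertQC.pl's real code):

use strict;
use warnings;
use DBI;

my $dsn = 'DBI:mysql:database=qc;host=localhost';   # assumed DSN
my ($user, $pass) = ('qc_user', 'secret');          # assumed credentials

my $dbh = DBI->connect($dsn, $user, $pass, { RaiseError => 1 });
# ... per-run setup queries here ...

# Close the parent's idle handle before spawning workers; otherwise
# MySQL's wait_timeout (8 hours by default) drops it if the workers
# run longer than that, giving "MySQL server has gone away".
$dbh->disconnect;

my @pids;
for my $chunk (1 .. 4) {          # parallel degree, as in "insertQC.pl 4"
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: the script calls itself; each child opens and closes
        # its own connection.
        exec './insertQC.pl', $chunk;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;

# Re-open a fresh connection for the final bookkeeping queries.
$dbh = DBI->connect($dsn, $user, $pass, { RaiseError => 1 });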
Original comment by natalia....@gmail.com
on 12 Sep 2014 at 7:44
The question is: what caused the update to run for over 12 hours? Was it related
to the migration of epifire2 to the new VM?
Original comment by natalia....@gmail.com
on 19 Sep 2014 at 11:30
High load on the storage servers.
-zack (mobile)
Original comment by zack...@gmail.com
on 19 Sep 2014 at 11:32
Original issue reported on code.google.com by cmnico...@gmail.com
on 5 Sep 2014 at 5:54