uec / Issue.Tracker

Automatically exported from code.google.com/p/usc-epigenome-center

ECDP not working correctly #808

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Hi:
  I'm not seeing all the runs in ECDP, and some of the runs that do display aren't showing the proper designations (i.e. analyses that are done are not marked as such). Also, this seems like a good time to put up a notice that the system is having some issues, rather than the default launch screen saying there are none. Thanks!

Original issue reported on code.google.com by cmnico...@gmail.com on 5 Sep 2014 at 5:54

GoogleCodeExporter commented 8 years ago
1) maybe we need another update, or
2) restore to last good copy and then update, or
3) full rebuild.

Original comment by zack...@gmail.com on 5 Sep 2014 at 5:59

GoogleCodeExporter commented 8 years ago
I am going to try approach 2) first.
I am going to put up a notice saying:
"We are working on reported issues. Sorry for the inconvenience."

Original comment by natalia....@gmail.com on 5 Sep 2014 at 6:53

GoogleCodeExporter commented 8 years ago
There were 3 insertQC.pl processes running, probably from previous days. I killed them, restored the db to the August 25 copy, and started the update.

Original comment by natalia....@gmail.com on 5 Sep 2014 at 7:42

GoogleCodeExporter commented 8 years ago
Yeah, that was probably the problem.

We should check whether another process is already running and die if so, to prevent this.
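
For instance (just a sketch, not code from this repo; the lock-file path and the exact placement are assumptions), a flock-based guard near the top of insertQC.pl would make a second instance die immediately:

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Take an exclusive, non-blocking lock; if another insertQC.pl already
# holds it, flock fails and this instance dies instead of running concurrently.
open(my $lock_fh, '>', '/tmp/insertQC.lock')
    or die "Cannot open lock file: $!";
flock($lock_fh, LOCK_EX | LOCK_NB)
    or die "Another insertQC.pl is already running, exiting.\n";

# ... the rest of the update logic runs here while the lock is held ...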

Original comment by zack...@gmail.com on 5 Sep 2014 at 7:51

GoogleCodeExporter commented 8 years ago
How about making sequencingDevel_Everyday.sh look like this:

#!/bin/sh
# Count lines matching insertQC.pl; the grep itself always matches,
# so a count of 1 means no other insertQC.pl process is running.
p=`ps -ef | grep insertQC.pl | awk 'END{print NR}'`
if [ "$p" -eq 1 ]
then
    ./insertQC.pl 4
fi

Original comment by natalia....@gmail.com on 5 Sep 2014 at 8:37

GoogleCodeExporter commented 8 years ago
Storage is still slow. It takes 3 sec for ls /storage to return

Original comment by natalia....@gmail.com on 5 Sep 2014 at 8:42

GoogleCodeExporter commented 8 years ago
Unfortunately this is somewhat expected, since a lot of disk operations are 
happening on the head node right now. 

Original comment by zack...@gmail.com on 5 Sep 2014 at 9:17

GoogleCodeExporter commented 8 years ago
I started the update at 12:40 and the process has been hanging since 18:22. I am 
curious when storage is going to be back to normal. I disabled the update since 
the current one is hanging.

Original comment by natalia....@gmail.com on 6 Sep 2014 at 3:42

GoogleCodeExporter commented 8 years ago
Is it completely hung or just very slow? There is a big difference between the 
two.

You can ssh into hpc-laird and see the load. It looks as if Moiz is backing up 
hundreds of gigs.

Original comment by zack...@gmail.com on 6 Sep 2014 at 3:46

GoogleCodeExporter commented 8 years ago
Issue 807 has been merged into this issue.

Original comment by zack...@gmail.com on 6 Sep 2014 at 3:46

GoogleCodeExporter commented 8 years ago
Also, have you checked the load on the other VM, the storage gateway? If there 
are many stalled rsyncs happening there, then that is definitely part of the 
problem.

Original comment by zack...@gmail.com on 6 Sep 2014 at 3:49

GoogleCodeExporter commented 8 years ago
Never mind, I checked and there were some rsyncs going; I killed them.

Original comment by zack...@gmail.com on 6 Sep 2014 at 3:51

GoogleCodeExporter commented 8 years ago
The update log had not moved for 2.5 hours, which was not good.

Once I cleaned up some rsyncs etc., it started moving again.

Original comment by zack...@gmail.com on 6 Sep 2014 at 3:54

GoogleCodeExporter commented 8 years ago
Looks like things sped up big time. Moiz moved over to hpc-uec for his backups. 

Anyway, the update finished; please verify that things are working correctly.

Original comment by zack...@gmail.com on 6 Sep 2014 at 7:53

GoogleCodeExporter commented 8 years ago
Looks normal to me, so I am removing the notice.

Original comment by natalia....@gmail.com on 6 Sep 2014 at 10:44

GoogleCodeExporter commented 8 years ago
beta is still messed up

Original comment by zack...@gmail.com on 9 Sep 2014 at 5:57

GoogleCodeExporter commented 8 years ago
Started the rebuild.

Original comment by natalia....@gmail.com on 9 Sep 2014 at 7:07

GoogleCodeExporter commented 8 years ago
The rebuild failed at 1 AM with the message "DBD::mysql::st execute failed: 
MySQL server has gone away at ./insertQC.pl line 694."
It had been running for 12 hours, which is slow for a parallel rebuild. I guess 
somebody transferred their data again yesterday, which slowed down the update.

Original comment by natalia....@gmail.com on 10 Sep 2014 at 6:36

GoogleCodeExporter commented 8 years ago
The rebuild was failing for two days because of a db connection timeout. When 
running a parallel rebuild, the update script establishes a db connection and 
then calls itself recursively. The established connection remains idle until 
all the threads are finished. MySQL's default maximum time for keeping an idle 
connection alive (wait_timeout) is 8 hours, and our update was running for more 
than 12 hours, which caused the MySQL server to close the connection.
To prevent this in the future, it's better to close the connection before 
spawning the threads and re-open it after all the threads are done.
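
Something along these lines would do it (a sketch only; the DSN, credentials, and the parallel-phase placeholder are assumptions, not the actual insertQC.pl code):

use strict;
use warnings;
use DBI;

# Placeholders for the real DSN and credentials used by insertQC.pl.
my ($dsn, $user, $password) = ('DBI:mysql:database=qc;host=localhost', 'qc_user', 'secret');

my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });
# ... quick setup queries with $dbh go here ...

# Close the parent's handle before the long parallel phase so it never
# sits idle past MySQL's wait_timeout (8 hours by default).
$dbh->disconnect;

# ... spawn the parallel rebuild threads here and wait for them all to finish ...

# Re-open a fresh connection for the final inserts/updates.
$dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });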

Original comment by natalia....@gmail.com on 12 Sep 2014 at 7:44

GoogleCodeExporter commented 8 years ago
The question is: what caused the update to run for over 12 hours? Was it related 
to the migration of epifire2 to the new VM?

Original comment by natalia....@gmail.com on 19 Sep 2014 at 11:30

GoogleCodeExporter commented 8 years ago
High load on storage servers

-zack (mobile)

Original comment by zack...@gmail.com on 19 Sep 2014 at 11:32