mattb112885 / clusterDbAnalysis

ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes

sqlite3 in setup_step1.sh incredibly slow #73

Closed jnesme closed 7 years ago

jnesme commented 8 years ago

Hi ITEP devs,

I'm currently trying to use ITEP from a local installation on my server. setup_step1.sh runs smoothly and the all-vs-all BLAST computation is pretty fast, but the database rebuild phase that follows takes forever: judging by the last file created, the BLAST computation ended at 22:48 yesterday, and the sqlite3 process has been running ever since (soon it will be 12 hours). The job isn't stalled, since I can see that the DATABASE.sqlite file is still being written to (40 GB for now). I'm currently comparing 25 bacterial genomes.

Is this normal, or am I having an I/O problem on the server side?

JosephRyanPeterson commented 8 years ago

Hello jnesme,

While not ideal, this is to be expected. For my genome set (~60 genomes) it takes about 48 hours to complete. While the BLAST portion of the ITEP database construction can be done in parallel, the construction of the SQL database currently relies on sqlite3 (which I don't believe runs in parallel).

~jrp

mattb112885 commented 8 years ago

That is slower than I remember, but it has been a while since I rebuilt a db. Unfortunately it is not possible to parallelize inserts in sqlite.

One thing I wonder about: SQLite has introduced a new journaling mechanism, write-ahead logging (WAL), which should be available in the version installed on the VM:

https://www.sqlite.org/wal.html

It may be worth investigating whether turning this on improves write performance during the database setup steps. If we do this, we would want to load the different sets of BLAST results separately so the WAL file doesn't get too big.
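
A minimal sketch of that idea, assuming Python's standard sqlite3 module and a SQLite build recent enough to support WAL and `wal_checkpoint(TRUNCATE)`. The function name, the `blast_hits` table, and its columns are hypothetical placeholders, not ITEP's actual schema or loading code, and whether this actually speeds up the rebuild has not been measured here. Each set of BLAST results is inserted in its own transaction, and the WAL file is checkpointed after each set so it does not keep growing:

```python
import sqlite3


def load_in_batches(db_path, batches):
    """Insert each batch in its own transaction with WAL enabled.

    `batches` is an iterable of lists of (query, target, evalue) tuples.
    The table and column names are hypothetical, not ITEP's schema.
    """
    conn = sqlite3.connect(db_path)
    try:
        # The pragma returns the journal mode actually in effect
        # ("wal" if the switch succeeded).
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        conn.execute(
            "CREATE TABLE IF NOT EXISTS blast_hits "
            "(query TEXT, target TEXT, evalue REAL)"
        )
        for rows in batches:
            # One transaction per set of BLAST results.
            with conn:
                conn.executemany(
                    "INSERT INTO blast_hits VALUES (?, ?, ?)", rows
                )
            # Fold the WAL back into the main database file and
            # truncate it before loading the next set.
            conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
        total = conn.execute(
            "SELECT COUNT(*) FROM blast_hits"
        ).fetchone()[0]
    finally:
        conn.close()
    return mode, total
```

Note that WAL only applies to on-disk databases, and the SQLite documentation says it does not work over network filesystems, which may matter on a shared cluster.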

jnesme commented 8 years ago

Ok, well, apparently the task finished; total time was around 25 hours. I was worried since we are having some issues on our computer cluster right now.

Thanks a lot for the answers.