uchicago-bio / 2015-Autumn-Forum


Tips, Question Changes and Clarifications #52

Closed tabinks closed 8 years ago

tabinks commented 8 years ago

Question 1

The version of mpiBLAST installed on RCC is not compatible with the nr database. Apparently, mpiBLAST is not very popular and hasn't been recompiled in a while. Instead, I have created an mpiBLAST database based on the Protein Data Bank. It is located at /project/mpcs56420/databases/pdb/. You will need to copy all of the pdb.fasta* files to a directory that the compute nodes can read (e.g. /scratch/midway/).
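If you would rather script the copy than do it by hand, here is a minimal sketch. The helper name `copy_database` is hypothetical, and the destination directory is something you have to choose yourself; nothing here is part of the tutorial.

```python
import glob
import os
import shutil


def copy_database(src_dir, dest_dir, pattern="pdb.fasta*"):
    """Copy every file in src_dir matching pattern into dest_dir.

    Hypothetical helper: dest_dir should be a directory that the
    compute nodes can read. Returns the list of destination paths.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
        if os.path.isfile(path):
            # copy2 preserves timestamps and returns the new file's path
            copied.append(shutil.copy2(path, dest_dir))
    return copied
```

You would call it as `copy_database("/project/mpcs56420/databases/pdb/", your_scratch_dir)`, where `your_scratch_dir` is a placeholder for your own directory.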

I have prepared a tutorial to help you set up and run an mpiBLAST job using this pdb database. Here is the sbatch script used in the tutorial.

Since you will not be able to directly compare the results to the blastplus results from last week, I have modified the questions. Please view the revised homework here.

tabinks commented 8 years ago

Question 3

I have put together a template script for using the Python multiprocessing library in a Slurm environment. This script iterates through 100 tasks using 1 node and 16 cores on a Sandy Bridge processor. The function process_worker receives a different parameter during each iteration. You can imagine using this parameter to specify which chunk of a database you should search against.

This is not a complete solution to the questions, but it is a great place to start.
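For readers who cannot open the linked template, the general pattern looks roughly like the sketch below. It is not the template itself: the body of `process_worker` is a placeholder, `run_tasks` is a hypothetical name, and the sketch assumes a Linux node (it uses the `fork` start method) and that the job was submitted with `--cpus-per-task`, which is when Slurm sets `SLURM_CPUS_PER_TASK`.

```python
import multiprocessing as mp
import os


def process_worker(task_id):
    """Placeholder work for one task.

    In the homework, task_id could select which chunk of the
    database to search; here it only returns a dummy value.
    """
    return task_id, task_id * task_id


def run_tasks(n_tasks=100, n_procs=None):
    """Run process_worker over task ids 0..n_tasks-1 in a process pool."""
    if n_procs is None:
        # Set by Slurm when --cpus-per-task is given; otherwise fall
        # back to the number of cores Python can see.
        n_procs = int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))
    ctx = mp.get_context("fork")  # assumption: Linux compute node
    with ctx.Pool(processes=n_procs) as pool:
        # map hands each worker a different task_id and keeps input order
        return pool.map(process_worker, range(n_tasks))


if __name__ == "__main__":
    results = run_tasks()
    print(f"finished {len(results)} tasks")
```

`pool.map` returns results in the same order as the task ids, which makes it easy to merge per-chunk output afterwards.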

tabinks commented 8 years ago

Question 2

I added the formula for speedup. You will need to run a serial version of the database search to get the baseline run time. Remember to use the pdb.fasta database, not the nr database.
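As a quick reference, speedup is usually defined as the serial run time divided by the parallel run time. The sketch below assumes the formula in the homework follows that standard definition; the helper names `timed` and `speedup` are hypothetical.

```python
import time


def timed(fn, *args, **kwargs):
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start


def speedup(t_serial, t_parallel):
    """Standard speedup: serial run time divided by parallel run time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel
```

For example, a serial search that takes 120 seconds and a parallel search that takes 10 seconds give `speedup(120.0, 10.0) == 12.0`.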

ghost commented 8 years ago

Should we still build the nr database for question #1 or use the pdb database? (Sorry if this question seems slack--there was another compilers deadline this weekend and now I'm catching up)

tabinks commented 8 years ago

No. Do not try to build nr. Just use the pdb database.