kosta777 / parallel-genomeseq

Parallelization of popular genome sequencing algorithms

Benchmarking queries against UNIPROT Database #28

Open kosta777 opened 4 years ago

kosta777 commented 4 years ago
  1. Download uniprot_sprot.fasta.gz from https://www.uniprot.org/downloads, listed under UniProtKB/Reviewed (Swiss-Prot).
  2. Gunzip the file and put it in the project/data/uniprot directory.
  3. On the Leonhard cluster, copy the whole project to the /cluster/scratch/username/ directory, because step 4 generates many files and that many files are not allowed in your home directory on the cluster.
  4. Run python reader.py uniprot_prepare from the py directory. This generates many files in data/uniprot, each containing one read from the database, plus a stats.txt file in data/uniprot that holds a single number: the number of files containing reads.
  5. If step 4 is taking too long, you can stop it at any point and create stats.txt manually, after checking how many files had been generated up to that point.
  6. Create a job from the binary bin/mpi_sw_solve_uniprot.
  7. After the job completes, check its standard output: each rank should have printed the run time in microseconds spent on updating cells and the total number of cells updated (this part needs to be further automated).
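
For step 5, counting the generated files by hand is error-prone, so here is a minimal sketch of a helper that counts them and writes stats.txt. It is not part of reader.py: the function name write_stats and the exclude list are hypothetical. It also assumes that every file in data/uniprot other than stats.txt and the original uniprot_sprot.fasta is a read file, and that stats.txt holds just the count followed by a newline. Check both assumptions against what reader.py actually produces before relying on it.

```python
from pathlib import Path


def write_stats(uniprot_dir, exclude=("stats.txt", "uniprot_sprot.fasta")):
    """Count the per-read files in uniprot_dir and write the count to stats.txt.

    Hypothetical helper: the exclude list and the stats.txt layout are
    assumptions, not taken from reader.py.
    """
    directory = Path(uniprot_dir)
    # Count regular files only, skipping the files assumed not to be reads.
    count = sum(
        1
        for path in directory.iterdir()
        if path.is_file() and path.name not in exclude
    )
    # Assumed format: a single integer followed by a newline.
    (directory / "stats.txt").write_text(f"{count}\n")
    return count
```

Assuming the layout described in the steps above, this could be called from the py directory as write_stats("../data/uniprot"). If step 4 was interrupted, the last file written may be incomplete, so it may be worth inspecting it before running the job.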

The protein used as a query is in data/query/; you can replace it with any protein you want.

Disclaimer: It was a long evening, so please watch for any obvious mistakes I made and either correct them or contact me if you spot one.