Questions about Question 4 (all v. all)

tjp36 commented 7 years ago

Hi,

I have a few questions about Question 4 (the all v. all question).

1) I am assuming that the program needs to run on the RCC. Is this correct?

2) As far as I am aware, the local/global sequence alignment code we wrote a few weeks ago did not have a PAM250 or BLOSUM62 option (or at least mine didn't...). All that code did was take in two sequences, a match score, a mismatch score, and a gap penalty, and output a matrix along with the best alignment score. I am unsure as to how to incorporate the different scoring matrices into my code as it exists now. Which leads me to my next question.

3) Should the code for this question be using the code we wrote for the Google App engine a few weeks ago, or should it be using blastp command line commands? To be more clear, which of the following two is correct:

a) The program starts reading the database. It starts out with the first sequence. It then runs a global/local alignment that we wrote a few weeks ago on every sequence in the database. At this point the first row is filled in. It then moves to the next row. etc.

b) The program starts reading the database. It starts out with the first sequence. For each sequence, a command similar to blastp -query sequence1.fa -subject sequence2.fa is run.

4) Should there be an option to specify global or local alignment in our program? Or should it just run global?

Thanks,

Ted

grantcupps commented 7 years ago

Hey Ted,

Here is how I interpreted the question:

It needs to run on the RCC
We should update our code from the Google App engine to use PAM250 or BLOSUM62 instead of a match / mismatch score.
We should run global or local alignment on every possible pair of sequences in the database.

I do have some additional questions:

Should there be a flag for global / local alignment?
How should the result be sorted?
What should the flag be for the second output file?

tjp36 commented 7 years ago

Grant,

Thanks for the response. So for PAM250, are you using a table like this:

http://prowl.rockefeller.edu/aainfo/pam250.htm

to calculate your scores? E.g. an R matched with A would get a score of -2, an R matched with an R would get a score of 6, etc., and a gap would be scored as whatever you specified as a command line argument?

grantcupps commented 7 years ago

That's how I interpreted the question, but I could be completely mistaken...

FYI, the NIH has an FTP site with the raw tables for each scoring matrix: ftp://ftp.ncbi.nih.gov/blast/matrices
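For anyone reading the raw tables instead of typing them in, here is a minimal sketch of a parser. It assumes the usual layout of those files: comment lines starting with `#`, a header row of residue letters, then one row per residue. The function name `parse_ncbi_matrix` and the two-residue sample are made up for illustration. The R/A and R/R values are the ones quoted above; the A/A value of 2 is what I believe PAM250 gives, so check it against the real file.

```python
def parse_ncbi_matrix(text):
    """Parse a whitespace-delimited substitution matrix into {(row, col): score}."""
    scores = {}
    columns = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if columns is None:
            columns = fields  # first data line is the header of column letters
            continue
        row, values = fields[0], fields[1:]
        for col, value in zip(columns, values):
            scores[(row, col)] = int(value)
    return scores


# Toy two-residue excerpt (an assumption for illustration, not the full table).
SAMPLE = """\
# Toy excerpt in the same layout as the raw matrix files
   A  R
A  2 -2
R -2  6
"""

pam_excerpt = parse_ncbi_matrix(SAMPLE)
print(pam_excerpt[("R", "A")], pam_excerpt[("R", "R")])
```

Storing both (row, col) orders keeps the lookup simple at the cost of a slightly larger dictionary.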

tabinks commented 7 years ago

1) I am assuming that the program needs to run on the RCC. Is this correct?

Yes. You should take advantage of running on RCC. The exact way you do this is up to you. There is no 'right' answer; it's whatever strategy you are most interested in exploring.

2) As far as I am aware, the local/global sequence alignment code we wrote a few weeks ago did not have a PAM250 or BLOSUM62 option (or at least mine didn't...). All that code did was take in two sequences, a match score, a mismatch score, and a gap penalty, and output a matrix along with the best alignment score. I am unsure as to how to incorporate the different scoring matrices into my code as it exists now. Which leads me to my next question.

Correct, we implemented a simple scoring system. You should read in both the PAM and BLOSUM matrices (or hardcode them in your software). Here is an example of hardcoding: http://biopython.org/DIST/docs/api/Bio.SubsMat.MatrixInfo-pysrc.html
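As a rough sketch of what the change looks like in the alignment code: the only difference from the match / mismatch version is that the diagonal step looks the residue pair up in the matrix. This is not the assignment solution, just one way to write it. The function name `alignment_score`, the linear gap penalty, and the two-residue `TOY_SCORES` dictionary are assumptions for illustration; a real run would load the full PAM250 or BLOSUM62 table.

```python
# Hypothetical two-residue excerpt; replace with the full matrix in practice.
TOY_SCORES = {("A", "A"): 2, ("A", "R"): -2, ("R", "A"): -2, ("R", "R"): 6}


def alignment_score(seq_a, seq_b, scores, gap, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) score.

    `gap` is a negative number applied per gap position (linear penalty).
    Only the score is returned, so two rows of the matrix are enough.
    """
    prev = [0 if local else j * gap for j in range(len(seq_b) + 1)]
    best = 0
    for i, a in enumerate(seq_a, start=1):
        curr = [0 if local else i * gap]
        for j, b in enumerate(seq_b, start=1):
            cell = max(
                prev[j - 1] + scores[(a, b)],  # matrix lookup replaces match/mismatch
                prev[j] + gap,                 # gap in seq_b
                curr[j - 1] + gap,             # gap in seq_a
            )
            if local:
                cell = max(cell, 0)            # local alignments never go below zero
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


print(alignment_score("AR", "AR", TOY_SCORES, gap=-4))
print(alignment_score("RA", "AR", TOY_SCORES, gap=-4, local=True))
```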

3) Should the code for this question be using the code we wrote for the Google App engine a few weeks ago, or should it be using blastp command line commands? To be more clear, which of the following two is correct:

Yes, the majority of your code should be from the app engine assignment. That part of the code shouldn't need to change.

a) The program starts reading the database. It starts out with the first sequence. It then runs a global/local alignment that we wrote a few weeks ago on every sequence in the database. At this point the first row is filled in. It then moves to the next row. etc.

b) The program starts reading the database. It starts out with the first sequence. For each sequence, a command similar to blastp -query sequence1.fa -subject sequence2.fa is run.

There are different ways of doing this. One would be this:

1) Read in the database

2) Iterate through one sequence at a time and run your alignment against the entire database. For example (pseudocode):

foreach i, seq in database:
  execute 'align.py seq database' > 'output' + i + '.txt'

3) Collate all the result files and then print them out. For example (pseudocode):

foreach i in 1..database.count:
  readfile('output' + i + '.txt')

Step 2 would be a good place to push each alignment job out to separate nodes.
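To make the three steps concrete, here is a small runnable sketch. Everything in it is an assumption for illustration: `toy_score` stands in for the real alignment routine, the database is a hardcoded list, and a local thread pool stands in for the cluster nodes. On RCC each `run_query` call would more likely be its own scheduled job.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def toy_score(seq_a, seq_b):
    # Placeholder for the real alignment: counts identical aligned positions.
    return sum(a == b for a, b in zip(seq_a, seq_b))


def run_query(index, database, out_dir):
    """Step 2: align one query against the whole database, write one file."""
    name, query = database[index]
    lines = [f"{name}\t{other}\t{toy_score(query, seq)}" for other, seq in database]
    path = Path(out_dir) / f"output{index}.txt"
    path.write_text("\n".join(lines) + "\n")
    return path


def collate(out_dir, count):
    """Step 3: read the per-query files back in order and gather all rows."""
    rows = []
    for index in range(count):
        text = (Path(out_dir) / f"output{index}.txt").read_text()
        rows.extend(line.split("\t") for line in text.splitlines())
    return rows


# Step 1: the "database" (hardcoded here; normally read from a FASTA file).
database = [("seq1", "ARND"), ("seq2", "ARNA"), ("seq3", "RRRR")]

with tempfile.TemporaryDirectory() as out_dir:
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Each call is independent of the others, so it could be a separate job.
        list(pool.map(lambda i: run_query(i, database, out_dir), range(len(database))))
    results = collate(out_dir, len(database))

for row in results:
    print("\t".join(row))
```

Writing one file per query keeps the jobs independent, which is what makes them easy to farm out; the collate step only needs to know how many files to expect.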

This is just one way. I'm most interested in everyone getting some experience thinking about how to design and implement a workflow using HPC.

4) Should there be an option to specify global or local alignment in our program? Or should it just run global.

It should be able to run either, using a command line flag.
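One possible way to wire up the flag with `argparse` is sketched below. The flag names (`--mode`, `--matrix`, `--gap`), their defaults, and the file name `db.fa` are assumptions, not something the assignment specifies.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="All-vs-all alignment (sketch)")
    parser.add_argument("database", help="FASTA file of sequences to compare")
    # Hypothetical flag names and defaults; pick whatever your program documents.
    parser.add_argument("--mode", choices=["global", "local"], default="global",
                        help="alignment type to run")
    parser.add_argument("--matrix", choices=["PAM250", "BLOSUM62"], default="BLOSUM62",
                        help="substitution matrix to score with")
    parser.add_argument("--gap", type=int, default=-4,
                        help="gap penalty (negative integer)")
    return parser


# Parsing an explicit list here so the example runs without a real command line.
args = build_parser().parse_args(["db.fa", "--mode", "local", "--matrix", "PAM250"])
print(args.mode, args.matrix, args.gap)
```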

tabinks commented 7 years ago

@uchicago-bio/mpcs56420-2016-autumn Good questions about the assignment above.
