Improve PBassign modularity

jbarnoud commented 9 years ago

In PR #46, I moved the assignment logic from PBassign.py to PBlib.py. Here, I slightly refactor PBassign.py so there is as little code outside of functions as possible. This makes it possible to import PBassign without border effect.

This PR also factorize the iteration over chains in two functions: PDBlib.chains_from_files and PDBlib.chains_from_trajectory. These functions are generators that yield chains either from a list of files or from a trajectory. The PR also get rid of some writing functions: PBlib.write_fasta is now replaced by PBlib.write_fasta_entry, PBlib.write_phipsi is now replaced by PBlib.write_phipsi_entry, PBlib.write_flat is replaced by just print, and PBlib.clean_file is just discarded as it is not useful anymore. See commit messages to more details.

The PR also do some cleaning.

pierrepo commented 9 years ago

Nice work @jbarnoud. I believe the introduction of the _NullContext class albeit clever is overkill and not trivial to understand. Could you please revert to the previous way files were handled? Could you please rename PBlib.write_phipsi_entry() to PBlib.write_phipsi_file() and PBlib.write_fasta_entry() to PBlib.write_fasta_file()?

jbarnoud commented 9 years ago

I just pushed a commit that renames PBlib.write_fasta_entry and PBlib.write_phipsi_entry to PBlib.write_fasta and PBlib.write_phipsi, respectively. Now I realize that I misread your comment... Don't you think PBlib.write_fasta_file is misleading as a name? The function does not write a fasta file per se, it writes an fasta entry in a file. This looks similar when you have only one entry in the file. Yet, as soon as you have two entries, writing a fasta file or writing a fasta entry become different things. So I can rename the functions, I am just not sure PBlib.write_phipsi_file and PBlib.write_fasta_file are the best names. Could you argue on this?

I am looking for a way to remove the _NullContext class and to stay simple. For now, I do not see any. Maybe @HubLot has ideas on the matter. Before I introduced the class in 19fad87, files were opened in append mode before writing each entry and then closed. We should not do that. I changed the code to open the files only once and to pass the open files to the writing function. I also had the files handled in a context manager. Yet, some outputs are optional and you cannot pass None to the with statement. This is why I introduced the _NullContext that is a context manager that does nothing.

The options I see to get rid of the _NullContext are:

create a context manager that handle the 3 file descriptors, and knows what to do if the files are not provided. This could be nice to play with, but may not be simpler.
get rid of the with statement and do the file handing manually. Yet, having the try..except logic there to handle one file and two file-or-None may not be the easier to read either.

I am open to suggestions.

pierrepo commented 9 years ago

@jbarnoud I agree with your comment regarding PBlib.write_phipsi_file() vs PBlib.write_phipsi_entry(). PBlib.write_phipsi_entry() is indeed the best option.

However, could you explain why is it wrong to open / write / close file for each entry? Python has a nice way to handle file opening and disk writing.

jbarnoud commented 9 years ago

I see several reason why one should prefer open a file only once:

open the file in append mode is error prone: in a public API you can be sure somebody will forget to clear the file before starting to write it;
when you close a file you release the lock on the file, meaning an other program can mess with what you are doing;
performance wise, opening and closing a file has a cost; also, you loose the benefit of python smart buffering strategies when you close the file.

You can have a look at this notebook if you are not convinced about the last point.

pierrepo commented 9 years ago

OK. The main point being for me the "public API". So we open file once, write entries several times and close file once. Could you find a simpler alternative to the _NullContext class?

HubLot commented 9 years ago

I agree with @pierrepo about the _NullContext class. It's clever to avoid the multiple I/O but odd to use.

In my opinion, the PB_assign function does too much things, i.e assign PB and write files. One solution could be to split it:

The new PB_assign will compute dihedrals & assign PB for all chains. The return would be a tuple containing the list of PB sequences and the list of dihedrals. (Also the name is quite confusing. There is a PBlib.assign()and a PBassign.PB_assign(). Maybe "compute_PB" or "get_PB". Or one "get_PB_from_dihedrals" and the other "get_PB_from_structures" or something else.)
A new function write_things which take in input the return values of the previous one and write the wanted values (fasta, dihedral, flat) with 3 context managers.

This way the list of PB sequences and dihedrals will be available to the user through the API and the write stuff is separated. The only downside is to store the whole PB sequences but it's already done in PBclust.py

jbarnoud commented 9 years ago

I assume @HubLot suggests we could write the files at the end of the loop. It would be something like:

all_comments = []
all_sequences = []
all_dihedrals = []
for comment, chain in chains:
    dihedrals = chain.get_phi_psi_angles()
    sequence = PB.assign(dihedrals, pb_ref)
    all_comments.append(comment)
    all_dihedrals.append(dihedrals)
    all_sequences.append(sequences)

with open(fasta_name, 'w') as outfile:
    write_all_fasta(all_comments, all_sequences, outfile)
if option.flat:
    with open(flat_name, 'w') as outfile:
        write_all_flat(all_comments, all_sequences, outfile)
if option.phipsi:
    with open(phipsi_name, 'w') as outfile:
        write_all_phipsi(all_comments, all_dihedrals, outfile)

I am not a huge fan on dealing with large lists, and the lists can be very large if we deal with trajectories. But I agree that it is simpler than the _NullContext solution.

HubLot commented 9 years ago

Indeed. I'm not a huge fan either but I think the RAM will not be an issue.

In the meantime, maybe the function could return an iterator which will be given to the writing function, avoiding large lists unless the user want to (by a list(my_iterator))

pierrepo commented 9 years ago

Me too. Not a big fan of list but as mentionned by @HubLot the RAM might not be an issue. For better performance, we could use tuples instead of lists.

@jbarnoud I like the implementation you proposed maybe enhanced with iterators. Also simplify the name of the write functions : write_fasta(), write_flat() and write_phipsi().

pierrepo commented 9 years ago

Hello @jbarnoud. Can you PR the new implementation when you will have time? This week, I had the chance to meet @HubLot in Paris.

jbarnoud commented 9 years ago

@pierrepo the last 3 commits remove the NullContext trick and implement what I described in my previous comment. Note that the memory consomption may be huge when working with trajectories.

pierrepo commented 9 years ago

@jbarnoud You did an awesome job. Thanks!!

pierrepo / PBxplore

Improve PBassign modularity #65