Closed — @jbarnoud closed this issue 9 years ago.
Nice work @jbarnoud. I believe the introduction of the _NullContext class, albeit clever, is overkill and not trivial to understand. Could you please revert to the previous way files were handled? Could you please also rename PBlib.write_phipsi_entry() to PBlib.write_phipsi_file() and PBlib.write_fasta_entry() to PBlib.write_fasta_file()?
I just pushed a commit that renames PBlib.write_fasta_entry and PBlib.write_phipsi_entry to PBlib.write_fasta and PBlib.write_phipsi, respectively. Now I realize that I misread your comment... Don't you think PBlib.write_fasta_file is misleading as a name? The function does not write a fasta file per se, it writes a fasta entry in a file. The two look similar when you have only one entry in the file; yet, as soon as you have two entries, writing a fasta file and writing a fasta entry become different things. So I can rename the functions, I am just not sure PBlib.write_phipsi_file and PBlib.write_fasta_file are the best names. Could you weigh in on this?
I am looking for a way to remove the _NullContext class and keep things simple. For now, I do not see any. Maybe @HubLot has ideas on the matter. Before I introduced the class in 19fad87, files were opened in append mode before writing each entry, then closed; we should not do that. I changed the code to open the files only once and to pass the open files to the writing functions. I also had the files handled in a context manager. Yet, some outputs are optional and you cannot pass None to a with statement. This is why I introduced _NullContext, a context manager that does nothing.
The option I see to get rid of the _NullContext is to drop the with statement and do the file handling manually. Yet, having the try..except logic there to handle one file and two file-or-None may not be easier to read either. I am open to suggestions.
@jbarnoud I agree with your comment regarding PBlib.write_phipsi_file() vs PBlib.write_phipsi_entry(). PBlib.write_phipsi_entry() is indeed the best option.
However, could you explain why it is wrong to open / write / close the file for each entry? Python has a nice way to handle file opening and disk writing.
I see several reasons why one should prefer opening a file only once:
You can have a look at this notebook if you are not convinced about the last point.
OK. The main point for me is the "public API". So we open the file once, write entries several times, and close the file once. Could you find a simpler alternative to the _NullContext class?
I agree with @pierrepo about the _NullContext class. It's clever to avoid the multiple I/O but odd to use.
In my opinion, the PB_assign function does too many things, i.e. it assigns PBs and writes files. One solution could be to split it in two:
- PB_assign will compute dihedrals & assign PBs for all chains. The return value would be a tuple containing the list of PB sequences and the list of dihedrals. (Also, the name is quite confusing: there is a PBlib.assign() and a PBassign.PB_assign(). Maybe "compute_PB" or "get_PB". Or one "get_PB_from_dihedrals" and the other "get_PB_from_structures", or something else.)
- write_things, which takes as input the return values of the previous one and writes the wanted outputs (fasta, dihedral, flat) with 3 context managers.
This way the list of PB sequences and dihedrals will be available to the user through the API, and the writing is separated. The only downside is storing the whole PB sequences, but that is already done in PBclust.py.
I assume @HubLot suggests we could write the files at the end of the loop. It would be something like:
all_comments = []
all_sequences = []
all_dihedrals = []
for comment, chain in chains:
    dihedrals = chain.get_phi_psi_angles()
    sequence = PBlib.assign(dihedrals, pb_ref)
    all_comments.append(comment)
    all_dihedrals.append(dihedrals)
    all_sequences.append(sequence)
with open(fasta_name, 'w') as outfile:
    write_all_fasta(all_comments, all_sequences, outfile)
if option.flat:
    with open(flat_name, 'w') as outfile:
        write_all_flat(all_comments, all_sequences, outfile)
if option.phipsi:
    with open(phipsi_name, 'w') as outfile:
        write_all_phipsi(all_comments, all_dihedrals, outfile)
I am not a huge fan of dealing with large lists, and the lists can be very large if we deal with trajectories. But I agree that it is simpler than the _NullContext solution.
Indeed. I'm not a huge fan either, but I think the RAM will not be an issue.
In the meantime, maybe the function could return an iterator, which would be given to the writing function, avoiding large lists unless the user wants one (via list(my_iterator)).
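A minimal sketch of the iterator idea (names and the two stubs are illustrative, not the actual PBlib/PBassign API): the assignment step becomes a generator, so nothing accumulates in memory unless the caller explicitly builds a list.

```python
def assign(dihedrals, pb_ref=None):
    # Stand-in for PBlib.assign(): the real code maps dihedral
    # angles to Protein Block letters; here we fake one letter
    # per residue so the example is self-contained.
    return 'Z' * len(dihedrals)


class FakeChain(object):
    """Stand-in chain exposing the get_phi_psi_angles() method
    used in the snippet above."""

    def __init__(self, angles):
        self._angles = angles

    def get_phi_psi_angles(self):
        return self._angles


def assign_all(chains, pb_ref=None):
    """Generator: yield (comment, sequence, dihedrals) one chain
    at a time instead of building three parallel lists."""
    for comment, chain in chains:
        dihedrals = chain.get_phi_psi_angles()
        sequence = assign(dihedrals, pb_ref)
        yield comment, sequence, dihedrals


chains = [('chain A', FakeChain([(-57.0, -47.0)] * 3))]
# Stream over the results without storing them...
# ...or materialize the full list only when asked for:
results = list(assign_all(chains))
```

The writing functions would then iterate over the generator directly, which keeps peak memory proportional to one chain rather than the whole trajectory.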
Me too. Not a big fan of lists, but as mentioned by @HubLot the RAM might not be an issue. For better performance, we could use tuples instead of lists.
@jbarnoud I like the implementation you proposed, maybe enhanced with iterators. Also, simplify the names of the write functions: write_fasta(), write_flat() and write_phipsi().
Hello @jbarnoud. Can you open a PR with the new implementation when you have time? This week, I had the chance to meet @HubLot in Paris.
@pierrepo the last 3 commits remove the _NullContext trick and implement what I described in my previous comment. Note that the memory consumption may be huge when working with trajectories.
@jbarnoud You did an awesome job. Thanks!!
In PR #46, I moved the assignment logic from PBassign.py to PBlib.py. Here, I slightly refactor PBassign.py so there is as little code outside of functions as possible. This makes it possible to import PBassign without side effects.
This PR also factors the iteration over chains into two functions: PDBlib.chains_from_files and PDBlib.chains_from_trajectory. These functions are generators that yield chains either from a list of files or from a trajectory. The PR also gets rid of some writing functions: PBlib.write_fasta is now replaced by PBlib.write_fasta_entry, PBlib.write_phipsi is now replaced by PBlib.write_phipsi_entry, PBlib.write_flat is replaced by just print, and PBlib.clean_file is discarded as it is not useful anymore. See the commit messages for more details. The PR also does some cleaning.
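As a rough illustration of the generator pattern this PR describes, here is a sketch of what a chains_from_files-style function could look like. The parsing is stubbed out (read_chains and the Chain type are invented for the example; the real PDBlib functions parse actual structures):

```python
import collections

# Toy chain record, standing in for whatever the real parser yields.
Chain = collections.namedtuple('Chain', 'name residues')


def read_chains(file_name):
    # Stand-in for real PDB parsing: yields one fake chain so
    # the example runs without any input files.
    yield Chain(name='A', residues=[])


def chains_from_files(file_names):
    """Generator: yield (comment, chain) pairs lazily, one at a
    time, so callers can stream over many files or a trajectory
    without holding everything in memory."""
    for name in file_names:
        for chain in read_chains(name):
            comment = '{0} | chain {1}'.format(name, chain.name)
            yield comment, chain


# Callers iterate directly, or build a list only when they want one.
pairs = list(chains_from_files(['1BTA.pdb']))
```

The comment format and helper names here are assumptions for illustration; only the generator structure (yield one chain at a time) reflects what the PR describes.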