[WIP] Split PBlib.py and PDBlib.py into sub-modules to improve modularization

pierrepo / PBxplore

A suite of tools to explore protein structures with Protein Blocks :snake:

https://pbxplore.readthedocs.org/en/latest/

MIT License

28 stars 17 forks source link

[WIP] Split PBlib.py and PDBlib.py into sub-modules to improve modularization #80

Closed HubLot closed 8 years ago

HubLot commented 8 years ago

Like said previously, with the most of the refactoring done, the PBlib.py file has grown very large. I propose a separation of it into sub-modules based on #25. Don't merge it yet because the scripts have not been modified.

Here, a suggestion of architecture. The spitting was not so obvious so don't hesitate to bring your opinions to improve it. (bold state for folders and italic for functions/classes/constants)

pbxplore
- analysis
- clustering.py: _distancematrix, __matrix_tostr, hclust
- compare.py: _matrix_to_singledigit, _compare_to_firstsequence
- neq.py: __neq_perresidue, _computeneq, _writeneq
- core
- PB.py: REFERENCES, NAMES, _SUBSTITUTION_MATRIXNAME, _load_substitutionmatrix, _writeflat, _compute_score_byposition, _substitutionscore
- PBs_substitution_matrix.dat
- assignement.py: __angle_modulo360, assign, __write_phipsientry, _writephipsi
- count.py: __assert_samesize, _countmatrix, _read_occurencefile (shouldbe rename in _read_countfile), _write_countmatrix
- formats
- PDB.py: PDB
- fasta.py: _FASTAWIDTH, _readfasta, _read_severalfasta, __write_fastaentry, _writefasta
- loader.py: _chains_fromfiles, _chains_fromtrajectory
- structure.py: Atom, Chain, _getdihedral
- scripts will contain later PBassign.py, PBcount.py, PBclust.py and PBstat.py
- utils.py: __slicematrix, _compute_freqmatrix
- visualization.py: _plotneq, _plotmap, _generateweblogo
- init.py to expose some critical functions directly to the users.

The idea was to separate:

the core functions which are intern to pbxplore (the dihedral assignement, the matrix of count, etc)
formats pbxplore handle: PDB, fasta, xtc. I'm not entirely satisfied by this sub-modules because it will be hard to define how to expose it to the user (more on that later). For example, functions inside loader.py should be directly accessible from pbxplore (pbxplore.chains_from_files()). Maybe also it lakes of proper API functions.
the analysis functions. See the __init__.py for the exposed functions.
the visualizations functions. Since there is no so much here, I didn't want to make a folder.

From this architecture, the main question/issue I struggle is how to expose the functions inside the submodules core and formats to the API user and I need your opinions. That's why I didn't change the scripts (hence the fail on tests). I think for analysis and visualization, it's pretty clear: the user will import the submodule to use the functions. For example, functions chains_from_files, chains_from_trajectory, assign should be directly accessible from pbxplore main module (pbxplore.my_function) but what about the other ones ? Should all be exposed to the main module ? If not, which ones ?


    # PBassign example
    import pbxplore

    # Maybe this whole block could be turn into a API function
    chains = pbxplore.chains_from_files(pdb_name_lst)
    all_comments = []
    all_sequences = []
    all_dihedrals = []
    for comment, chain in chains:
        dihedrals = chain.get_phi_psi_angles()
        sequence = pbxplore.assign(dihedrals)
        all_comments.append(comment)
        all_dihedrals.append(dihedrals)
        all_sequences.append(sequence)

    with open("fasta", 'w') as outfile:
        #What is the best line ?
        pbxplore.write_fasta(outfile, all_sequences, all_comments)
        pbxplore.formats.fasta.write_fasta(outfile, all_sequences, all_comments)
        pbxplore.fasta.write_fasta(outfile, all_sequences, all_comments)
        # ...

     with open("flat", 'w') as outfile:
        #Same here
         pbxplore.write_flat(outfile, all_sequences)
         pbxplore.formats.PB.write_flat(outfile, all_sequences)
        # ....

Again, maybe it's not clear because either I split those functions in wrong way or either it lakes of "global" API functions. Also, the number of read/write functions on different objects (count ,fasta, phipsi, flat) does not ease it and for this case, it's already API functions.

I'm all ears.

pierrepo commented 8 years ago

Hi @HubLot. You're on fire today! I propose visualization.py goes into the analysis directory since map or neq are kinds of analysis. One way to make things easier would be to expose to the main API all functions inside core and format.

jbarnoud commented 8 years ago

Yeah @HubLot ! Great !

I do not know how things should be on the file side, but on the API side I would avoid having to many levels. I would suggest to remove the penultimate level of your hierarchy, then pbxplore.analysis.neq.compute_neq would become pbxplore.analysis.compute_neq. It can be done with your file organization by importing the content of pbxplore.analysis.neq in pbxplore.analysis.__init__.

I would prefer to have all function related to file (excepted the visualization stuff) in pbxplore.formats that may be rename pbxplore.fileio or pbxplore.io. This include functions such as write_neq and write_phipsi.

I suggest to rename pbxplore.core.PB into pbxplore.data, to move compute_score_by_position, and substitution_score in pbxplore.analysis, and to move write_flat in pbxplore.formats.

pbxplore.formats.PDB and pbxplore.formats.structure could be merged in pbxplore.structure. I do not know what to do with get_dihedral, though.

count.py could go in analysis, therefore core could become assign.

It may look like a lot of change, but then the package is organized the same way I see PB used: from a structure or a trajectory (pbxplore.structure), PB sequences are assigned (pbxplore.assign), then analyzed (pbxplore.analysis); the output of the analysis is written into a file (pbxplore.formats) or plotted in some way (pbxplore.vizualise). If somebody wants to use an other PDB parser, then it will not use pbxplore.structure. If the PB sequences are already assigned, then pbxplore.structure and pbxplore.assign are not useful. pbxplore.data limits the need for imports across the module, it also allows a user to write new applications of the PB (like PB prediction of structure building) without touching the rest of pbxplore.

Finally, this really needs a shortcut:

chains = pbxplore.chains_from_files(pdb_name_lst)
all_comments = []
all_sequences = []
all_dihedrals = []
for comment, chain in chains:
    dihedrals = chain.get_phi_psi_angles()
    sequence = pbxplore.assign(dihedrals)
    all_comments.append(comment)
    all_dihedrals.append(dihedrals)
    all_sequences.append(sequence)

HubLot commented 8 years ago

Thanks guys for your inputs.

I do not know how things should be on the file side, but on the API side I would avoid having to many >levels. I would suggest to remove the penultimate level of your hierarchy, then >pbxplore.analysis.neq.compute_neq would become pbxplore.analysis.compute_neq. It can be done >with your file organization by importing the content of pbxplore.analysis.neq in >pbxplore.analysis.init.

It's already done in this PR ;)

I agree with the view of @jbarnoud about the changes with minor revisions (ahah). I think some "core" functions need to be directly accessible from the main module (i.e chains_from_files, chains_from_trajectory and assign). Then the user will load analysis and/or io to perform what he wants. I agree also with @pierrepo about visualization. I think I can move it to analysis (to reduce the number of sub-modules to load) and access the functions like this pbxplore.analysis.plot_neq

pierrepo commented 8 years ago

OK. But try to be KISS.

https://www.youtube.com/watch?v=q7fxN3g5sLw

HubLot commented 8 years ago

I made changes following our discussion. I also updated the scripts (PBassign.py, etc) to give you an idea.

pbxplore
- analysis
- clustering.py: _distancematrix, __matrix_tostr, hclust
- compare.py: _matrix_to_singledigit, _compare_to_firstsequence
- count.py: __assert_samesize, _countmatrix, _read_occurencefile
- neq.py: __neq_perresidue, _computeneq
- utils.py: __slicematrix, _compute_freqmatrix, _compute_score_byposition, _substitutionscore
- visualization.py: _plotneq, _plotmap, _generateweblogo
- all functions that don't start with '_' are exposed to analysis(pbxplore.analysis.compute_neq for example)
- io
- fasta.py: _FASTAWIDTH, _readfasta, _read_severalfasta, __write_fastaentry, _writefasta
- write.py: __write_phipsientry, _writephipsi, _writeflat, _write_countmatrix, _writeneq
- structure:
- PDB.py: PDB
- loader.py: _chains_fromfiles, _chains_fromtrajectory
- structure.py: Atom, Chain, _getdihedral
- PB.py: REFERENCES, NAMES, _SUBSTITUTION_MATRIXNAME, _load_substitutionmatrix,
- PBs_substitution_matrix.dat
- assign.py: __angle_modulo360, assign,
- scripts will contain later PBassign.py, PBcount.py, PBclust.py and PBstat.py
- init.py to expose some critical functions directly to the users (chains_from_files, chains_from_trajectory, assign and PB module)

I rename the folder formats by structure but I kept the structure and the files in it. In fact, it's mainly internal parser and the user should only know about chains_from_files and chains_from_trajectory.

I followed the suggestions for PB.py but I kept its name because this one is more recognizable (PB = Protein Blocks) than data (data is a good name for variable, not so much for modules). Overall, the changes made the architecture and the use of the module clearer and easier.

But, if you look into the scripts, I'm not satisfied with the import needed. Right now, we need 3 imports and the length to the function calls are long. Maybe @pierrepo was right by exposing all functions to the main module :)

pierrepo commented 8 years ago

Well well well. Can we have the week-end to think about it?

jbarnoud commented 8 years ago

I looked at PBassign.py and liked it much. I do not see a problem with the 3 lines of import. It is fairly common, and I find it neater: you know right from the imports what the script is about. Also, having things split that way should make things easier with optional dependencies.

Are the function call really much longer? pbxplore.io.write_fasta is 3 characters longer than pbxplore.write_fasta The names that are a bit long are those of the constants in pbxplore.structure. They are not used that often, if they are they can be imported as from pbxplore.structure import THE_LONG_CONSTANT_NAME.

HubLot commented 8 years ago

I looked at PBassign.py and liked it much. I do not see a problem with the 3 lines of import. It is fairly > common, and I find it neater: you know right from the imports what the script is about. Also, having > things split that way should make things easier with optional dependencies.

I could agree for the "analysis" sub-modules but not so much for the others (structure, io). 3 imports is huge imho. Numpy, matplotlib, mdtraj have mainly one import (of course, they are counter examples :) ) and they are pretty clear.

Are the function call really much longer? pbxplore.io.write_fasta is 3 characters longer than > pbxplore.write_fasta

Indeed but with one main import, we could do:

import pbxplore as pbx
pbx.write_fasta()

jbarnoud commented 8 years ago

@HubLot If you want to reduce the number of imports, you can add some imports in pbxplore.__init__. If you have import .io in the __init__ you should be able to access pbxplore.io from the import pbxplore.

I agree that the most common functions should have shortcuts, but not all functions or it will quickly become a mess.

HubLot commented 8 years ago

@HubLot If you want to reduce the number of imports, you can add some imports in pbxplore.__init__. If you have import .io in the __init__ you should be able to access pbxplore.io from the import pbxplore.

Indeed, this is already the case for the sub modules PB and we could add io and structure too. Then, the shortcut import pbxplore as pbx will work fine ( i.e pbx.io.write_fasta())

HubLot commented 8 years ago

I made few changes based on the discussion. As suggested by @jbarnoud, the submodules are exposed directly into the main pbxplore module through the pbxplore/__init__.py. I also exposed the submodule analysis to keep consistency between submodules.

Now, only one import is required (import pbxplore as pbx) but the hierarchy/level of the submodules are kept:

import pbxplore as pbx
pbx.assign()
pbx.chain_from_files()
pbx.io.write_fasta()
pbx.analysis.compute_neq()

I have also reflects those changes into the tests to pass travis checks

pierrepo commented 8 years ago

OK. I merge the beast.