pierrepo / PBxplore

A suite of tools to explore protein structures with Protein Blocks :snake:
https://pbxplore.readthedocs.org/en/latest/
MIT License
28 stars 17 forks source link

visualization functions fail when the first residue of the count matrix is not equal to 1 #148

Closed HubLot closed 7 years ago

HubLot commented 7 years ago

Hi,

We missed an important bug when writing your visualization library. As you know, sometimes the first residue of a PDB is not 1. We did take that in account in the PBcount and write_count_matrix() by adding a number of the first residue as an optional parameter (1 by default). It works great and the count matrix can start with a different number (see this test file for example https://github.com/pierrepo/PBxplore/blob/master/pbxplore/tests/test_data/count_multi1_first20.PB.count).

Unfortunately, we didn't take it in account properly for the visualizations tools (PBstat and the plot functions) :

$ PBstat -f pbxplore/tests/test_data/count_multi1_first20.PB.count -o test --map
Index of first residue is: 20
Traceback (most recent call last):
  File "pbxplore/bin/PBstat", line 9, in <module>
    load_entry_point('pbxplore', 'console_scripts', 'PBstat')()
  File "pbxplore/scripts/PBstat.py", line 173, in pbstat_cli
    pbx.analysis.plot_map(file_fig_name, count, residue_min, residue_max)
  File "pbxplore/analysis/visualization.py", line 103, in plot_map
    freq = utils._slice_matrix(freq_mat, residue_min, residue_max)
  File "/pbxplore/analysis/utils.py", line 57, in _slice_matrix
    raise IndexError("Index out of range")
IndexError: Index out of range

The issue lies in the _slice_matrix() function (in utils.py) were we mixed the index of the first residue from the count matrix (in PBcount) and the range of residue optionally provided in Pbstat :

The numpy matrix (mat) doesn't know about the index of the first residue (and nor it should). It always start at 0. When the first residue (in the count matrix) has an index of 1, it works well. We create a range based on it and slice the matrix according to residue-min/residue-max of PBstat (or the optional arguments in the plot functions). However, when it's different of 1 (say 252), we base our slicing on this index. If, for example the PDB is ten-residue long, we try to slice it like that : mat[252-1:262].

It should fails when the user wants a range of residues (252-262) from PBstat on a ten-residue PDB indexed from 1 but not when the ten-residue PBD is indexed from 252.

In the traceback I included, the count matrix starts at residue 20 (as stated in the output) and has 89 residues. We want plots for the whole thing. So, in _slice_matrix(), our range of residue indexes (residues_idx) goes from 1 to 89 (which is good, we want the whole matrix) but the next test is failing because residue_max (i.e 128) is not in this range. It's here we mixed the 2 indexes

HubLot commented 7 years ago

I will work on that and also include a test (which we also forget :smile: )

HubLot commented 7 years ago

We cannot solve this with only residue_min and residue_max arguments in the plotting functions. If residue_min == 20, we don't know if the index of first residue (in the count matrix) is 1 or 15 hence, it will affect the output images (either we squeeze 20 residues or 5 respectively).

After some thoughts, the easiest way I found to solve this is to add a optional variable called offset (set to 1 by default) in each plotting functions (plot_map, plot_neq/write_neq and generate_weblogo). It will represent the index of the first residue in the count matrix. From it, we can deduce the range of residues the user wants (in conjunction with residue_min/residue_max) and create the plot images accordingly.