haddocking / haddock3

Official repo of the modular BioExcel version of HADDOCK
https://www.bonvinlab.org/haddock3
Apache License 2.0

Presenting cluster information in tables and plots #655

Closed: SarahAlidoost closed this issue 2 months ago

SarahAlidoost commented 1 year ago

The dataframes used for creating the tables, scatter plots and box plots have three columns: Cluster-id, cluster-ranking and capri_rank. Here are two examples of dataframes that contain the Unclustered and Other groups:

     Cluster-id  capri_rank  cluster-ranking
0             -           1                -
1             -           1                -
2             -           1                -
3             -           1                -
4             -           1                -

     Cluster-id  capri_rank  cluster-ranking
125       Other          11               11
129       Other          11               11
131       Other          11               11
92        Other          11               13
108       Other          11               13
109       Other          11               13
119       Other          11               13
111       Other          11               14
130       Other          11               14

These groups are not represented consistently across plots and tables. For example, a cluster with Cluster-id = "-" is called "Unclustered" in the tables and scatter plots, but it is labelled "-" in the box plot legend and shown as capri_rank=1 on the box plot x-axis. Similarly, clusters with Cluster-id = "Other" are called "Other" in the scatter plot and box plot legends, but they appear with cluster-ranking=11, 13, 14 in the tables and as capri_rank=11 on the box plot x-axis.

See more:

[3 screenshots: docking-protein-protein_run1-test-branch_analysis_4_caprieval_analysis_report]

[2 screenshots: docking-antibody-antigen_run1-CDR-NMR-CSP-test_analysis_04_caprieval_analysis_report]

mgiulini commented 1 year ago

Hey @SarahAlidoost, can you be a bit more specific? Is there anything inconsistent on the analysis side? Unclustered is the label assigned to the cluster_id of a model when there is no clustering data (cluster_id = -), while Other refers to all the clusters (combined together) with a cluster rank higher than a threshold (default = 10). These two labels have very different meanings.
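The two rules described in this comment can be sketched as one small helper that tables, scatter plots and box plots could all call to obtain the same label. This is only an illustration, not haddock3 code: `display_label`, `UNCLUSTERED_ID` and `DEFAULT_TOP_N` are hypothetical names, and the "Cluster N" wording for ranked clusters is an assumption.

```python
# Hypothetical helper, not part of haddock3: one place that decides the label.
UNCLUSTERED_ID = "-"   # cluster_id used when there is no clustering data
DEFAULT_TOP_N = 10     # threshold mentioned in the thread (default = 10)


def display_label(cluster_id, cluster_rank, top_n=DEFAULT_TOP_N):
    """Return the label to show for a model in any table or plot."""
    # Check the unclustered case first: its rank is "-" and cannot be compared.
    if cluster_id == UNCLUSTERED_ID:
        return "Unclustered"
    # Every cluster ranked beyond the threshold is merged into one group.
    if int(cluster_rank) > top_n:
        return "Other"
    # The "Cluster N" format is an assumption made for this sketch.
    return f"Cluster {cluster_id}"
```

With a single function like this, a legend entry, a table header and an x-axis tick for the same model cannot drift apart.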

SarahAlidoost commented 1 year ago

Hey @SarahAlidoost, can you be a bit more specific? Is there anything inconsistent on the analysis side?

Only in plotting the results, not in running the analysis. The labels used in the tables, scatter plots and box plots are inconsistent with each other.

Unclustered is the label assigned to the cluster_id of a model when there is no clustering data (cluster_id = -),

The label Unclustered is used as a header in the table and in the legend of the scatters whereas the label "-" is used in the legend of the box plots. In the table, the value of Cluster Rank is "-" while the x-axis of box plots shows capri_rank = 1.

while Other refers to all the clusters (combined together) with cluster rank higher than a threshold (default = 10). These two labels have a very different meaning.

This is another example of the inconsistency of labels. The label "Other" is used in the legend of scatter plots and box plots whereas there is no column "Other" in the table. In the table, there are columns with headers according to cluster-ranking=11, 13, 14. Also, there is no "Other" in the x-axis of box plots but instead, they are all shown as capri_rank=11 (as an example).

Please let me know if it is still unclear.

amjjbonvin commented 1 year ago

This is the expected behaviour in the table. We do compute all cluster statistics, but for plotting purposes we only show the top 10, and everything else thus becomes "Other" (but not in the table).

A model that does not cluster is indicated in the table as "-", and this should translate to "Unclustered" in the plots.
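As a rough sketch of the behaviour described above (the table keeps every cluster, while the plots show the top 10 plus "Other" and "Unclustered"), the grouping for a box plot could look like the following. The function name `group_for_plot`, the dictionary keys and the scores are all made up for illustration and are not taken from haddock3.

```python
def group_for_plot(models, top_n=10):
    """Group per-model scores under the labels a plot would display."""
    ranked = {}                  # cluster_rank -> (label, list of scores)
    other, unclustered = [], []
    for m in models:
        if m["cluster_id"] == "-":
            unclustered.append(m["score"])   # no clustering data
        elif m["cluster_rank"] > top_n:
            other.append(m["score"])         # merged beyond the threshold
        else:
            label = f"Cluster {m['cluster_id']}"
            ranked.setdefault(m["cluster_rank"], (label, []))[1].append(m["score"])
    # Ranked clusters come first in rank order, then the two special groups.
    groups = {label: scores for _, (label, scores) in sorted(ranked.items())}
    if other:
        groups["Other"] = other
    if unclustered:
        groups["Unclustered"] = unclustered
    return groups


# Made-up example data; the scores are placeholders, not real results.
models = [
    {"cluster_id": 7, "cluster_rank": 2, "score": -98.0},
    {"cluster_id": 4, "cluster_rank": 1, "score": -120.5},
    {"cluster_id": 9, "cluster_rank": 11, "score": -60.2},
    {"cluster_id": 12, "cluster_rank": 13, "score": -55.1},
    {"cluster_id": "-", "cluster_rank": "-", "score": -40.0},
]
groups = group_for_plot(models)
```

Using the keys of `groups` as the x-axis categories, rather than capri_rank values, would put "Other" and "Unclustered" on the axis under the same names as in the legend.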

This is another example of the inconsistency of labels. The label "Other" is used in the legend of scatter plots and box plots whereas there is no column "Other" in the table.

amjjbonvin commented 1 year ago

PS: Thus, in the plot, what you call cluster 11 should be "Other". It is plotted correctly; it is just a label issue.