ispyb / ispyb-database-modeling

4 stars 3 forks source link

Schema fixing: Optimise the structure of the ImageQualityIndicators table #24

Open KarlLevik opened 6 years ago

KarlLevik commented 6 years ago

At Diamond the ImageQualityIndicators table is the largest (in terms of records) in the whole database, approximately 114 million records.

At this size we need to make sure the table is optimally structured, that unused/unnecessary columns and indexes are removed etcetera.

In the first instance I'd like to propose the following change:

ALTER TABLE ImageQualityIndicators 
  MODIFY dataCollectionId int(11) unsigned NOT NULL FIRST,
  MODIFY imageNumber mediumint(8) unsigned NOT NULL AFTER dataCollectionId,
  DROP FOREIGN KEY _ImageQualityIndicators_ibfk3,
  DROP KEY ImageQualityIndicators_ibfk3,
  DROP PRIMARY KEY,
  DROP imageQualityIndicatorsId,
  ADD PRIMARY KEY (dataCollectionId, imageNumber);

As you can see, I'm removing the primary key (including the column), removing the explicit foreign key to DataCollection (but not the column itself), creating a new, compound primary key consisting of dataCollectionId and imageNumber, and (for cosmetic reasons) moving those two columns first in the table.

The advantages of this are:

Please note:

I do realize there are other tables in the database that also need to be reviewed and probably modified, but we need to start somewhere, and this is as good a candidate as any.

Further changes that could be done would be to remove unused columns. My colleague Markus did a review and here's what he had to say:

We currently write meaningful information to:

datacollectionid image_number

for dozor:

dozor_score

for in-house per-image-analysis:

spots_total totalintegratedsignal good_bragg_candidates method1_res

and then we fill these values as well:

in_res_total # always set to same value as spots_total method2_res # always same value as method1_res programid # always 65228265 icerings # always 0 maxunitcell # always 0 pctsaturationtop50peaks # always 0 inresolutionovrlspots # always 0 binpopcutoffmethod2res # always 0

What do you think about the proposed changes? And are all the columns populated with meaningful data at other synchrotrons? Let's discuss ...

antolinos commented 6 years ago

We are not using this table any more but we still have data inside so as far as modifications are compatible with the current structure will be fine with us.

However as the relation between ImageQualityIndicators and datacollection is via Image table , this will not work for us as it is today: ADD PRIMARY KEY (dataCollectionId, imageNumber);

We are not filling dataCollectionId and imageNumber

KarlLevik commented 6 years ago

We had to go ahead with these changes in order to cope with increased insert rates. After the change (during the last shutdown) the problems previously seen have gone away. Going forward, we may still have to implement further changes to cope with the increasing size of the table.

We don't use the Image table any more, so for us the relation is directly from IQI to DC.

delageniere commented 6 years ago

We have to discuss it internally : perhaps we do not need any more the old data in these tables and could remove them and we have to see how it will behave with the proposed changes.