WIP: New tables and changes for cryo-EM

KarlLevik commented 4 years ago

ESRF and Diamond have been discussing new tables and table changes for cryo-EM to accommodate software further downstream in the pipeline than we have previously discussed.

The discussion is still ongoing, but here is the SQL I have at this point - please feel free to comment. I have left some of my own inline comments to show some of the current problems/unknowns.

SQL updated based on feedback in this ticket from Stu and Olof (updated most recently on 5. July 2021):

CREATE TABLE ParticlePicker (
  particlePickerId int unsigned auto_increment PRIMARY KEY,
  programId int unsigned,
  firstMotionCorrectionId int unsigned,
  particlePickingTemplate varchar(255) COMMENT 'Cryolo model',
  particleDiameter float COMMENT 'Unit: nm',
  numberOfParticles int unsigned,
  summaryImageFullPath varchar(255) DEFAULT NULL COMMENT 'Generated summary micrograph image with highlighted particles',
  CONSTRAINT `ParticlePicker_fk_programId`
    FOREIGN KEY (`programId`)
      REFERENCES `AutoProcProgram` (`autoProcProgramId`)
        ON DELETE NO ACTION ON UPDATE CASCADE,
    CONSTRAINT `ParticlePicker_fk_motionCorrectionId`
      FOREIGN KEY (`firstMotionCorrectionId`)
        REFERENCES `MotionCorrection` (`motionCorrectionId`)
          ON DELETE NO ACTION ON UPDATE CASCADE
)
COMMENT 'An instance of a particle picker program that was run';

CREATE TABLE `ParticleClassificationGroup` (
  particleClassificationGroupId int unsigned auto_increment PRIMARY KEY,
  particlePickerId int unsigned,
  programId int unsigned,
  type enum('2D', '3D') COMMENT 'Indicates the type of particle classification',
  batchNumber int unsigned COMMENT 'Corresponding to batch number',
  numberOfParticlesPerBatch int unsigned COMMENT 'total number of particles per batch (a large integer)',
  numberOfClassesPerBatch int unsigned,
  symmetry varchar(20),
  CONSTRAINT `ParticleClassificationGroup_fk_particlePickerId`
    FOREIGN KEY (`particlePickerId`)
      REFERENCES `ParticlePicker` (`particlePickerId`)
        ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `ParticleClassificationGroup_fk_programId`
    FOREIGN KEY (`programId`)
      REFERENCES `AutoProcProgram` (`autoProcProgramId`)
        ON DELETE NO ACTION ON UPDATE CASCADE
);

-- 2D and 3D particle classification fields are almost the same, so using
-- the same tables seems to make sense:
CREATE TABLE ParticleClassification (
  particleClassificationId int unsigned auto_increment PRIMARY KEY,
  particleClassificationGroupId int unsigned,
  classNumber int unsigned COMMENT 'Identified of the class. A unique ID given by Relion',
  classImageFullPath varchar(255) COMMENT 'The PNG of the class',
  particlesPerClass int unsigned COMMENT 'Number of particles within the selected class, can then be used together with the total number above to calculate the percentage', -- is this just numberOfParticlesPerBatch * numberOfClassesPerBatch ? If so, this column is unnecessary.
  rotationAccuracy int unsigned COMMENT '???', -- or "accuracyRotations" as in spreadsheet? float?
  translationAccuracy float COMMENT 'Unit: Angstroms', 
  estimatedResolution float COMMENT '???, Unit: Angstroms',
  overallFourierCompleteness float,
  classDistribution float NULL DEFAULT NULL COMMENT 'Provides a figure of merit for the class, higher number is better',
  -- classRepresentativePreview ?,
  CONSTRAINT `ParticleClassification_fk_particlePickerId`
    FOREIGN KEY (`particlePickerId`)
      REFERENCES `ParticlePicker` (`particlePickerId`)
        ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `ParticleClassification_fk_particleClassificationGroupId`
      FOREIGN KEY (`particleClassificationGroupId`)
        REFERENCES `ParticleClassificationGroup` (`particleClassificationGroupId`)
          ON DELETE CASCADE ON UPDATE CASCADE
)
COMMENT 'Results of 2D or 2D classification';

-- Are we creating a "final" model later? If so, should they 
-- go into the same table?
-- Also, the ispyb schema has a couple of other "model" tables, so this should
-- probably have some kind of prefix to make clear it's for EM.
CREATE TABLE CryoemInitialModel (
  cryoemInitialModelId int unsigned auto_increment PRIMARY KEY,
  resolution float COMMENT 'Unit: Angstroms',
  numberOfParticles int unsigned
)
COMMENT 'Initial cryo-EM model generation results';

-- n:m relationship between ParticleClassification and InitialModel:
CREATE TABLE ParticleClassification_has_CryoemInitialModel (
  particleClassificationId int unsigned,
  cryoemInitialModelId int unsigned,
  PRIMARY KEY (particleClassificationId, cryoemInitialModelId),
  CONSTRAINT `ParticleClassification_has_CryoemInitialModel_fk1`
    FOREIGN KEY (`particleClassificationId`)
      REFERENCES `ParticleClassification` (`particleClassificationId`)
        ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `ParticleClassification_has_InitialModel_fk2`
    FOREIGN KEY (`cryoemInitialModelId`)
      REFERENCES `CryoemInitialModel` (`cryoemInitialModelId`)
        ON DELETE CASCADE ON UPDATE CASCADE
);

stufisher commented 4 years ago

Does particle picking happen on a movie, or a motion corrected movie? (presumably the latter). Movie should not contain parameters that come from a processing of a motioncorrection. The relationships look to be the wrong way round and there seems to be an intermediate table missing. I suggest removing the columns from movie, and adding

ParticlePicker particlepickerid pk ai autoprocprogramid fk particlePickingTemplate varchar(255) COMMENT 'Cryolo model', particleDiameter float COMMENT 'Unit: nm'; numberofparticles int

ParticlePicker_has_MotionCorrection ParticlePicker_has_MotionCorrection pk ai particlepickerid fk motioncorrectionid fk

ParticleClassification remove motionCorrectionId int unsigned, particlepickerid fk

i think the json file should go into autoprocprogramattachment?

olofsvensson commented 4 years ago

Hi @stufisher,

Particle picking is made on a motion corrected micrograph from a movie. You are right, movie shouldn't contain parameters from processing. However, even if we could keep track on which particle that came from which motion-corrected micrograph, I'm not sure that it's useful - I don't think we will upload every single particle to ISPyB.

What is more useful I think is to start at 2D classification an work backwards. The pipeline I'm working on at this moment triggers 2D classification at 5000, 20000, 50000 and 100000 particles (these thresholds may be changed in the future). For each threshold we will obtain a 2D classification and it would then be interesting to link this classification to the set of micrographs used, typically around 1000 micrographs for 50000 particles. I think it would be overkill to link thousands motion-corrected micrographs for each 2D classification, maybe the easies would be to link start and end micrograph. We should also store the set of particles used for the 2D classification as we store results from auto processing today. So the AutoProcProgram table would indeed be useful.

Thoughts?

stufisher commented 4 years ago

I'm not suggesting storing which particles were picked on an individual bases. ParticlePicker is just an instance of a particle picker program that was run

Ok i understand better your workflow, ive updated the above to take into account how the pipelines are launched without muddying the movie table. This allows you to launch multiple ParticlePickers each with their linkages to which movies were used and how many particles were picked

olofsvensson commented 4 years ago

This sounds good! However, I'm confused - what does 'updated the above' mean? I guess we would need a new SQL script as a base for further discussions.

stufisher commented 4 years ago

Sorry i meant i had updated the sql in my previous comment

olofsvensson commented 4 years ago

Ok, for me the sql in your comment looks good! @KarlLevik , any comments?

How easy would it be to visualise this proposed model in a data base diagram? I would like to discuss it with Isai and it's easier to have a graph to show.

KarlLevik commented 3 years ago

Hi @olofsvensson - yes, that probably makes sense.

I've updated my SQL in the initial post above. I'll generate a diagram later today.

Still missing some units in the column comments on ParticleClassification.

Any better ideas for the name of InitialModel?

I would love to keep these and the other EM tables in a separate schema (*) as is done with the SSX tables, since this is an extension / module of ISPyB that only a few facilities will use. That way we won't clutter the core schema. Also, it makes it clear that these tables are EM tables without the need to have 'EM' as a prefix in the table names. However, not sure how you feel about that at the ESRF, Olof?

* i.e. different than the "pydb" or "ispyb" as we call it at Diamond

stufisher commented 3 years ago

This is already core ISPyB, we really do not want to query across multiple databases to retrieve this info.

antolinos commented 3 years ago

I agree Stuart besides we prefer to keep the current ISPyB as it is (EM belongs to the core) until the SSX with a modular architecture is proven as a working solution We need to test if it will fit our needs and other technical details like how we manage the consistency and how aggregate results.

olofsvensson commented 3 years ago

Thanks @KarlLevik for updating the SQL! I'll go through it and I let you know if I have any questions / remarks. I agree with @antolinos and @stufisher that we should keep the EM to the core.

drnasmith commented 3 years ago

If I understand correctly there would not be an impact to queries across schemas running in the same cluster. However, maybe file this under a good idea for the future. As @antolinos says the SSX extension should be a good test case for such an approach.

KarlLevik commented 3 years ago

Yes, I'm aware EM is part of the ISPyB collaboration, but that doesn't mean the EM tables have to be in the same schema as the tables that are used across all disciplines. In that sense the EM tables are not "core" ISPyB. (Neither are MX-specific tables, of course, but moving those seems like a more involved job than the EM tables which after all are currently not in use at Diamond and only used by 1 (?) instrument at the ESRF at the moment.

olofsvensson commented 3 years ago

Yes @KarlLevik you are right, the EM tables are used only by one instrument (CM01) at the ESRF. However, CM01 has been using the EM tables since 2017 and they are used for every experiments, now up to three experiments per week. Also we have the 2D classification pipeline deployed and we would now like to add the results of 2D classification. More, we will soon also need to add more EM tables related to pre-screening of samples. I agree with you that the best would be to have separate schema for different techniques and this is the direction where we are going with the results of serial crystallography. However, I don't think it would be a good idea to do this change now for EM.

antolinos commented 3 years ago

Hi @KarlLevik It does not matter if the number of instruments using EM is 1 or 10000, the work to be done to extract the current EM tables from the core is the same. As I said, it is something that we are not going to do until we are sure that it is worth. This is why we aim to implement that architecture with SSX and then it will be evaluated. With that feedback we could then decide if techniques should be moved from the core or not considering, of course, among other factors the work involved.

KarlLevik commented 3 years ago

Diagrams:

Just the relationships (includes also DataCollection, DataCollectionGroup, Detector and BeamLineSetup): EM_tables_2020-12-09

Full diagram: EM_tables_full_2020-12-09

olofsvensson commented 3 years ago

Thanks @KarlLevik for these diagrams! I have finally come to the point where I will start to implement these tables in our "valid" (test) database. I have two requests / questions:

Would it be possible to rename the "InitialModel" to "3DStartingModel"? I find "InitialModel" too vague...
Sometimes it's useful to provide symmetry information to the 2D classification processing. Could we add a "symmetry" column to the "ParticleClassification" table?

olofsvensson commented 3 years ago

@KarlLevik , I need help! I get this error when I try to create the ParticlePicker_has_MotionCorrection table:

1005 - Can't create table pydb.ParticlePicker_has_MotionCorrection (errno: 150 "Foreign key constraint is incorrectly formed") (Details…)

When I click on "Details" I see this (phpMyAdmin 4.6.6):

Storage Engines InnoDB Documentation Percona-XtraDB, Supports transactions, row-level locking, foreign keys and encryption for tables

Can you please help me to find out what's wrong with the following SQL (copied from your initial post):

CREATE TABLE ParticlePicker_has_MotionCorrection (
  particlePickerId int unsigned,
  motionCorrectionId int unsigned,
  PRIMARY KEY (particlePickerId, motionCorrectionId),
  CONSTRAINT `ParticlePicker_has_MotionCorrection_fk_particlePickerId`
    FOREIGN KEY (`particlePickerId`)
      REFERENCES `ParticlePicker` (`particlePickerId`)
        ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT `ParticlePicker_has_MotionCorrection_fk_motionCorrectionId`
      FOREIGN KEY (`motionCorrectionId`)
        REFERENCES `MotionCorrection` (`motionCorrectionId`)
          ON DELETE CASCADE ON UPDATE CASCADE
);

I managed to create all other tables including the "ParticleClassification_has_InitialModel" table which has a very similar SQL...

KarlLevik commented 3 years ago

Hi @olofsvensson !

Re: the problem with the ParticlePicker_has_MotionCorrection table, I suspect it's because of a mismatch between the primary key datatype "display size" of the MotionCorrection table and the default "display size" for ints in your database system. Or potentially that your MotionCorrection PK column is defined without the UNSIGNED keyword. So have a look at your MotionCorrection table and find the motionCorrectionId column and see what it says. You can then try to create ParticlePicker_has_MotionCorrection again, but specify exactly the same datatype as you just saw for motionCorrectionId column, e.g. "motionCorrectionId int(11) unsigned".

Re: your question about renaming InitialModel to 3DStartingModel, that is possible, but then we should of course also rename its primary key column and also rename the ParticleClassification_has_InitialModel table accordingly. Is everyone happy with 3DStartingModel?

Re: adding a "symmetry" column to the "ParticleClassification" table: Yes, certainly. So to be clear, do you mean a column that would allow a value equal to one of: triclinic, monoclinic, orthorhombic, tetragonal, trigonal, hexagonal, cubic?

olofsvensson commented 3 years ago

Hi @KarlLevik , thanks for feedback!

I have managed to insert data into the ParticlePicker table using (Java) ISPyB web services (https://github.com/ispyb/ISPyB/issues/518).

I have not yet tested what you suggested for the ParticlePicker_has_MotionCorrection table because I have so far just added a simplified table without the constraints for testing. I'm now wondering if the table ParticlePicker_has_MotionCorrection is practical since each ParticlePicker entry could be based on hundreds if not thousands of movies / motion correction entries. I don't think it's useful to enter all this information in the database for these reasons:

It will take time to upload thousands of motion correction ids
I can't see how this information could be displayed in a useful way

Instead of using the table ParticlePicker_has_MotionCorrection I suggest to add two new columns to the ParticlePicker table:

firstMotionCorrectionId
lastMotionCorrectionId

The first link will ensure that the ParticlePicker entry is linked to a proposal / session. Together with the last link they will at least give an idea which movies that were used for the ParticlePicker.

Coming back to the InitialModel table, maybe it's sufficient to give it a prefix: EMInitialModel, or CryoemInitalModel. 3DStartingModel is probably not a good idea since the name starts with a number... My preference would be CryoemInitalModel.

Concerning the symmetry I don't know which symmetries are allowed for CryoEM processing in general and 2D classification in particular, however, we had a case with five-fold symmetry for which the 2D classification would have been both more accurate and faster if we could have provided the symmetry as input. For the database I don't think we should use a enumeration but just a string column.

olofsvensson commented 3 years ago

Hi again @KarlLevik , I have discussed the topic concerning the ParticlePicker_has_MotionCorrection table with out CryoEM scientists here and we have come to these conclusions:

It is desirable to keep track on which motion-corrected micro-graphs which were used for a particular particle picking result.
This information is available as a '.star' file from Relion.
Therefore, it should be enough to store the particle picking .star file in the database and link it via an AutoProcProgramAttachment to the ParticlePicker table.
Thus ParticlePicker_has_MotionCorrection table is not needed. However, we must have at least one reference to a motion correction entry in the database, which could be solved by a new column "firstMotionCorrectionId".
Also, it is desirable to have information on how many particles were picked on each micro-graph. Again this fine-grained information does not to be modeled in the database, a file containing this information is enough (csv, text, etc) attached to the ParticlePicker via an AutoProcProgramAttachment.

Would this work for you at the DLS?

KarlLevik commented 3 years ago

Hi Olof, I've just updated the SQL (see the first post above) based on your suggestions. I hope I've captured everything.

We'll have to discuss on our side whether we think this will work for us.

EM_tables_full_2021-02-18

KarlLevik commented 3 years ago

Updated 4th March 2021:

EM_tables_full_2021-03-04

olofsvensson commented 3 years ago

Hi @KarlLevik , thanks for these modifications! This will work for me, at least for 2D classification - I have no experience with 3D classification so I guess we might have to do some tweaks when we start to do these.

I have just one semantic issue with the proposed data model... The 'ParticlePicker' table has now a reference to 'particleClassificationProgramId'. This means that this table is more a classification table than a particle picker table. Also, the fact that many individual 'ParticleClassification' entries from one classification are linked to this table makes it, I think, more a classification table than a particle picker table.

There are two possible solution to this semantic issue:

We can introduce a new table between 'ParticlePicker' and 'ParticleClassification'. This would make 'ParticlePicker' only have to deal with particle picking. However, I don't suggest this solution because I see no point in separating the particle picking from the classification.
We rename 'ParticleClassification' to 'ParticleClassificationEntry' (or something like this) and we rename 'ParticlePicker' to 'ParticleClassification'.

I'm open to keep the suggested structure if you feel that this is not sufficiently important semantic problem.

KarlLevik commented 3 years ago

I think we should not postpone fixing any semantic issues, but rather sort them out now while we can!

I agree, let's not add any extra tables to fix semantic issues, but rather rename the tables we have.

How about renaming the ParticlePicker table to ParticleClassificationGroup?

olofsvensson commented 3 years ago

Hi @KarlLevik , after having discussed this issue briefly with CryoEM scientists I now think that the best would be to have an intermediate table for the following reason:

Today we have an one-to-one relation between particle picker and 2D classification, however, this might change in the future. Therefore, even if it at this moment is not necessary, I suggest that we keep the ParticlePicker and we introduce a one-to-many relation with a new table ParticleClassificationGroup. Otherwise we will be in trouble if in the future we have more than one program that can do 2D classification from a given set of particles.

This extra table will make the data model slightly more complex but it's a low price to pay in comparison if we have to come back to this data model in the future.

So, my (final) suggestion would be:

ParticlePicker - 1 ParticleClassificationGroup - 1 ParticleClassification

The 'particleClassificationProgramId' would then be moved from ParticlePicker to ParticleClassificationGroup. We can then rename 'particlePickerProgramId' and 'particleClassificationProgramId' to simply 'programId' since they will be unique for each table.

KarlLevik commented 3 years ago

I had a chat with Olof earlier today and we agreed that we do need a new intermediary table between ParticlePicker (pp) and ParticleClassification (pc). Some columns from pp and pc will be moved into the new table, which we will call ParticleClassificationGroup:

pp.particleClassificationProgramId pc.type pc.batchNumber pc.numberOfParticlesPerBatch pc.numberOfClassesPerBatch pc.symmertry

I've updated the SQL script in the issue description above with the new table structure.

KarlLevik commented 3 years ago

EM_tables_2021-04-12

KarlLevik commented 3 years ago

We've discovered that ParticleClassification.rotationAccuracy should actually have data type float, not int unsigned.

@olofsvensson , do you agree?

ALTER TABLE `ParticleClassification`
  MODIFY rotationAccuracy float COMMENT '???';

olofsvensson commented 3 years ago

Yes @KarlLevik , I do agree - rotationAccuracy should be of type float.

olofsvensson commented 3 years ago

Hi @KarlLevik , we just discovered that there's a column missing in the "ParticleClassification" table: classDistribution. This parameter is important because it's the best parameter for sorting the results of 2D classification in the GUI. Would you agree to add this column to the data model?

KarlLevik commented 3 years ago

Maybe with some explanation!

What is the purpose of this column? This would be a varchar, if I understand correctly? What should be the max length? What kind of data / values would go into this column?

olofsvensson commented 3 years ago

Yes, sorry, I forgot to mention that 'classDistribution' is a float value produced by Relion 2D classification, like rotationAccuracy, translationAccuracy etc.

d-j-hatton commented 3 years ago

There is some discussion of what the class distribution is in this email thread: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ccpem;ace0783d.1803. It provides a figure of merit for the class. People like to view classes ordered by the class distribution from highest to lowest

KarlLevik commented 3 years ago

I propose the following SQL to modify the table:

ALTER TABLE ParticleClassification
  ADD classDistribution float NULL DEFAULT NULL COMMENT 'Provides a figure of merit for the class, higher number is better';

I'll update the CREATE TABLE at the top of the thread.

KarlLevik commented 3 years ago

@olofsvensson - At Diamond we have identified another column that we need in the ParticlePicker table: summaryImageFullPath.

So if you don't mind, we'd like to propose the following change:

ALTER TABLE ParticlePicker
  ADD summaryImageFullPath varchar(255) NULL DEFAULT NULL COMMENT 'Generated summary micrograph image with highlighted particles';

olofsvensson commented 3 years ago

Hi @KarlLevik , ok for me to add summaryImageFullPath in the ParticlePickertable.

KarlLevik commented 3 years ago

Excellent. I've updated the CREATE TABLE statements at the top of the thread, and here is the up-to-date diagram - not much has changed since the version from April:

EM_tables_full_2021-07-02

olofsvensson commented 3 years ago

Hi @KarlLevik , thanks for the new diagram!

I just spotted a mistake in the 'PariclePicker' SQL:

  CONSTRAINT `ParticlePicker_fk_particlePickerProgramId`
    FOREIGN KEY (`particlePickerProgramId`)
      REFERENCES `AutoProcProgram` (`autoProcProgramId`)
        ON DELETE NO ACTION ON UPDATE CASCADE,
  CONSTRAINT `ParticlePicker_fk_programId`
    FOREIGN KEY (`programId`)
      REFERENCES `AutoProcProgram` (`autoProcProgramId`)
        ON DELETE NO ACTION ON UPDATE CASCADE,

I guess the first constraint should be deleted since the 'particlePickerProgramId' column was renamed to 'programId'.

KarlLevik commented 3 years ago

@olofsvensson Yes, you're correct - have removed that now.

ispyb / ispyb-database-modeling

WIP: New tables and changes for cryo-EM #64

1005 - Can't create table `pydb`.`ParticlePicker_has_MotionCorrection` (errno: 150 "Foreign key constraint is incorrectly formed") (Details…)

ispyb / ispyb-database-modeling

WIP: New tables and changes for cryo-EM #64

1005 - Can't create table pydb.ParticlePicker_has_MotionCorrection (errno: 150 "Foreign key constraint is incorrectly formed") (Details…)

1005 - Can't create table `pydb`.`ParticlePicker_has_MotionCorrection` (errno: 150 "Foreign key constraint is incorrectly formed") (Details…)