Open KarlLevik opened 4 years ago
Does particle picking happen on a movie, or a motion corrected movie? (presumably the latter). Movie should not contain parameters that come from a processing of a motioncorrection. The relationships look to be the wrong way round and there seems to be an intermediate table missing. I suggest removing the columns from movie, and adding
ParticlePicker particlepickerid pk ai autoprocprogramid fk particlePickingTemplate varchar(255) COMMENT 'Cryolo model', particleDiameter float COMMENT 'Unit: nm'; numberofparticles int
ParticlePicker_has_MotionCorrection ParticlePicker_has_MotionCorrection pk ai particlepickerid fk motioncorrectionid fk
ParticleClassification remove motionCorrectionId int unsigned, particlepickerid fk
i think the json file should go into autoprocprogramattachment?
Hi @stufisher,
Particle picking is made on a motion corrected micrograph from a movie. You are right, movie shouldn't contain parameters from processing. However, even if we could keep track on which particle that came from which motion-corrected micrograph, I'm not sure that it's useful - I don't think we will upload every single particle to ISPyB.
What is more useful I think is to start at 2D classification an work backwards. The pipeline I'm working on at this moment triggers 2D classification at 5000, 20000, 50000 and 100000 particles (these thresholds may be changed in the future). For each threshold we will obtain a 2D classification and it would then be interesting to link this classification to the set of micrographs used, typically around 1000 micrographs for 50000 particles. I think it would be overkill to link thousands motion-corrected micrographs for each 2D classification, maybe the easies would be to link start and end micrograph. We should also store the set of particles used for the 2D classification as we store results from auto processing today. So the AutoProcProgram table would indeed be useful.
Thoughts?
I'm not suggesting storing which particles were picked on an individual bases. ParticlePicker is just an instance of a particle picker program that was run
Ok i understand better your workflow, ive updated the above to take into account how the pipelines are launched without muddying the movie table. This allows you to launch multiple ParticlePickers each with their linkages to which movies were used and how many particles were picked
This sounds good! However, I'm confused - what does 'updated the above' mean? I guess we would need a new SQL script as a base for further discussions.
Sorry i meant i had updated the sql in my previous comment
Ok, for me the sql in your comment looks good! @KarlLevik , any comments?
How easy would it be to visualise this proposed model in a data base diagram? I would like to discuss it with Isai and it's easier to have a graph to show.
Hi @olofsvensson - yes, that probably makes sense.
I've updated my SQL in the initial post above. I'll generate a diagram later today.
Still missing some units in the column comments on ParticleClassification
.
Any better ideas for the name of InitialModel
?
I would love to keep these and the other EM tables in a separate schema (*) as is done with the SSX tables, since this is an extension / module of ISPyB that only a few facilities will use. That way we won't clutter the core schema. Also, it makes it clear that these tables are EM tables without the need to have 'EM' as a prefix in the table names. However, not sure how you feel about that at the ESRF, Olof?
* i.e. different than the "pydb
" or "ispyb
" as we call it at Diamond
This is already core ISPyB, we really do not want to query across multiple databases to retrieve this info.
I agree Stuart besides we prefer to keep the current ISPyB as it is (EM belongs to the core) until the SSX with a modular architecture is proven as a working solution
We need to test if it will fit our needs and other technical details like how we manage the consistency and how aggregate results.
Thanks @KarlLevik for updating the SQL! I'll go through it and I let you know if I have any questions / remarks. I agree with @antolinos and @stufisher that we should keep the EM to the core.
If I understand correctly there would not be an impact to queries across schemas running in the same cluster. However, maybe file this under a good idea for the future. As @antolinos says the SSX extension should be a good test case for such an approach.
Yes, I'm aware EM is part of the ISPyB collaboration, but that doesn't mean the EM tables have to be in the same schema as the tables that are used across all disciplines. In that sense the EM tables are not "core" ISPyB. (Neither are MX-specific tables, of course, but moving those seems like a more involved job than the EM tables which after all are currently not in use at Diamond and only used by 1 (?) instrument at the ESRF at the moment.
Yes @KarlLevik you are right, the EM tables are used only by one instrument (CM01) at the ESRF. However, CM01 has been using the EM tables since 2017 and they are used for every experiments, now up to three experiments per week. Also we have the 2D classification pipeline deployed and we would now like to add the results of 2D classification. More, we will soon also need to add more EM tables related to pre-screening of samples. I agree with you that the best would be to have separate schema for different techniques and this is the direction where we are going with the results of serial crystallography. However, I don't think it would be a good idea to do this change now for EM.
Hi @KarlLevik It does not matter if the number of instruments using EM is 1 or 10000, the work to be done to extract the current EM tables from the core is the same. As I said, it is something that we are not going to do until we are sure that it is worth. This is why we aim to implement that architecture with SSX and then it will be evaluated. With that feedback we could then decide if techniques should be moved from the core or not considering, of course, among other factors the work involved.
Diagrams:
Just the relationships (includes also DataCollection, DataCollectionGroup, Detector and BeamLineSetup):
Full diagram:
Thanks @KarlLevik for these diagrams! I have finally come to the point where I will start to implement these tables in our "valid" (test) database. I have two requests / questions:
@KarlLevik , I need help! I get this error when I try to create the ParticlePicker_has_MotionCorrection table:
1005 - Can't create table
pydb
.ParticlePicker_has_MotionCorrection
(errno: 150 "Foreign key constraint is incorrectly formed") (Details…)
When I click on "Details" I see this (phpMyAdmin 4.6.6):
Storage Engines InnoDB Documentation Percona-XtraDB, Supports transactions, row-level locking, foreign keys and encryption for tables
Can you please help me to find out what's wrong with the following SQL (copied from your initial post):
CREATE TABLE ParticlePicker_has_MotionCorrection (
particlePickerId int unsigned,
motionCorrectionId int unsigned,
PRIMARY KEY (particlePickerId, motionCorrectionId),
CONSTRAINT `ParticlePicker_has_MotionCorrection_fk_particlePickerId`
FOREIGN KEY (`particlePickerId`)
REFERENCES `ParticlePicker` (`particlePickerId`)
ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `ParticlePicker_has_MotionCorrection_fk_motionCorrectionId`
FOREIGN KEY (`motionCorrectionId`)
REFERENCES `MotionCorrection` (`motionCorrectionId`)
ON DELETE CASCADE ON UPDATE CASCADE
);
I managed to create all other tables including the "ParticleClassification_has_InitialModel" table which has a very similar SQL...
Hi @olofsvensson !
Re: the problem with the ParticlePicker_has_MotionCorrection table, I suspect it's because of a mismatch between the primary key datatype "display size" of the MotionCorrection table and the default "display size" for ints in your database system. Or potentially that your MotionCorrection PK column is defined without the UNSIGNED keyword. So have a look at your MotionCorrection table and find the motionCorrectionId column and see what it says. You can then try to create ParticlePicker_has_MotionCorrection again, but specify exactly the same datatype as you just saw for motionCorrectionId column, e.g. "motionCorrectionId int(11) unsigned".
Re: your question about renaming InitialModel to 3DStartingModel, that is possible, but then we should of course also rename its primary key column and also rename the ParticleClassification_has_InitialModel table accordingly. Is everyone happy with 3DStartingModel?
Re: adding a "symmetry" column to the "ParticleClassification" table: Yes, certainly. So to be clear, do you mean a column that would allow a value equal to one of: triclinic, monoclinic, orthorhombic, tetragonal, trigonal, hexagonal, cubic?
Hi @KarlLevik , thanks for feedback!
I have managed to insert data into the ParticlePicker table using (Java) ISPyB web services (https://github.com/ispyb/ISPyB/issues/518).
I have not yet tested what you suggested for the ParticlePicker_has_MotionCorrection table because I have so far just added a simplified table without the constraints for testing. I'm now wondering if the table ParticlePicker_has_MotionCorrection is practical since each ParticlePicker entry could be based on hundreds if not thousands of movies / motion correction entries. I don't think it's useful to enter all this information in the database for these reasons:
It will take time to upload thousands of motion correction ids
I can't see how this information could be displayed in a useful way
Instead of using the table ParticlePicker_has_MotionCorrection I suggest to add two new columns to the ParticlePicker table:
firstMotionCorrectionId
lastMotionCorrectionId
The first link will ensure that the ParticlePicker entry is linked to a proposal / session. Together with the last link they will at least give an idea which movies that were used for the ParticlePicker.
Coming back to the InitialModel table, maybe it's sufficient to give it a prefix: EMInitialModel, or CryoemInitalModel. 3DStartingModel is probably not a good idea since the name starts with a number... My preference would be CryoemInitalModel.
Concerning the symmetry I don't know which symmetries are allowed for CryoEM processing in general and 2D classification in particular, however, we had a case with five-fold symmetry for which the 2D classification would have been both more accurate and faster if we could have provided the symmetry as input. For the database I don't think we should use a enumeration but just a string column.
Hi again @KarlLevik , I have discussed the topic concerning the ParticlePicker_has_MotionCorrection table with out CryoEM scientists here and we have come to these conclusions:
Would this work for you at the DLS?
Hi Olof, I've just updated the SQL (see the first post above) based on your suggestions. I hope I've captured everything.
We'll have to discuss on our side whether we think this will work for us.
Updated 4th March 2021:
Hi @KarlLevik , thanks for these modifications! This will work for me, at least for 2D classification - I have no experience with 3D classification so I guess we might have to do some tweaks when we start to do these.
I have just one semantic issue with the proposed data model... The 'ParticlePicker' table has now a reference to 'particleClassificationProgramId'. This means that this table is more a classification table than a particle picker table. Also, the fact that many individual 'ParticleClassification' entries from one classification are linked to this table makes it, I think, more a classification table than a particle picker table.
There are two possible solution to this semantic issue:
We can introduce a new table between 'ParticlePicker' and 'ParticleClassification'. This would make 'ParticlePicker' only have to deal with particle picking. However, I don't suggest this solution because I see no point in separating the particle picking from the classification.
We rename 'ParticleClassification' to 'ParticleClassificationEntry' (or something like this) and we rename 'ParticlePicker' to 'ParticleClassification'.
I'm open to keep the suggested structure if you feel that this is not sufficiently important semantic problem.
I think we should not postpone fixing any semantic issues, but rather sort them out now while we can!
I agree, let's not add any extra tables to fix semantic issues, but rather rename the tables we have.
How about renaming the ParticlePicker
table to ParticleClassificationGroup
?
Hi @KarlLevik , after having discussed this issue briefly with CryoEM scientists I now think that the best would be to have an intermediate table for the following reason:
Today we have an one-to-one relation between particle picker and 2D classification, however, this might change in the future. Therefore, even if it at this moment is not necessary, I suggest that we keep the ParticlePicker and we introduce a one-to-many relation with a new table ParticleClassificationGroup. Otherwise we will be in trouble if in the future we have more than one program that can do 2D classification from a given set of particles.
This extra table will make the data model slightly more complex but it's a low price to pay in comparison if we have to come back to this data model in the future.
So, my (final) suggestion would be:
ParticlePicker - 1 ParticleClassificationGroup - 1 ParticleClassification
The 'particleClassificationProgramId' would then be moved from ParticlePicker to ParticleClassificationGroup. We can then rename 'particlePickerProgramId' and 'particleClassificationProgramId' to simply 'programId' since they will be unique for each table.
I had a chat with Olof earlier today and we agreed that we do need a new intermediary table between ParticlePicker
(pp) and ParticleClassification
(pc). Some columns from pp and pc will be moved into the new table, which we will call ParticleClassificationGroup
:
pp.particleClassificationProgramId pc.type pc.batchNumber pc.numberOfParticlesPerBatch pc.numberOfClassesPerBatch pc.symmertry
I've updated the SQL script in the issue description above with the new table structure.
We've discovered that ParticleClassification.rotationAccuracy
should actually have data type float
, not int unsigned
.
@olofsvensson , do you agree?
ALTER TABLE `ParticleClassification`
MODIFY rotationAccuracy float COMMENT '???';
Yes @KarlLevik , I do agree - rotationAccuracy should be of type float.
Hi @KarlLevik , we just discovered that there's a column missing in the "ParticleClassification" table: classDistribution. This parameter is important because it's the best parameter for sorting the results of 2D classification in the GUI. Would you agree to add this column to the data model?
Maybe with some explanation!
What is the purpose of this column? This would be a varchar, if I understand correctly? What should be the max length? What kind of data / values would go into this column?
Yes, sorry, I forgot to mention that 'classDistribution' is a float value produced by Relion 2D classification, like rotationAccuracy, translationAccuracy etc.
There is some discussion of what the class distribution is in this email thread: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ccpem;ace0783d.1803. It provides a figure of merit for the class. People like to view classes ordered by the class distribution from highest to lowest
I propose the following SQL to modify the table:
ALTER TABLE ParticleClassification
ADD classDistribution float NULL DEFAULT NULL COMMENT 'Provides a figure of merit for the class, higher number is better';
I'll update the CREATE TABLE
at the top of the thread.
@olofsvensson - At Diamond we have identified another column that we need in the ParticlePicker
table: summaryImageFullPath
.
So if you don't mind, we'd like to propose the following change:
ALTER TABLE ParticlePicker
ADD summaryImageFullPath varchar(255) NULL DEFAULT NULL COMMENT 'Generated summary micrograph image with highlighted particles';
Hi @KarlLevik , ok for me to add summaryImageFullPath
in the ParticlePicker
table.
Excellent. I've updated the CREATE TABLE
statements at the top of the thread, and here is the up-to-date diagram - not much has changed since the version from April:
Hi @KarlLevik , thanks for the new diagram!
I just spotted a mistake in the 'PariclePicker' SQL:
CONSTRAINT `ParticlePicker_fk_particlePickerProgramId`
FOREIGN KEY (`particlePickerProgramId`)
REFERENCES `AutoProcProgram` (`autoProcProgramId`)
ON DELETE NO ACTION ON UPDATE CASCADE,
CONSTRAINT `ParticlePicker_fk_programId`
FOREIGN KEY (`programId`)
REFERENCES `AutoProcProgram` (`autoProcProgramId`)
ON DELETE NO ACTION ON UPDATE CASCADE,
I guess the first constraint should be deleted since the 'particlePickerProgramId' column was renamed to 'programId'.
@olofsvensson Yes, you're correct - have removed that now.
ESRF and Diamond have been discussing new tables and table changes for cryo-EM to accommodate software further downstream in the pipeline than we have previously discussed.
The discussion is still ongoing, but here is the SQL I have at this point - please feel free to comment. I have left some of my own inline comments to show some of the current problems/unknowns.
SQL updated based on feedback in this ticket from Stu and Olof (updated most recently on 5. July 2021):