ispyb / ispyb-database-modeling


EM Data Model #14

Open stufisher opened 7 years ago

stufisher commented 7 years ago

Following up from https://github.com/antolinos/em-model/issues/1, here is the latest EM Data Model that I have put together from Alex's input, DLS Scisoft & EM staff, and EPN EM people:

EM Model

stufisher commented 6 years ago

@antolinos I got some clarification on the per-movie nominaldefocus we were discussing. A value is recorded with each movie, but it is a total guess; the defocus is determined properly by the CTF correction. So it is debatable whether we should store it (it is apparently captured in the xml file).

antolinos commented 6 years ago

Hi @stufisher,

Thanks. We are starting with Scipion and the ISPyB monitors, gathering all the metadata that will be pushed into ISPyB later on. My feeling today is that some parameters will need to be stored per movie. As soon as we have a clear and clean data flow we will share it with you.

olofsvensson commented 6 years ago

Hi @stufisher and @antolinos,

I have been examining the files produced by our CryoEM, and after discussion with Isai here's a suggested list of meta-data we would like to start uploading to ISPyB after each movie acquisition (i.e. before motion correction):

Common meta-data to all movies:

Individual movie meta-data:

This list will probably be extended in the future, however, for now it should get us going.

After a quick inspection of the suggested data model I found that these parameters can be stored without any modification:

However, I don't see how these parameters can be fitted:

Maybe we need to add a new specific "movie" table?

This is just a start of discussion and not a list of requirements written in stone...

stufisher commented 6 years ago

I'm trying to avoid a movie table if at all possible, as it sends us down the same hole as the Image table for MX: we should really avoid saving full paths to jpg, mrc, and xml files. The Image table does not scale well at all, which is why we have abandoned it at DLS (we can assume that long term EM will scale like MX has, so we should think carefully about this now!). We should be able to construct the per-movie jpg, mrc, and xml file names from other variables, as is done for images in MX (DC.fileprefix, DC.imagedirectory, GridImageMap.some sequential number).
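A minimal Python sketch of the template idea described above: collection-level fields plus a sequential index are enough to rebuild the per-movie file names, so no full path needs to be stored per movie. The `{prefix}_{index:04d}` naming pattern, the function name `movie_paths`, and the example directory are all assumptions for illustration, not anything ISPyB defines.

```python
from pathlib import PurePosixPath


def movie_paths(image_directory: str, file_prefix: str, index: int) -> dict:
    """Rebuild per-movie file paths from collection-level fields and an index.

    The '<prefix>_<4-digit index>.<ext>' template is hypothetical.
    """
    stem = f"{file_prefix}_{index:04d}"
    base = PurePosixPath(image_directory)
    return {ext: str(base / f"{stem}.{ext}") for ext in ("mrc", "jpg", "xml")}


# Hypothetical values standing in for DC.imagedirectory, DC.fileprefix
# and the sequential number from GridImageMap.
paths = movie_paths("/data/visit/raw", "grid1", 344)
print(paths["mrc"])
```

This only works when the acquisition software writes file names that follow a fixed template.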

Dose per movie is an interesting one. We know the total dose of the whole experiment; can't we just divide through, or is each movie really unique? Do people actually care, given that this will be calculated properly in MotionCorr afterwards?

The sequential index of the movie is stored in GridImageMap, and I will add a timestamp there too, as that is also required.

olofsvensson commented 6 years ago

Hi @stufisher, I agree that we should think carefully about this now so that we don't end up in the same situation as with the Image table. The situation is not quite the same, though:

The file name from a SR data collection can be found via a template and an image number. This is not true for a Cryo-EM movie file name: FoilHole_19150795_Data_19148847_19148848_20170619_2101-0344.mrc. For each movie many parts of the filename change:

We can find the date, time and the sequential index from the GridImageMap, but where will we be able to find the other foilhole identifiers? You can argue that we don't need them since we have a unique sequential index, however, this is not true for the corresponding mrc, jpg and xml files:

Current data rates from one Cryo-EM instrument are (I guess) about 10-20 movies / minute, while current image rates from one SR data collection are > 1000 images / minute. So, the question is whether the data rate from Cryo-EMs is going to increase significantly in the not too distant future.
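To make the file name problem concrete, here is a rough Python sketch that splits the example movie name quoted above into its variable parts. The field names (`foilhole`, `id1`, `id2`) and the reading of the trailing groups as date, time and sequential index are assumptions based only on this one example, not on any vendor specification.

```python
import re

# Pattern inferred from a single example file name; other acquisition
# software versions may well use a different layout.
MOVIE_NAME = re.compile(
    r"^FoilHole_(?P<foilhole>\d+)_Data_(?P<id1>\d+)_(?P<id2>\d+)"
    r"_(?P<date>\d{8})_(?P<time>\d{4})-(?P<index>\d+)\.(?P<ext>\w+)$"
)


def parse_movie_name(name: str) -> dict:
    """Split a movie file name into its variable parts."""
    match = MOVIE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised movie file name: {name}")
    fields = match.groupdict()
    # The numeric identifiers fit in fixed-size integer columns.
    for key in ("foilhole", "id1", "id2", "index"):
        fields[key] = int(fields[key])
    return fields


fields = parse_movie_name(
    "FoilHole_19150795_Data_19148847_19148848_20170619_2101-0344.mrc"
)
print(fields)
```

Under these assumptions, three numeric identifiers besides the date, time and index would have to be kept somewhere to rebuild the name.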

stufisher commented 6 years ago

We could add some other identifier fields to GridImageMap and store the corresponding numbers; these fields would then have a fixed size and be more scalable than a varchar(255).

i.e.

GridImageMap
  identifier1 int
  identifier2 int

I really want to keep these generic too, and not EM specific if possible.
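As a rough illustration of the generic identifier columns suggested here, the following Python sketch creates a cut-down GridImageMap in an in-memory SQLite database and stores two integers per movie instead of a path string. Only `identifier1` and `identifier2` come from the proposal above; the other column names, the use of SQLite, and the example values are assumptions made so that the sketch runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE GridImageMap (
        gridImageMapId INTEGER PRIMARY KEY,  -- hypothetical key column
        imageNumber    INTEGER NOT NULL,     -- hypothetical sequential index
        identifier1    INTEGER,              -- generic fixed-size identifier
        identifier2    INTEGER               -- generic fixed-size identifier
    )
    """
)

# Values taken from the example file name earlier in the thread; which
# number goes in which column is an arbitrary choice for this sketch.
conn.execute(
    "INSERT INTO GridImageMap (imageNumber, identifier1, identifier2) "
    "VALUES (?, ?, ?)",
    (344, 19150795, 19148847),
)

row = conn.execute(
    "SELECT imageNumber, identifier1, identifier2 FROM GridImageMap"
).fetchone()
print(row)
```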

Why the FEI/Gatan? software can't write sane file names is beyond me...

I think we should try to assume nothing: when ISPyB was designed 10 years ago we didn't expect MX to collect ~1000s of images a second. 1-3 kfps detectors already exist for EM (we make one).

stufisher commented 6 years ago

Following on from yesterday's discussion, I have now added a movie table and deprecated GridImageMap: em_ispyb_model

antolinos commented 6 years ago

Thanks. @olofsvensson and I are still working on the webservices and, even if this will most likely change, I wanted to keep you updated and get your feedback:

So, this is the preliminary structure for the Movie table: image

We propose to rename movieFullPath to moviePath and add some extra fields.

antolinos commented 6 years ago

Hi @stufisher,

This is how it looks now: image

Please have a look as there are some changes due to:

In both cases we are not sure, so some discussion about that would be appreciated.

There are still a few parameters that belong to datacollection:

We are thinking about adding a new specialized table called EMDataCollection to hold these values. It will avoid increasing the number of columns on data collection and will make ISPyB more scalable.

stufisher commented 6 years ago

Please specify these explicitly, the two last points are quite different from each other!

You have undone a lot of my work here. You have renamed a lot of the columns, and I don't understand why. Why not work from our existing schema, rather than starting from scratch?

I had conceded and added a table (movie) to store a single varchar(255) per movie; now you have added another 4 columns of the same dimensions to a table that we discussed is going to be heavily populated and may grow exponentially over time. Can we not determine the xml path from the movieFullPath? I had chosen movieFullPath as the name to be consistent with the other tables in ISPyB.
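A small Python sketch of deriving the companion file paths from a single stored movieFullPath. It assumes the xml and jpg files sit next to the mrc file and share its stem, which the earlier comments in this thread suggest may not always hold; the helper name `sibling_path` and the directory are hypothetical.

```python
from pathlib import PurePosixPath


def sibling_path(movie_full_path: str, suffix: str) -> str:
    """Swap the extension of the stored movie path, e.g. '.mrc' -> '.xml'."""
    return str(PurePosixPath(movie_full_path).with_suffix(suffix))


movie_full_path = (
    "/data/raw/FoilHole_19150795_Data_19148847_19148848_20170619_2101-0344.mrc"
)
xml_path = sibling_path(movie_full_path, ".xml")
jpg_path = sibling_path(movie_full_path, ".jpg")
print(xml_path)
```

If that assumption holds, one varchar(255) per movie is enough and the extra path columns are redundant.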

As I had previously described:

Please can you provide a diff from my schema?

Movie

I'm not sure why you have micrograph or micrograph snapshot in movie? Does a micrograph even exist at this point? A movie is, as it says, a series of frames; is a micrograph not constructed from these via another process => MotionCorrection? (at least the one people will look at)

dosePerImage = dosePerFrame in MotionCorrection (=duplication?)

MotionCorrection

You have added a log file, please remove it. MotionCorrection links to AutoProcProgram, which has a link to AutoProcProgramAttachment, where logs should be stored.

timestamp should be removed; it is catered for by AutoProcProgram as well.

Do we really need another varchar(255) for the dose corrected micrograph?

CTF

You have added a log file, please remove it. CTF links to AutoProcProgram, which has a link to AutoProcProgramAttachment, where logs should be stored.

timestamp should be removed; it is catered for by AutoProcProgram as well.

I'm not sure why you have removed amplitudeContrast from CTF. It is not a function of the movie/datacollection, it is determined by the CTF correction.

Why have spectraImage and spectraImageThumbnail? We don't need both. Why rename from fftTheoretical? (It's a [fast] Fourier transform of the micrograph plus the theoretical one from the CTF function.)

MotionCorrectionDrift

This is the data that makes up the driftPlotFullPath. You will probably show the driftPlotFullPath in EXI; I want access to the raw data. Kevin tells me this data is available to Scipion somewhere.

Lots of other columns have changed name, and I don't understand why...

stufisher commented 4 years ago

Can we try and pick up converging on this model?