ispyb / py-ispyb

ISPyB backend server based on FastAPI
GNU Lesser General Public License v3.0

implement SSX fundamentals #192

Open mgaonach opened 2 years ago

mgaonach commented 2 years ago

SSX_ispyb.docx provides a list of the fundamental values that we would need to store for SSX experiments. This list can be used as a starting point to build a first SSX prototype in py-ISPyB.

KarlLevik commented 2 years ago

I suppose that where it makes sense, we would like to map these values onto existing tables and columns, i.e. re-use what we already have in the schema.

But we shouldn't shy away from creating new tables and columns either, if that makes more sense.

The "Experiment/BL parameters" values could possibly go into existing and new columns in DataCollectionGroup and/or BeamLineSetup?

Many of the "Data collection" values map onto existing columns in the DataCollection table with the same/similar names. The exceptions are maybe Repetition rate and Energy_bandwidth?

Sample/crystal slurry: I wonder if maybe a new table is needed for this? I suppose the Protein value should just be a proteinId FK pointing to a row in the Protein table. What about Buffer? Is that also a pointer to a row in Protein? Or to BLSample or something else?

"Crystfel_processing", "PyFAI_parameters" and "Crystfel_merging" all sound like different steps in a processing pipeline. At Diamond, we would perhaps consider putting those values into the tables ProcessingJob, ProcessingJobParameter (a key-value table), AutoProcProgram and AutoProcProgramAttachment.

(My understanding of SSX is not great so far, so I'll have to come back to this later when I know more ...)

mgaonach commented 2 years ago

The idea we had was to use existing tables as much as possible when they match, and then extend them with new tables dedicated to SSX when extra columns are required. The goal is to avoid adding even more columns to some already big and not very well documented tables, which would make them harder to understand and maintain.

Example: Use DataCollection for all the columns that are already there and create SSXDataCollection with the new SSX-dedicated columns and a PK referencing the DataCollection tuple for the other columns.
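A minimal sketch of this satellite-table pattern. The columns shown are a tiny illustrative subset, not the real DDL, and SQLite in-memory is used purely for brevity (the production schema is MariaDB):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The core table keeps only the columns shared across disciplines.
cur.execute("""
    CREATE TABLE DataCollection (
        dataCollectionId INTEGER PRIMARY KEY,
        exposureTime REAL,
        transmission REAL
    )""")

# SSX-specific columns live in an extension table whose primary key
# is also a foreign key to DataCollection (a 1-to-1 extension).
cur.execute("""
    CREATE TABLE SSXDataCollection (
        dataCollectionId INTEGER PRIMARY KEY
            REFERENCES DataCollection (dataCollectionId),
        repetitionRate REAL,
        energyBandwidth REAL
    )""")

cur.execute("INSERT INTO DataCollection VALUES (1, 0.01, 100.0)")
cur.execute("INSERT INTO SSXDataCollection VALUES (1, 1000.0, 0.02)")

# A single join reassembles the full SSX data collection.
row = cur.execute("""
    SELECT dc.exposureTime, ssx.repetitionRate
    FROM DataCollection dc
    JOIN SSXDataCollection ssx USING (dataCollectionId)
    """).fetchone()
print(row)  # (0.01, 1000.0)
```

The upside of this shape is that the big shared table stays untouched; the cost is one extra join whenever the SSX fields are needed.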

I think we will indeed need a new SSXSample table, with FK to Protein and to a new SSXBuffer table (we still need to define what will need to be stored for this buffer entity).

At first, we will focus on the experiment, sample and data collection part of the API, and in a second step come back to the processing API.

Anyways, this is a first and very early stage version of the requirements. A lot of elements still need to be cleared up with the scientists, and we also need to get a better understanding of SSX with their help!

stufisher commented 2 years ago

I think we should probably try to discuss this in a dev meeting before development starts; did the document come from ESRF internal meetings?

In previous ISPyB meetings we had discussed using DataCollection with linker tables for discipline-specific new parameters. Most of what you want is provided by DataCollection, so we should try to use this as the entry point to avoid duplication; this also means you don't need to create entirely new resources for SSX.

There are certainly questions about sample definition and linkages there.

ndevenish commented 2 years ago

I remember quite a well-developed prototype data model for ISPyB SSX tables; has that work now been abandoned?

stufisher commented 2 years ago

We have monoBandwidth in DiffractionPlan we could add this to DataCollection, it has the same meaning as energyBandwidth.

Looks like experiment / bl parameters should be merged into DataCollection too. DataCollection holds the experiment parameters for this particular datacollection.

Looking more closely, almost everything already exists...

Experiment_type (chip/injector) -> DataCollectionGroup.experimentType: extend the enum with SSX Chip, SSX Injector (or better yet use the new ExperimentType table from @KarlLevik and DataCollectionGroup.experimentTypeId)
Sample support -> should be in the sample definition? Unsure
Jet_material -> should be in the sample definition? Unsure
Mono_stripe -> DataCollection. Not sure what this is technically? Stripes are usually related to focusing mirrors; what mono is this?
Beam_Size -> DataCollection.beamSizeAtSampleX, DataCollection.beamSizeAtSampleY
Temperature -> DataCollection.averageTemperature

Exp.time -> DataCollection.exposureTime
Transmission -> DataCollection.transmission
Flux -> DataCollection.flux
Repetition rate -> new?
Energy_bandwidth -> DataCollection.monoBandWidth (needs to be copied over from DiffractionPlan)
Beam_center_x -> DataCollection.xBeam
Beam_center_y -> DataCollection.yBeam
Det_distance -> DataCollection.detectorDistance (and ultimately DataCollection.resolution)
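The "copy monoBandwidth over from DiffractionPlan" step could be a one-off backfill. A sketch, assuming for illustration a direct diffractionPlanId link on DataCollection (the real ISPyB linkage may run through other tables, and the tables here are heavily reduced):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Reduced, illustrative tables only.
cur.execute("""
    CREATE TABLE DiffractionPlan (
        diffractionPlanId INTEGER PRIMARY KEY,
        monoBandwidth REAL
    )""")
cur.execute("""
    CREATE TABLE DataCollection (
        dataCollectionId INTEGER PRIMARY KEY,
        diffractionPlanId INTEGER REFERENCES DiffractionPlan (diffractionPlanId),
        monoBandwidth REAL
    )""")
cur.execute("INSERT INTO DiffractionPlan VALUES (10, 0.02)")
cur.execute("INSERT INTO DataCollection VALUES (1, 10, NULL)")

# Backfill the new column from the plan, only where it is still empty.
cur.execute("""
    UPDATE DataCollection
    SET monoBandwidth = (
        SELECT dp.monoBandwidth
        FROM DiffractionPlan dp
        WHERE dp.diffractionPlanId = DataCollection.diffractionPlanId)
    WHERE monoBandwidth IS NULL
    """)

bandwidth = cur.execute(
    "SELECT monoBandwidth FROM DataCollection WHERE dataCollectionId = 1"
).fetchone()[0]
print(bandwidth)  # 0.02
```

The `WHERE monoBandwidth IS NULL` guard makes the backfill safe to re-run.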

When it comes to sample definition, I think we should probably avoid linkages to SAXS tables (as these could be refactored, long in the future)

Protein / Buffer -> depends on the complexity of the definition. We have Screen, ScreenComponent (linked to Protein) to define multi-component (protein) buffers for crystallisation. Protein can represent small molecules, DNA and RNA, as well as "proteins"
Avg_xtal_size -> some of these things fit into Crystal
Xtal_concentration -> concentration of crystals within the stream? What units?

Then potentially DataCollection_has_SSX_Sample as linker or DataCollection.ssxSample?

What this ultimately means is that you don't need to create an SSX data collection endpoint; you can use the existing multi-discipline /events resource I created.

stufisher commented 2 years ago

I'll come back to processing. CrystFEL processing is basically integration, so it should map onto AutoProc/Integration with some modifications. Plots can be added to AutoProcProgramAttachment

Not sure where to store the PyFAI params; as @KarlLevik says, it would be possible to use ProcessingJobParameter for this

mgaonach commented 2 years ago

Here is the interpretation I had of what we could do, excluding the processing section. We will have an internal meeting with the scientists on Monday to clear up the requirements.

[schema diagram: SSXExperiment]

stufisher commented 2 years ago

Looks good, I think you can probably merge SSXExperiment and SSXDataCollection.

energyBandwidth could go into DataCollection as monoBandwidth, as other beamlines also make use of pink and white beam data collections. E.g. id01, which I am about to deploy daiquiri on, so it could be useful for non-MX too.

mgaonach commented 2 years ago

> Looks good, I think you can probably merge SSXExperiment and SSXDataCollection.

Yes, sounds like a good idea to simplify things!

stufisher commented 2 years ago

I think we can probably generalise SSXBuffer as well using Crystal, BLSampleTypeHasComponent, and Protein

stufisher commented 2 years ago
[screenshot: schema diagram, 2022-07-15]

Here's how we can handle creating complex samples with the core tables (without needing to link to the BioSAXS tables)

Some other thoughts:

Historically some developments have created new tables / resources per discipline. This makes it hard to share features. Instead we should try to create tables that support features. MX and other techniques can also benefit from ligands and buffers (or generic ways to describe chemical and other compositions).

We should avoid creating yet another data collections resource. This should be handled by the events route; we don't want to create a new dc resource for every discipline as it adds large chunks of duplicated code (every data collection list will want to filter by sampleId, time, beamline, etc.). We can join the additional tables to the query; this way we have a single data collection resource that can serve multiple disciplines. It could also serve EM with minor modifications (and does at DLS).

Try to keep to naming standards: SSXSample, not Specimen (Specimen is a BioSAXS deviation).

We don't need all the sample information with the dc; this information can be queried separately. Split the queries and add a link in the UI from the dc page to a sample page. We traditionally just return the sample name and protein acronym with the dc.

Now we only have 3 missing dc columns and 1 missing sample column; do we need new tables? energyBandwidth and monoStripe are useful outside of SSX and should probably be added directly to DataCollection

SSX/DataCollection:
repetitionRate float DEFAULT NULL
energyBandwidth float DEFAULT NULL
monoStripe varchar(255) DEFAULT NULL

SSX/BLSample:
jetMaterial varchar(255) DEFAULT NULL

mgaonach commented 2 years ago

Outdated - jump to https://github.com/ispyb/py-ispyb/issues/192#issuecomment-1219384580

Here is the new iteration of the schema we intend to use for handling SSX data collections and samples.

Existing tables are on the left side and new tables on the right side.

The idea here is to have something generic, without trying to force everything into the existing and not always appropriate tables (i.e. the Protein, BlSampleType_has_Component etc. tables).

New features include:

[schema diagram: SSX_dc_sample]

Here is how an SSX data collection would fit in this schema:

Each DataCollection entity belongs to a DataCollectionGroup. In the use case of SSX, 1 DataCollectionGroup = 1 DataCollection = 1 SSXDataCollection (SSXDataCollection is just an extension of DataCollection with SSX-specific fields). This DataCollectionGroup is linked to the BLSample that was analyzed. This BLSample is created and used only for this DataCollectionGroup. It is made of:

Then we need to know when that new ligand was mixed with the sample so that we can figure out how long it has been in contact with the protein when each image was taken. For that we use the Sequence. We can create a sequence named Sample preparation and add a SequenceEvent at the time of the mixing with SequenceEventType = ReactionTrigger.

This sequence could also be used for laser excitation before image acquisition but this use case is not well defined yet so should be discussed later.
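As a sketch of how such an event would be consumed downstream (all names and values below are invented for illustration): once the mixing time is recorded as a ReactionTrigger event, the ligand contact time for any image is just the difference between the image timestamp and the event timestamp:

```python
from datetime import datetime, timedelta

# Hypothetical values: the ReactionTrigger event from the
# "Sample preparation" sequence, plus per-image acquisition times.
mixing_time = datetime(2022, 7, 15, 9, 0, 0)
image_times = [mixing_time + timedelta(milliseconds=10 * i) for i in range(1, 4)]

# Contact time: how long the ligand had been mixed with the protein
# when each image was taken, in milliseconds.
contact_ms = [(t - mixing_time) / timedelta(milliseconds=1) for t in image_times]
print(contact_ms)  # [10.0, 20.0, 30.0]
```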

ndevenish commented 2 years ago

SampleEvent seems to be per-image; does that mean (potentially) millions of rows per data session?

I was surprised for similar reasons that Image is an existing table; although I note that we apparently don't use it - I can only see 137 rows on our internal instance.

rjgildea commented 2 years ago

Is Sequence the best choice of table name, when "sequence" tends to have a very specific meaning in structural biology?

rjgildea commented 2 years ago

Are all of these 1-to-many relationships actually 1-to-many, or are some of them intended to be 1-to-1 in practice? E.g. if BLSample -> SampleComposition is 1-to-many, I don't see a link from data collection to a row in SampleComposition.

stufisher commented 2 years ago

> Are all of these 1-to-many relationships actually 1-to-many, or are some of them intended to be 1-to-1 in practice? E.g. if BLSample -> SampleComposition is 1-to-many, I don't see a link from data collection to a row in SampleComposition.

Both composition tables are 1-to-many

Crystal -> CrystalComposition -> Component allows:
HEPES, Buffer, 0.5M, pH 7.5
NaCl, Salt, 0.75M
i.e. what was in my sample before I did my experiment.

Then during an experiment you maybe add one or more ligands, additives, etc. BLSample -> SampleComposition -> Component allows:
mysecretligand, Ligand, 0.1M
Glycerol, Additive, 10%

You're right though, maybe a link here to DataCollection.dataCollectionId would be sensible for context
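A sketch of how the two linker-table walks compose. The table subset and column names here are guesses based on this thread, not the real DDL, and SQLite is used only for brevity:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
-- Illustrative subset of the proposed tables.
CREATE TABLE Component (componentId INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE Crystal (crystalId INTEGER PRIMARY KEY);
CREATE TABLE CrystalComposition (
    crystalId INTEGER REFERENCES Crystal (crystalId),
    componentId INTEGER REFERENCES Component (componentId),
    concentration TEXT);
CREATE TABLE BLSample (blSampleId INTEGER PRIMARY KEY, crystalId INTEGER);
CREATE TABLE SampleComposition (
    blSampleId INTEGER REFERENCES BLSample (blSampleId),
    componentId INTEGER REFERENCES Component (componentId),
    concentration TEXT);

INSERT INTO Component VALUES
    (1, 'HEPES', 'Buffer'), (2, 'NaCl', 'Salt'),
    (3, 'mysecretligand', 'Ligand'), (4, 'Glycerol', 'Additive');
INSERT INTO Crystal VALUES (1);
INSERT INTO CrystalComposition VALUES (1, 1, '0.5M'), (1, 2, '0.75M');
INSERT INTO BLSample VALUES (1, 1);
INSERT INTO SampleComposition VALUES (1, 3, '0.1M'), (1, 4, '10%');
""")

# Everything in the beam for sample 1: the crystal's own composition
# plus anything added at the sample level during the experiment.
rows = cur.execute("""
    SELECT c.name, c.type, cc.concentration
    FROM BLSample s
    JOIN CrystalComposition cc ON cc.crystalId = s.crystalId
    JOIN Component c ON c.componentId = cc.componentId
    WHERE s.blSampleId = 1
    UNION ALL
    SELECT c.name, c.type, sc.concentration
    FROM SampleComposition sc
    JOIN Component c ON c.componentId = sc.componentId
    WHERE sc.blSampleId = 1
    """).fetchall()
for row in rows:
    print(row)
```

The UNION of the two walks is exactly the "what was actually in front of the beam" list discussed above.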

mgaonach commented 2 years ago

> SampleEvent seems to be per-image; does that mean (potentially) millions of rows per data session?
>
> I was surprised for similar reasons that Image is an existing table; although I note that we apparently don't use it - I can only see 137 rows on our internal instance.

Yes, millions of rows for each session would be problematic. It is still a bit unclear what will be required on the processing side, and whether we need to have a record for each individual image or not. This schema also allows saving a period and repetition, which lets us save all image acquisitions in a single record if they are acquired at a specific frequency - without using the image table. So hopefully we can make it fit in there when the use cases become clearer.
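A sketch of the period/repetition idea under discussion (field names are illustrative): one stored record stands in for a whole burst of acquisitions, and the individual times are recomputed as timestamp + n * period:

```python
from datetime import datetime, timedelta

def expand_event(timestamp, period_s, repetition):
    """Expand a single (timestamp, period, repetition) record into the
    individual event times it represents (names are illustrative)."""
    return [timestamp + timedelta(seconds=n * period_s) for n in range(repetition)]

# One row standing in for 5 images acquired at 100 Hz.
start = datetime(2022, 7, 15, 9, 0, 0)
times = expand_event(start, 0.01, 5)
print(times[-1] - start)  # 0:00:00.040000
```

This only works when the acquisition really is periodic; any irregular gap (e.g. stage moves) would force either extra records or a richer encoding.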

rjgildea commented 2 years ago

> Are all of these 1-to-many relationships actually 1-to-many, or are some of them intended to be 1-to-1 in practice? E.g. if BLSample -> SampleComposition is 1-to-many, I don't see a link from data collection to a row in SampleComposition.
>
> Both composition tables are 1-to-many
>
> Crystal -> CrystalComposition -> Component allows: HEPES Buffer 0.5M pH 7.5, NaCl Salt 0.75M - i.e. what was in my sample before I did my experiment.
>
> Then during an experiment you maybe add one or more ligands, additives, etc.: BLSample -> SampleComposition -> Component allows: mysecretligand Ligand 0.1M, Glycerol Additive 10%
>
> You're right though, maybe a link here to DataCollection.dataCollectionId would be sensible for context

Thanks, I see - so would a single BLSample correspond to e.g. a single fixed-target chip, but it can change during the course of an experiment?

stufisher commented 2 years ago

> Are all of these 1-to-many relationships actually 1-to-many, or are some of them intended to be 1-to-1 in practice? E.g. if BLSample -> SampleComposition is 1-to-many, I don't see a link from data collection to a row in SampleComposition.
>
> Both composition tables are 1-to-many. Crystal -> CrystalComposition -> Component allows: HEPES Buffer 0.5M pH 7.5, NaCl Salt 0.75M - i.e. what was in my sample before I did my experiment. Then during an experiment you maybe add one or more ligands, additives, etc.: BLSample -> SampleComposition -> Component allows: mysecretligand Ligand 0.1M, Glycerol Additive 10%. You're right though, maybe a link here to DataCollection.dataCollectionId would be sensible for context
>
> Thanks, I see - so would a single BLSample correspond to e.g. a single fixed-target chip, but it can change during the course of an experiment?

From my vague understanding (@mgaonach knows more) this is more specifically for jet-based mixing, rather than chips

ndevenish commented 2 years ago

I think the comment was that if there isn't a link to/from SampleComposition then it doesn't make sense - the only way to walk to the SampleComposition would be through DataCollectionGroup/Sample (in which case there is no meta-information to tell which one you used). So, a link from SSXCollection to sample composition would seem to make sense.

Additional queries:

ndevenish commented 2 years ago

> This schema also allows saving a period and repetition, which lets us save all image acquisitions in a single record if they are acquired at a specific frequency - without using the image table. So hopefully we can make it fit in there when the use cases become clearer.

I'm not sure this works with the table as-is - it has a timestamp field, which implies a global fixed point in time (to match an image). I could imagine it being more useful as e.g. a "Plan" - I'm not sure how often it's a simple single pulse with delay and not a structure of multiple pulses (which could be a reason for multiple SequenceEvents per collection).

> we can make it fit in there when the use cases become more clear.

I'm not sure how much it's worth planning fixed-in-stone data structures if we don't know what they are supposed to be used for or what the use cases are. That seems like it will inevitably end up with something complicated to implement because it's trying to be generic, but not generic enough to be useful.

mgaonach commented 2 years ago

> I think the comment was that if there isn't a link to/from SampleComposition then it doesn't make sense - the only way to walk to the SampleComposition would be the one through DataCollectionGroup/Sample (in which case no metainformation to tell which one you used). So, a link from SSXCollection to sample composition would seem to make sense.

I am not entirely sure I understand, but I think it comes down to something unclear in the schema: the relation between DataCollectionGroup and BLSample is intended to be 1-to-1 in practice. A DataCollection is part of a DataCollectionGroup (in principle only one DataCollection per DataCollectionGroup in the case of SSX). This DataCollectionGroup is associated with a BLSample, which is composed of a list of many Component. So from a DataCollection you can determine the list of Component that were actually in what has been analyzed. Hope this clears things up.

> I'm not sure this works with the table as-is - it has a timestamp field, which implies a global fixed point in time (to match an image).

In the case of a single event (for instance, ligand mixed), the timestamp is the time at which this event happened. In the case of a repeated event, the timestamp is the first iteration and we can get the others with period and repetition. Maybe we can make this clearer by splitting it into SingleSequenceEvent and RepeatedSequenceEvent tables.

> I'm not sure how much it's worth planning fixed-in-stone data structures if we don't know what they are supposed to be used for or what the use cases are.

This is a prototype and in no way fixed in stone. As of now we have a good idea of what needs to be saved, and we're iterating on solutions as it gets refined. This can - and definitely will - be modified before ending up in production.

ndevenish commented 2 years ago

> I am not entirely sure I understand, but I think it comes down to something unclear in the schema: the relation between DataCollectionGroup and BLSample is intended to be 1-to-1 in practice. A DataCollection is part of a DataCollectionGroup (in principle only one DataCollection per DataCollectionGroup in the case of SSX). This DataCollectionGroup is associated with a BLSample, which is composed of a list of many Component. So from a DataCollection you can determine the list of Component that were actually in what has been analyzed. Hope this clears things up.

The relationship from DCG to BLSample is ~1:1, but often the same ~sample~ ~crystal~ "literal thing you put in front of the beam" is shot several times (in separate groups; DCG:DataCollection are mostly 1:1 for us). I'm not sure this makes sense for serial. What I was considering for internal implementation with the current schema was: BLSample == a single fixed-target Chip, autogenerated for them (instead of having to manually register every single chip). The limiting step for us on that is that there isn't a good way for users to physically enter the crystal information alone (without going through the whole container/dewar workflow).

> is associated with a BLSample, which is composed of a list of many Component

Now, here, a BLSample isn't composed of a list of many Component - it's composed of many SampleComposition, which are themselves composed of many Component, but also of a single (named) ConcentrationType, which is composed of many CrystalComposition, each of which can also be a component.

If BLSample is supposed to be 1:1 "in practice" to data collections (which, I think I agree with, and was how I thought about the problem), then what is the meaning of multiple SampleComposition for each BLSample? If a "Composition" is mixed with both crystalline and non-crystalline parts, then what is ConcentrationType for - which seems to be a mixture of only crystals?

Is the idea that, a "Component" is essentially something mixed into a slurry (a SampleComposition), that can contain crystals (which are also themselves a Component) but also non-crystalline things that are also mixed into the slurry (hence linked to SampleComposition directly).

> In the case of a repeated event then the timestamp is the first iteration and we can get the others with period and repetition.

"first iteration" is still per-frame, right? A single image might have several pulses of the laser, or a light pulse and then a defined waiting time and then a substrate injection. Or the idea was that the timestamp is the start of the collection delay is set the same as 1 / fps (in which case how do you handle varying gap times between shots as e.g. the goniometer moves to a different part of the chip).

> This is a prototype and in no way fixed in stone. As of now we have a good idea of what needs to be saved and we're iterating on solutions as it gets refined. This can - and definitely will - be modified before ending up in production.

Okay - so, what led it to be in this shape; what problem is it solving, or what experiment has it been added for? What is it trying to represent?


Maybe this is clear to everyone else, but for me to understand, it might help to have some specific "story" examples of "a serial experiment" and specifically what constraints those have imposed on the complexity here. That might invert the problem from "What is the purpose of this table and its constraints?" to "This complexity is required to solve problem X".

mgaonach commented 2 years ago

Outdated - jump to https://github.com/ispyb/py-ispyb/issues/192#issuecomment-1269629663

Here is a corrected version of the schema. I fixed the 1-to-many relations that were actually 1-to-1 and removed the misleading link to Image (this part was a very rough idea waiting for further requirements; the idea behind this prototype was focused on the sample part). As before, existing tables are on the left side and new tables on the right side.

[schema diagram: SSX2]

Here is how an SSX data collection would fit in this schema:

Each DataCollection entity belongs to a DataCollectionGroup. In the use case of SSX, 1 DataCollectionGroup = 1 DataCollection = 1 SSXDataCollection (SSXDataCollection is just an extension of DataCollection with SSX-specific fields). This DataCollectionGroup is linked to the BLSample that was analyzed. This BLSample is created and used only for this DataCollectionGroup. It is made of:

Then we need to know when that new ligand was mixed with the sample so that we can figure out how long it has been in contact with the protein when each image was taken. For that we use the Sequence. We can create a sequence named Sample preparation and add a SequenceEvent at the time of the mixing with SequenceEventType = ReactionTrigger.

This sequence is also used to describe data acquisition steps as a set of X-ray exposure, X-ray detection and laser excitation events with a specific period and number of repetitions.

mgaonach commented 1 year ago

Updated previous comment with latest data model. For UI, please refer to https://github.com/ispyb/py-ispyb-ui/issues/15.

KarlLevik commented 1 year ago

If the relationship between DataCollection and SAXSDataCollection is truly 1-to-1, then the PK of SAXSDataCollection should be dataCollectionId, and there is no need for saxsDataCollectionId.

The same goes for the relationship between DataCollection and SSXHits - the PK of SSXHits should be dataCollectionId.

KarlLevik commented 1 year ago

And yes, please consider using DataCollectionGroup.experimentTypeId -> ExperimentType table, which I think Stu mentioned. There are several advantages to using a lookup table instead of this enum since it has so many enum options now. Although, perhaps that should be its own ticket since it will have impact beyond SSX.

I'm a little bit concerned about the potential size of the GraphData table. How many such rows do we expect per Graph? An alternative could be to put the xy data in files and reference them through the DataCollectionFileAttachment table (fileType = 'xy'). That said, the advantage of storing the data in the database might outweigh the inconvenience of dealing with a potentially big table.

Re: the naming of Sequence, SequenceEvent, SequenceEventType: If we want shorter names, an alternative could be to remove the "Sequence" prefix in the two latter tables. You could also rename Sequence to EventChain if you think that makes the relationship clearer.

Re: the name of SSXHits: The number of hits is one value stored in the table, but there are other values as well. So, perhaps we should have a name that is a bit more generic? A name hinting at the fact that the table contains SSX processing results?

mgaonach commented 1 year ago

> If the relationship between DataCollection and SAXSDataCollection is truly 1-to-1, then the PK of SAXSDataCollection should be dataCollectionId, and there is no need for saxsDataCollectionId.

That's a very good point! I will update the model with that.

> please consider using DataCollectionGroup.experimentTypeId -> ExperimentType table

I think this table is not in the schema we have. But I agree this enum is not very convenient. We should look into it.

> I'm a little bit concerned about the potential size of the GraphData table. How many such rows do we expect per Graph? An alternative could be to put the xy data in files and reference them through the DataCollectionFileAttachment table (fileType = 'xy'). That said, the advantage of storing the data in the database might outweigh the inconvenience of dealing with a potentially big table.

Yes, I was hesitating between these two options. It is currently used for the unit cell parameter frequency graphs (which are divided into 100 bins = 100 rows per graph - I think we could reduce the number of bins), and we have 6 parameters, so 6 graphs = 600 rows per data collection. I went with database storage to see how it works, but if you have any experience/input on which would be best, it would be very welcome.

> Re: the naming of Sequence, SequenceEvent, SequenceEventType: If we want shorter names, an alternative could be to remove the "Sequence" prefix in the two latter tables. You could also rename Sequence to EventChain if you think that makes the relationship clearer.

Sounds good to me. Getting rid of Sequence would also be good to avoid any confusion with the biological meaning.

> Re: the name of SSXHits: The number of hits is one value stored in the table, but there are other values as well. So, perhaps we should have a name that is a bit more generic? A name hinting to the fact that the table contains SSX processing results?

Yes, I was thinking of renaming it to SSXDataCollectionProcessing and adding another SSXExperimentProcessing for the experiment-wide results. If we move to file-based graph data for the unit cell parameters, the path could also be a column of SSXDataCollectionProcessing.

Thank you very much for your feedback!

stufisher commented 1 year ago

I was also thinking SSXHits.latticeType should be expanded out into its components cell_a, b, c, alpha, beta, gamma. Storing this as a string makes mining hard. There might already be a table for unit cell parameters; I'll have to check.

You should also link SSXDataCollectionProcessing to AutoProcProgram; this is the generic way to indicate a process launched/ended (it gives us timing and success stats). Then from that you can also attach graphs and files using AutoProcProgramAttachment.

mgaonach commented 1 year ago

Update with the latest schema version: [schema diagram: SSX2]

Here is how an SSX experiment would fit in this schema:

A DataCollectionGroup is a set of DataCollection on the same physical sample (chip, jet...). Every n images acquired from the sample (for instance n = 10000), we stop the data collection and start a new one. In the use case of SSX, 1 DataCollectionGroup = x DataCollection = x SSXDataCollection (SSXDataCollection is just an extension of DataCollection with SSX-specific fields). This DataCollectionGroup is linked to the BLSample that was analyzed. This BLSample is created and used only for this DataCollectionGroup. It is made of:

A Crystal, which can be re-used across many samples, when the same crystal source is used in different samples. This crystal is itself composed of:

Then the sample also contains some other Components that have been added specifically for this DataCollectionGroup, defined through the linker table SampleComposition on the same principle as CrystalComposition.

Let's say we added a new Ligand during the data collection to see what happens. Then we need to know when that new ligand was mixed with the sample so that we can figure out how long it has been in contact with the protein when each image was taken. For that we use the EventChain. We can create an EventChain named Sample preparation and add an Event at the time of the mixing with EventType = ReactionTrigger.

This EventChain is also used to describe data acquisition steps as a set of X-ray exposure, X-ray detection and laser excitation events with a specific period and number of repetitions.

After each DataCollection is complete some processing will run to determine the number of crystal hits, indexed images and statistics on unit cell parameters. This information is saved in SSXDataCollectionProcessing with a link through AutoProcProgram and AutoProcProgramAttachment to the file containing the unit cell parameters information.

rjgildea commented 1 year ago

Thanks for the comprehensive description! One thing that is still not entirely clear from your description is whether a single EventChain is intended to apply to every image in the data collection (this seems to me to be the sane thing to do) or whether every image should have its own EventChain? If the former, is datetime the appropriate data type for Event.time? Perhaps Event.offset with data type float would be more meaningful?

What does Event.repetition represent in this context? How would you e.g. represent a fixed-target dose series of 10 images collected per position on the fixed-target chip? Would this be a single X-ray detection event with offset corresponding to the first image, with repetition = 10? Or 10 X-ray detection events, each with their own offset corresponding to the start of the nth image?

mgaonach commented 1 year ago

The way I thought about it is that an EventChain represents what happened during the whole data collection. So in your example it would be more like the second solution: 10 X-ray detection events, each with their own offset = start of the nth image, repetition = number of positions in the chip, and period = time interval between two positions. And then we could add after all of them a Move target event to represent the position change, with the same repetition and period values.

As for absolute time vs offset (relative to DataCollection.start?), I'll check with the beamline people which seems to make more sense to them. It might also be clearer to have something like startOffset/Time or firstOffset/Time instead.
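To make the dose-series interpretation above concrete, here is a toy expansion of the proposed Event records. The field names follow the discussion (offset, period, repetition) and all values are invented:

```python
from datetime import datetime, timedelta

# Hypothetical fixed-target dose series: 10 images per position,
# visited at n_positions chip positions with a fixed period.
dc_start = datetime(2022, 7, 15, 9, 0, 0)
n_images = 10          # images per position -> 10 "X-ray detection" Events
n_positions = 3        # Event.repetition
period = 0.5           # Event.period: seconds between two chip positions
frame_gap = 0.01       # offset between consecutive images at one position

# One Event row per image index: offset of the nth image,
# repeated once per chip position.
events = [{"name": f"X-ray detection {i}", "offset": i * frame_gap,
           "repetition": n_positions, "period": period}
          for i in range(n_images)]

def expand(event):
    """Absolute acquisition times encoded by one Event record."""
    return [dc_start + timedelta(seconds=event["offset"] + r * event["period"])
            for r in range(event["repetition"])]

all_times = sorted(t for e in events for t in expand(e))
print(len(all_times))  # 30
```

So 10 Event rows encode all 30 acquisitions; the alternative (one event with repetition = 10) could not express the per-position timing without extra structure.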