Open KarlLevik opened 5 years ago
At the recent ISPyB meeting in Lund, the steering committee decided the main focus should be MX, EM and BioSAXS.
In light of this, I believe the changes proposed above still make a lot of sense.
Protein
: Protein crystallography is just one type of MX. ComponentType
allows the Component
to be protein, DNA, RNA, small molecule and anything else appropriate we can think of within MX (viruses, ribosomes?). Knowing which kind of sample you're dealing with is of course crucial information to the processing pipelines. EM looks mostly at proteins and viruses, BioSAXS mostly proteins.
ComponentSubType
and Component_has_SubType
allow a Component
to have e.g. salt, buffer and precipitant, which is useful in BioSAXS.
Crystal
: EM and BioSAXS don't use crystals, so a more universal name would make sense.
DiffractionPlan
: Again, "diffraction" doesn't make sense for EM and BioSAXS.
Briefly discussed at VC meeting today. It was agreed that ESRF would come up with a proposal that would potentially incorporate/merge the Macromolecule
table currently only used for BioSAXS. 'Macromoluecule' may be a better term than 'component' anyway? Opinions appreciated.
It was also agreed that we could use a separate extra face-to-face meeting to address the issues in the current database schema, aiming to maybe address technical debt, incorporating more universal terms etcetera. (Soleil offered to host the meeting, potentially in September. Potential dates will be circulated.)
It would be worth looking carefully into the BioSAXS table structure, there is much duplication of core tables, and i think a lot could be merged (or make use of existing tables). Does anyone have the biosaxs structure diagram to hand?
Macromolecule is making assumptions, component is more generic. If we are going to fix things we should not tie ourselves to specific disciplines. Powder diffraction folk dont refer to their samples as macromolecules, and say none biosaxs saxs users wanted to register their samples? I realise the focus is EM/BiosSAXS/MX, but lets not tie ourselves to those things if we can.
Maybe better names would be: Protein ➞ SampleMaterial Crystal ➞ SampleProperties BLSample ➞ SampleInstance
I think these better represent the data with which we'll be populating these tables, and should also be fairly easy to understand.
agreed, those sound better
Hi @stufisher BioSAXS schema can be found here: https://github.com/ispyb/ISPyB/blob/master/documentation/database/ISPyB_DataModel_5.11.1.png
I am still not convinced, even if the Crystal table is renamed to SampleProperties, the columns in the table still represent a crystal: space group, unit cell, etc... and it will never be a generic table for properties of a generic sample at least:
SampleProperties -> CrystalSampleProperties
-> PowderSampleProperties
Besides, I wonder if SampleProperties should be in between sampleMaterial and sampleInstance. Either it defines properties of the sample or of the material but not sure if it should define both at the same time...
Just my opinion!
Question for you: where do you store the properties of a sample for powder diffraction today?
We added a single field theorecticalDensity to crystal for powder diffraction. Ironically we mostly bypass crystal for mx (every sample gets a new crystal = simple workflow, crystal is a huge assumption for mx, you dont know what the crystal form is until you collect the data), and it actually much heavily used in powder diffraction where they do something like:
define a phase (protein) define a sample made of multiple phases (crystal, with one or more proteins) define an instance of a sample (blsample of a crystal)
We have to have a think about what "generic" means, how many fields do we envisage to describe a sample/material. To get an understanding of this we have to know about what other types of materials are studied (which @KarlLevik is looking in to), and which disciplines benefit from something like ispyb.
As discussed in https://github.com/ispyb/ispyb-database-modeling/issues/45 tables with many columns are not bad in terms of storage, and in my opinion technically simpler than joining many extra tables (which you will have to do anyway, and outer join).
We added a single field theorecticalDensity to crystal for powder diffraction. Ironically we mostly bypass crystal for mx (every sample gets a new crystal = simple workflow, crystal is a huge assumption for mx, you dont know what the crystal form is until you collect the data),
This is the reason why I said if SampleProperties (crystal) should be in between
and it actually much heavily used in powder diffraction where they do something like:
define a phase (protein) define a sample made of multiple phases (crystal, with one or more proteins) define an instance of a sample (blsample of a crystal)
We have to have a think about what "generic" means, how many fields do we envisage to describe a sample/material. To get an understanding of this we have to know about what other types of materials are studied (which @KarlLevik is looking in to), and which disciplines benefit from something like ispyb.
@KarlLevik might want to have a look at this if not done yet: http://download.nexusformat.org/doc/html/classes/base_classes/NXsample.html?highlight=nxsample
As discussed in #45 tables with many columns are not bad in terms of storage, and in my opinion technically simpler than joining many extra tables (which you will have to do anyway, and outer join).
As you know, my main concern is not the storage but the (software) maintenance. I find that using the same column for different purposes (even if clarified in the comments) depending on the technique is, in general, a bad idea prone to mistakes. Having many columns in a table, from the performance point of view, might not be a big deal but it might cause confusion to undestand what is stored on it and then can lead to errors. I am not saying that we should have a 100% normalized schema but a fair compromise. Not a single table with 300 columns tables but not 30 more tables for the data collection specialization neither. I think that a fair compromise can be done between both approaches.
It sounds as if it would be good if SampleProperties (Crystal) was somehow an optional table, i.e. we wouldn't need to populate it in order to have a SampleInstance (BLSample) with a SampleMaterial (Protein).
I agree that a column ideally shouldn't have values that mean different things for different rows in the same table.
Thanks for the links, @antolinos - I'll see if I can extract anything useful from NeXus and/or the BioSAXS diagram.
As you know, my main concern is not the storage but the (software) maintenance. I find that using the same column for different purposes (even if clarified in the comments) depending on the technique is, in general, a bad idea prone to mistakes.
This is absolutely not the intention, i agree reusing columns for different purposes is confusing, we did this briefly at the beginning to get things running (and is no longer done). Where they are reused between disciplines they still have the same meaning (hence the addition of columns, not just reuse)
re NXSample, theres not actually that many columns if you consider this split across Protein, Crystal, and BLSample, and you already put this structure into ICAT+ for datahub right?
Had a look at the NXsample page. I think changing the ISPyB model to NXsample would be very challenging. We can of course borrow some of the terminology. E.g. they have a chemical_formula
field which I think corresponds to our SampleMaterial.contents
/ Protein.sequence
? Also, the NXsample type
field - which is an enumeration - seems interesting.
I've also been staring at the BioSAXS diagram for a couple of days, but as it's very unfamiliar terrain to me (we don't use it at Diamond and I wasn't involved in the design), I can't really make any useful recommendations. I can obviously see that Macromolecule
could be mapped to Protein
/ SampleMaterial
and that maybe Specimen
could be mapped to BLSample
/ SampleInstance
, but beyond that I'm not really getting anywhere.
In order to make the database schema more universal and less MX specific, I'd like to propose the following set of changes:
Proposed solution v1.0
Rename these three tables: Protein ➞ Component Crystal ➞ BLSampleType DiffractionPlan ➞ DataCollectionPlan
... and their primary key columns: Protein.proteinId ➞ Component.componentId Crystal.crystalId ➞ BLSampleType.blSampleTypeId DiffractionPlan.diffractionPlanId ➞ DataCollectionPlan.dataCollectionPlanId
... and rename this column: Protein.sequence ➞ Component.content
... as well as all foreign key columns referencing the renamed tables: BLSample.diffractionPlanId ➞ BLSample.dataCollectionPlanId Crystal.proteinId ➞ BLSampleType.componentId Crystal.diffractionPlanId ➞ BLSampleType.dataCollectionPlanId
Changes in bold (not all columns shown):
Proposed solution v2.0
NOTE:
SampleMaterial
withMacromolecule
, andSampleInstance
withSpecimen
.Rename these tables: Protein ➞ SampleMaterial Crystal ➞ SampleProperties BLSample ➞ SampleInstance BLSampleGroup ➞ SampleInstanceGroup ComponentType ➞ SampleMaterialType ComponentSubType ➞ SampleMaterialSubType Component_has_SubType ➞ SampleMaterial_has_SubType BLSampleType_has_Component ➞ SampleProperties_has_Material BLSampleGroup_has_BLSample ➞ SampleInstanceGroup_has_Instance DiffractionPlan ➞ DataCollectionPlan
... and their primary key columns: Protein.proteinId ➞ SampleMaterial.sampleMaterialId Crystal.crystalId ➞ SampleProperties.samplePropertiesId DiffractionPlan.diffractionPlanId ➞ DataCollectionPlan.dataCollectionPlanId
... and rename this column: Protein.sequence ➞ SampleMaterial.content
... as well as renaming all foreign key columns referencing the renamed tables, but make
SampleProperties
an optional table, so we want a foreign key columnsampleMaterialId
in theSampleInstance
table, and we want to remove the foreign key fromSampleProperties
pointing toSampleMaterial
:BLSample.diffractionPlanId ➞ SampleInstance.dataCollectionPlanId Crystal.proteinId: remove SampleProperties.sampleMaterialId: new column Crystal.diffractionPlanId ➞ SampleProperties.dataCollectionPlanId
Changes in bold (not all columns shown):
Appendix: Examples of sample material types (non-exhaustive)
These values could be used to populate the
SampleMaterialType
table. (Perhaps another column could be added to indicate which "instrument type" the material type is valid for?)MX
EM
X-ray Pair Distribution Function
Powder diffraction
Small Angle Scattering (SAS)
Samples in solution
Emulsions
Liquid crystals
Biomaterials
Materials science
Surface / grazing incidence
Dynamic systems thereof (experiments)