ispyb / ispyb-database-modeling

4 stars 3 forks source link

Schema fixing: Rename Protein, Crystal and DiffractionPlan #42

Open KarlLevik opened 5 years ago

KarlLevik commented 5 years ago

In order to make the database schema more universal and less MX specific, I'd like to propose the following set of changes:

Proposed solution v1.0

Rename these three tables: Protein ➞ Component Crystal ➞ BLSampleType DiffractionPlan ➞ DataCollectionPlan

... and their primary key columns: Protein.proteinId ➞ Component.componentId Crystal.crystalId ➞ BLSampleType.blSampleTypeId DiffractionPlan.diffractionPlanId ➞ DataCollectionPlan.dataCollectionPlanId

... and rename this column: Protein.sequence ➞ Component.content

... as well as all foreign key columns referencing the renamed tables: BLSample.diffractionPlanId ➞ BLSample.dataCollectionPlanId Crystal.proteinId ➞ BLSampleType.componentId Crystal.diffractionPlanId ➞ BLSampleType.dataCollectionPlanId

Changes in bold (not all columns shown):

protein-crystal-diffractionplan

Proposed solution v2.0

NOTE:

Rename these tables: Protein ➞ SampleMaterial Crystal ➞ SampleProperties BLSample ➞ SampleInstance BLSampleGroup ➞ SampleInstanceGroup ComponentType ➞ SampleMaterialType ComponentSubType ➞ SampleMaterialSubType Component_has_SubType ➞ SampleMaterial_has_SubType BLSampleType_has_Component ➞ SampleProperties_has_Material BLSampleGroup_has_BLSample ➞ SampleInstanceGroup_has_Instance DiffractionPlan ➞ DataCollectionPlan

... and their primary key columns: Protein.proteinId ➞ SampleMaterial.sampleMaterialId Crystal.crystalId ➞ SampleProperties.samplePropertiesId DiffractionPlan.diffractionPlanId ➞ DataCollectionPlan.dataCollectionPlanId

... and rename this column: Protein.sequence ➞ SampleMaterial.content

... as well as renaming all foreign key columns referencing the renamed tables, but make SampleProperties an optional table, so we want a foreign key column sampleMaterialId in the SampleInstance table, and we want to remove the foreign key from SampleProperties pointing to SampleMaterial:

BLSample.diffractionPlanId ➞ SampleInstance.dataCollectionPlanId Crystal.proteinId: remove SampleProperties.sampleMaterialId: new column Crystal.diffractionPlanId ➞ SampleProperties.dataCollectionPlanId

Changes in bold (not all columns shown):

SampleMaterial-Properties-Instance

Appendix: Examples of sample material types (non-exhaustive)

These values could be used to populate the SampleMaterialType table. (Perhaps another column could be added to indicate which "instrument type" the material type is valid for?)

MX

EM

X-ray Pair Distribution Function

Powder diffraction

Small Angle Scattering (SAS)

KarlLevik commented 5 years ago

At the recent ISPyB meeting in Lund, the steering committee decided the main focus should be MX, EM and BioSAXS.

In light of this, I believe the changes proposed above still make a lot of sense.

Protein: Protein crystallography is just one type of MX. ComponentType allows the Component to be protein, DNA, RNA, small molecule and anything else appropriate we can think of within MX (viruses, ribosomes?). Knowing which kind of sample you're dealing with is of course crucial information to the processing pipelines. EM looks mostly at proteins and viruses, BioSAXS mostly proteins.

ComponentSubType and Component_has_SubType allow a Component to have e.g. salt, buffer and precipitant, which is useful in BioSAXS.

Crystal: EM and BioSAXS don't use crystals, so a more universal name would make sense.

DiffractionPlan: Again, "diffraction" doesn't make sense for EM and BioSAXS.

KarlLevik commented 5 years ago

Briefly discussed at VC meeting today. It was agreed that ESRF would come up with a proposal that would potentially incorporate/merge the Macromolecule table currently only used for BioSAXS. 'Macromoluecule' may be a better term than 'component' anyway? Opinions appreciated.

It was also agreed that we could use a separate extra face-to-face meeting to address the issues in the current database schema, aiming to maybe address technical debt, incorporating more universal terms etcetera. (Soleil offered to host the meeting, potentially in September. Potential dates will be circulated.)

stufisher commented 5 years ago

It would be worth looking carefully into the BioSAXS table structure, there is much duplication of core tables, and i think a lot could be merged (or make use of existing tables). Does anyone have the biosaxs structure diagram to hand?

stufisher commented 5 years ago

Macromolecule is making assumptions, component is more generic. If we are going to fix things we should not tie ourselves to specific disciplines. Powder diffraction folk dont refer to their samples as macromolecules, and say none biosaxs saxs users wanted to register their samples? I realise the focus is EM/BiosSAXS/MX, but lets not tie ourselves to those things if we can.

KarlLevik commented 5 years ago

Maybe better names would be: Protein ➞ SampleMaterial Crystal ➞ SampleProperties BLSample ➞ SampleInstance

I think these better represent the data with which we'll be populating these tables, and should also be fairly easy to understand.

stufisher commented 5 years ago

agreed, those sound better

antolinos commented 5 years ago

Hi @stufisher BioSAXS schema can be found here: https://github.com/ispyb/ISPyB/blob/master/documentation/database/ISPyB_DataModel_5.11.1.png

I am still not convinced, even if the Crystal table is renamed to SampleProperties, the columns in the table still represent a crystal: space group, unit cell, etc... and it will never be a generic table for properties of a generic sample at least:

Besides, I wonder if SampleProperties should be in between sampleMaterial and sampleInstance. Either it defines properties of the sample or of the material but not sure if it should define both at the same time...

Just my opinion!

Question for you: where do you store the properties of a sample for powder diffraction today?

stufisher commented 5 years ago

We added a single field theorecticalDensity to crystal for powder diffraction. Ironically we mostly bypass crystal for mx (every sample gets a new crystal = simple workflow, crystal is a huge assumption for mx, you dont know what the crystal form is until you collect the data), and it actually much heavily used in powder diffraction where they do something like:

define a phase (protein) define a sample made of multiple phases (crystal, with one or more proteins) define an instance of a sample (blsample of a crystal)

We have to have a think about what "generic" means, how many fields do we envisage to describe a sample/material. To get an understanding of this we have to know about what other types of materials are studied (which @KarlLevik is looking in to), and which disciplines benefit from something like ispyb.

As discussed in https://github.com/ispyb/ispyb-database-modeling/issues/45 tables with many columns are not bad in terms of storage, and in my opinion technically simpler than joining many extra tables (which you will have to do anyway, and outer join).

antolinos commented 5 years ago

We added a single field theorecticalDensity to crystal for powder diffraction. Ironically we mostly bypass crystal for mx (every sample gets a new crystal = simple workflow, crystal is a huge assumption for mx, you dont know what the crystal form is until you collect the data),

This is the reason why I said if SampleProperties (crystal) should be in between

and it actually much heavily used in powder diffraction where they do something like:

define a phase (protein) define a sample made of multiple phases (crystal, with one or more proteins) define an instance of a sample (blsample of a crystal)

We have to have a think about what "generic" means, how many fields do we envisage to describe a sample/material. To get an understanding of this we have to know about what other types of materials are studied (which @KarlLevik is looking in to), and which disciplines benefit from something like ispyb.

@KarlLevik might want to have a look at this if not done yet: http://download.nexusformat.org/doc/html/classes/base_classes/NXsample.html?highlight=nxsample

As discussed in #45 tables with many columns are not bad in terms of storage, and in my opinion technically simpler than joining many extra tables (which you will have to do anyway, and outer join).

As you know, my main concern is not the storage but the (software) maintenance. I find that using the same column for different purposes (even if clarified in the comments) depending on the technique is, in general, a bad idea prone to mistakes. Having many columns in a table, from the performance point of view, might not be a big deal but it might cause confusion to undestand what is stored on it and then can lead to errors. I am not saying that we should have a 100% normalized schema but a fair compromise. Not a single table with 300 columns tables but not 30 more tables for the data collection specialization neither. I think that a fair compromise can be done between both approaches.

KarlLevik commented 5 years ago

It sounds as if it would be good if SampleProperties (Crystal) was somehow an optional table, i.e. we wouldn't need to populate it in order to have a SampleInstance (BLSample) with a SampleMaterial (Protein).

I agree that a column ideally shouldn't have values that mean different things for different rows in the same table.

Thanks for the links, @antolinos - I'll see if I can extract anything useful from NeXus and/or the BioSAXS diagram.

stufisher commented 5 years ago

As you know, my main concern is not the storage but the (software) maintenance. I find that using the same column for different purposes (even if clarified in the comments) depending on the technique is, in general, a bad idea prone to mistakes. This is absolutely not the intention, i agree reusing columns for different purposes is confusing, we did this briefly at the beginning to get things running (and is no longer done). Where they are reused between disciplines they still have the same meaning (hence the addition of columns, not just reuse)

re NXSample, theres not actually that many columns if you consider this split across Protein, Crystal, and BLSample, and you already put this structure into ICAT+ for datahub right?

KarlLevik commented 5 years ago

Had a look at the NXsample page. I think changing the ISPyB model to NXsample would be very challenging. We can of course borrow some of the terminology. E.g. they have a chemical_formula field which I think corresponds to our SampleMaterial.contents / Protein.sequence? Also, the NXsample type field - which is an enumeration - seems interesting.

KarlLevik commented 5 years ago

I've also been staring at the BioSAXS diagram for a couple of days, but as it's very unfamiliar terrain to me (we don't use it at Diamond and I wasn't involved in the design), I can't really make any useful recommendations. I can obviously see that Macromolecule could be mapped to Protein / SampleMaterial and that maybe Specimen could be mapped to BLSample / SampleInstance, but beyond that I'm not really getting anywhere.