ispyb / ispyb-database-modeling

4 stars 3 forks source link

Schema fixing: Consolidate DataCollection, EnergyScan and XFEFluorescenceSpectrum #45

Open KarlLevik opened 5 years ago

KarlLevik commented 5 years ago

Big Picture

As has been highlighted in the past, there are a lot of columns in EnergyScan and XFEFluorescenceSpectrum that also exist in DataCollection. Thinking about it, EnergyScan and XFEFluorescenceSpectrum really are types of DataCollection, so should therefore perhaps be merged into DataCollection.

Additionally, there are some columns in the DataCollection table that refer to "images". I think we should rename these so the table make more sense to disciplines that don't use images.

I hereby propose to do this work. Let's keep the name DataCollection and merge the columns from the two other tables into this. We will also need a DataCollection.experimentType enum to say whether the data is from an XRF spectrum, energy scan or other type of data collection as currently enumerated in the DataCollectionGroup.experimentType enum.

Columns moving to DataCollectionFileAttachment

We can use the DataCollectionFileAttachment table to store the following columns in EnergyScan:

... and also these columns in XFEFluorescenceSpectrum:

... and these columns in DataCollection:

(This will probably require some extra values in the fileType enum in DataCollectionFileAttachment.)

Columns moving to DataCollection

We can use the following mapping:

DC EnergyScan Fluorescence
dataCollectionId (PK) energyScanId (PK) xfeFluorescenceSpectrumId (PK)
DCG.sessionId sessionId sessionId
blSampleId blSampleId blSampleId
blSubSampleId blSubSampleId blSubSampleId
detectorId fluorescenceDetector
beamSizeAtSampleX beamSizeHorizontal beamSizeHorizontal
beamSizeAtSampleY beamSizeVertical beamSizeVertical
transmission transmissionFactor beamTransmission
comments comments comments
crystalClass crystalClass crystalClass
edgeEnergy
element
startEnergy
endEnergy
peakEnergy
inflectionEnergy
energy
synchrotronCurrent
startTime startTime startTime
endTime endTime endTime
exposureTime exposureTime exposureTime
fileTemplate filename filename
  inflectionFDoublePrime  
  inflectionFPrime  
  peakFDoublePrime  
  peakFPrime  
averageTemperature temperature  
wavelength   wavelength
totalAbsorbedDose or totalExposedDose? Or neither because EM / different units? xrayDose  
flux flux flux
flux_end flux_end flux_end
imageDirectory workingDirectory workingDirectory
axisStart, axisEnd axisPosition

Renaming image columns in DataCollection

SQL (DML + DDL)

See the work-in-progress for the SQL statements needed for this issue here - please run them in the order given:

Appendix: Wide tables with many NULLs

I know that having lots of columns that we know will be NULL for many of the rows feels instinctively wrong as we could have normalised and broken the table up into a parent table with multiple child tables. (E.g. in the case of DataCollection, maybe this could be done based on experimentType.)

So I want to assure you that at least from a database storage optimisation point of view, wide tables with lots of NULLs is fine. In MariaDB and MySQL, assuming we use the default storage engine, InnoDB, and either DYNAMIC (now default in MariaDB) or COMPACT row format, the following is true:

The variable-length part of the record header contains a bit vector for indicating NULL columns. If the number of columns in the index that can be NULL is N, the bit vector occupies CEILING(N/8) bytes. (For example, if there are anywhere from 9 to 16 columns that can be NULL, the bit vector uses two bytes.) Columns that are NULL do not occupy space other than the bit in this vector.

Source: https://dev.mysql.com/doc/refman/8.0/en/innodb-row-format.html#innodb-row-format-compact

So as long as the columns are NULL-able to start with, we're not wasting any space at all by populating them with NULL values. Whereas with a normalised approach we would need an INT for the primary key of each child table + another INT for the foreign key pointing to the parent table + indexes and other overhead that comes with extra tables.

I'm happy to provide a test case as evidence in support of this, if anyone is interested. E.g. we can create two DataCollection tables in a dev database: One where a lot of columns are populated with NULL values and another where the same columns do not exist. Then we can compare the difference in size of the tables on disk, and hopefully observe that the difference is very small.

stufisher commented 5 years ago

You'll need to be able to define what each entry in DC is, so extend experimenttype enum?

KarlLevik commented 4 years ago

I've added a section above about renaming the 'image' columns in DataCollection + the necessary SQL.

stufisher commented 4 years ago

There is some decoupling to be done here too. EnergyScan for example contains parameters from a processing output (inflection, peak, etc). Really this process should create an autoprocprogram, and populate a table elsewhere with these values. Lots of our display of edgescans (and xfe) comes from file scraping at the moment