Schema fixing: Consolidate DataCollection, EnergyScan and XFEFluorescenceSpectrum

KarlLevik commented 5 years ago

Big Picture

As has been highlighted in the past, there are a lot of columns in EnergyScan and XFEFluorescenceSpectrum that also exist in DataCollection. Thinking about it, EnergyScan and XFEFluorescenceSpectrum really are types of DataCollection, so should therefore perhaps be merged into DataCollection.

Additionally, there are some columns in the DataCollection table that refer to "images". I think we should rename these so the table make more sense to disciplines that don't use images.

I hereby propose to do this work. Let's keep the name DataCollection and merge the columns from the two other tables into this. We will also need a DataCollection.experimentType enum to say whether the data is from an XRF spectrum, energy scan or other type of data collection as currently enumerated in the DataCollectionGroup.experimentType enum.

Columns moving to DataCollectionFileAttachment

We can use the DataCollectionFileAttachment table to store the following columns in EnergyScan:

scanFileFullPath
jpegChoochFileFullPath
filename
choochFileFullPath

... and also these columns in XFEFluorescenceSpectrum:

jpegScanFileFullPath
annotatedPymcaXfeSpectrum
fittedDataFileFullPath
scanFileFullPath

... and these columns in DataCollection:

xtalSnapshotFullPath1
xtalSnapshotFullPath2
xtalSnapshotFullPath3
xtalSnapshotFullPath4
processedDataFile
datFullPath
bestWilsonPlotPath

(This will probably require some extra values in the fileType enum in DataCollectionFileAttachment.)

Columns moving to DataCollection

We can use the following mapping:

DC	EnergyScan	Fluorescence
dataCollectionId (PK)	energyScanId (PK)	xfeFluorescenceSpectrumId (PK)
DCG.sessionId	sessionId	sessionId
blSampleId	blSampleId	blSampleId
blSubSampleId	blSubSampleId	blSubSampleId
detectorId	fluorescenceDetector
beamSizeAtSampleX	beamSizeHorizontal	beamSizeHorizontal
beamSizeAtSampleY	beamSizeVertical	beamSizeVertical
transmission	transmissionFactor	beamTransmission
comments	comments	comments
crystalClass	crystalClass	crystalClass
	edgeEnergy
	element
	startEnergy
	endEnergy
	peakEnergy
	inflectionEnergy
		energy
	synchrotronCurrent
startTime	startTime	startTime
endTime	endTime	endTime
exposureTime	exposureTime	exposureTime
fileTemplate	filename	filename
	inflectionFDoublePrime
	inflectionFPrime
	peakFDoublePrime
	peakFPrime
averageTemperature	temperature
wavelength		wavelength
totalAbsorbedDose or totalExposedDose? Or neither because EM / different units?	xrayDose
flux	flux	flux
flux_end	flux_end	flux_end
imageDirectory	workingDirectory	workingDirectory
axisStart, axisEnd		axisPosition

Renaming image columns in DataCollection

numberOfImages ➞ numberOfDataPoints
startImageNumber ➞ startDataPointNumber
imageDirectory ➞ dataDirectory
imagePrefix ➞ filePrefix
imageSuffix ➞ fileSuffix
imageContainerSubPath ➞ dataContainerSubPath

SQL (DML + DDL)

See the work-in-progress for the SQL statements needed for this issue here - please run them in the order given:

New processing tables for energy scans and MCAs.
Changes to the DataCollections tables.
The data migration script.
The statements to drop the old tables (clean-up)

Appendix: Wide tables with many NULLs

I know that having lots of columns that we know will be NULL for many of the rows feels instinctively wrong as we could have normalised and broken the table up into a parent table with multiple child tables. (E.g. in the case of DataCollection, maybe this could be done based on experimentType.)

So I want to assure you that at least from a database storage optimisation point of view, wide tables with lots of NULLs is fine. In MariaDB and MySQL, assuming we use the default storage engine, InnoDB, and either DYNAMIC (now default in MariaDB) or COMPACT row format, the following is true:

The variable-length part of the record header contains a bit vector for indicating NULL columns. If the number of columns in the index that can be NULL is N, the bit vector occupies CEILING(N/8) bytes. (For example, if there are anywhere from 9 to 16 columns that can be NULL, the bit vector uses two bytes.) Columns that are NULL do not occupy space other than the bit in this vector.

Source: https://dev.mysql.com/doc/refman/8.0/en/innodb-row-format.html#innodb-row-format-compact

So as long as the columns are NULL-able to start with, we're not wasting any space at all by populating them with NULL values. Whereas with a normalised approach we would need an INT for the primary key of each child table + another INT for the foreign key pointing to the parent table + indexes and other overhead that comes with extra tables.

I'm happy to provide a test case as evidence in support of this, if anyone is interested. E.g. we can create two DataCollection tables in a dev database: One where a lot of columns are populated with NULL values and another where the same columns do not exist. Then we can compare the difference in size of the tables on disk, and hopefully observe that the difference is very small.

stufisher commented 5 years ago

You'll need to be able to define what each entry in DC is, so extend experimenttype enum?

KarlLevik commented 4 years ago

I've added a section above about renaming the 'image' columns in DataCollection + the necessary SQL.

stufisher commented 4 years ago

There is some decoupling to be done here too. EnergyScan for example contains parameters from a processing output (inflection, peak, etc). Really this process should create an autoprocprogram, and populate a table elsewhere with these values. Lots of our display of edgescans (and xfe) comes from file scraping at the moment

ispyb / ispyb-database-modeling