Changed CSV Output to HDF5 Output

Luxxii commented 4 weeks ago

This PR changes all statistics (qc-results) to be written as HDF5 instead of CSV.

Major changes:

- CSV is no longer used to collect the qc-results; HDF5 is used instead.
- The merge_csv step was replaced by merge_hdf5, which does the same thing: it collects all single statistics (like mzml-statistics, spikein-statistics) and merges them into one file.
  - I also added publishDir here, since it could be interesting to look into the HDF5 files ("RAW data"). We could discuss whether we want to output an HDF5 for each individual spectra file or only a single HDF5 containing all results.
- The QC_visualization currently only loads the HDF5 and converts it to a pandas dataframe. This is very inefficient and loads everything into RAM. The visualization itself is mostly untouched. @KarinSchork maybe you could use the hdf5 directly?
- filename is omitted; instead we use the real filename of the HDF5, which is <raw_spectra_file>.hdf5, or, in the complete QC HDF5, the path /<raw_spectra_file>/<metric>.
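To illustrate the merge step and the path layout described above, here is a minimal sketch using h5py. It is not the code of the actual merge_hdf5 step: the function body, the file names `run_a`/`run_b` and the demo values are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np


def merge_hdf5(single_files, merged_path):
    """Copy every per-file HDF5 into its own group of one merged file."""
    with h5py.File(merged_path, "w") as merged:
        for single in map(Path, single_files):
            # Group name = file name without ".hdf5" (takes the role of the old filename column)
            group = merged.create_group(single.stem)
            with h5py.File(single, "r") as src:
                for name in src:
                    # Group.copy also copies the attributes of each dataset
                    src.copy(src[name], group, name=name)


# Demo with two made-up per-file results
tmp = Path(tempfile.mkdtemp())
for stem, value in [("run_a", 1200), ("run_b", 950)]:
    with h5py.File(tmp / f"{stem}.hdf5", "w") as f:
        ds = f.create_dataset("total_num_ms1", data=np.array([value]))
        ds.attrs["unit"] = "none"

merge_hdf5(sorted(tmp.glob("run_*.hdf5")), tmp / "qc.hdf5")

with h5py.File(tmp / "qc.hdf5", "r") as qc:
    merged_groups = sorted(qc.keys())
    run_a_ms1 = int(qc["/run_a/total_num_ms1"][0])
    run_a_unit = qc["/run_a/total_num_ms1"].attrs["unit"]
```

With this layout, every metric of a raw file is reachable as `/<raw_spectra_file>/<metric>` in the merged file, and the per-file files stay flat.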

HDF5-File Structure

The HDF5 can be seen as a filesystem, with folders and files (--> arrays). I kept the hierarchy flat. This means under /total_num_ms1 you will get the Total number of MS1 (and so on for all the column-metrics). NO metrics are added under a subfolder for a single file.

The most complex data structures we have are (1,) for single values, (n,) for arrays of variable length, and S for files encoded as base64 strings (currently PIA and FeatureXML files).

Each entry in the HDF5 has metadata information. You can find unit and description there:

unit: If applicable, a real unit of the data (like Hz, seconds/minutes or m/z). Otherwise it can contain free text or the string none.
description: This contains a short description of this datapoint. We can write a short documentation for each metric/statistic in Python, which keeps everything in one place. If a column was already described in a file, I used that description. In other cases, like the THERMO or BRUKER header extraction, the column name itself was used.
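The flat structure and the two metadata fields can be sketched with h5py as follows. The helper `add_metric`, the file name and the description texts are hypothetical; only the attribute names `unit` and `description` and the shapes (1,) and (n,) come from the description above.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np


def add_metric(h5, name, data, unit="none", description=""):
    """Store one metric as a flat top-level dataset plus its metadata."""
    values = np.atleast_1d(data)  # single values become shape (1,)
    ds = h5.create_dataset(name, data=values)
    ds.attrs["unit"] = unit
    ds.attrs["description"] = description
    return ds


path = Path(tempfile.mkdtemp()) / "example_run.hdf5"
with h5py.File(path, "w") as h5:
    add_metric(h5, "total_num_ms1", 1200,
               description="Total number of MS1 spectra")
    add_metric(h5, "ms1_rt_array", [0.5, 1.0, 1.5], unit="seconds",
               description="Retention times of the MS1 spectra")

with h5py.File(path, "r") as h5:
    shapes = {name: h5[name].shape for name in h5}
    rt_unit = h5["ms1_rt_array"].attrs["unit"]
    rt_description = h5["ms1_rt_array"].attrs["description"]
```

Keeping the documentation string next to the call that writes the metric is one way to have the description live in a single place.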

Example Screenshots (via hdf5view):

Single file metric (here mzml data extraction):

(corresponding metadata)

Complete Single File (here the hdf5 is named "filename.hdf5", containing all statistics):

Whole QC-Results (this is forwarded to the qc-visualization) (raw_spectra_files are in subfolders):

Issues linked to this:

Fixes #83
Fixes #82 (also done here, as minor refactoring)

I am currently running a test on multiple raw files at once and will report whether it was successful.

julianu commented 4 weeks ago

Looks nice already!

I did not test or have a deeper look yet, but from what I understand of your description, the following was discussed in the last meeting:

1. We will create only one hdf5 file per RAW file/folder, which makes visualizations / statistical stuff more easily exchangeable.
2. All data in the hdf5 should NOT be compressed/encoded any more. This means: there should not be any base64-encoded stuff; the PIA or other feature files should go directly into the hdf5 file. Nor should any files be zipped (as hdf5 itself uses compression).
3. I am unsure about this point, but IMHO: I would directly save arrays that belong together into one datatype, if possible. E.g. not saving "ms1_rt_array" and "ms1_tic_array" separately, but putting them into one "ms1_tic_table" with the rows "rt" and "tic".
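One possible reading of the third point is a compound (structured) datatype, sketched below with numpy and h5py. The dataset name "ms1_tic_table" and the field names "rt" and "tic" follow the suggestion; the values and the file name are made up, and nothing here reflects what the branch currently does.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

rt = np.array([0.5, 1.0, 1.5])
tic = np.array([1.0e6, 2.5e6, 1.8e6])

# One record per spectrum, with the two values that belong together
table = np.empty(len(rt), dtype=[("rt", "f8"), ("tic", "f8")])
table["rt"] = rt
table["tic"] = tic

path = Path(tempfile.mkdtemp()) / "example_run.hdf5"
with h5py.File(path, "w") as h5:
    h5.create_dataset("ms1_tic_table", data=table, compression="gzip")

with h5py.File(path, "r") as h5:
    loaded = h5["ms1_tic_table"][()]        # whole table as a structured array
    tic_only = h5["ms1_tic_table"]["tic"]   # read a single field by name
```

The trade-off is that a reader has to know the field names of the table, whereas two separate arrays need no knowledge beyond the dataset names.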

Luxxii commented 4 weeks ago

@julianu Addressing your points:

This is currently the case. We create a final hdf5 file (ending with *.hdf5) for each raw_file/folder.

Currently these single files are not exported to the output folder (to keep the output folder clean). However, I added a comment where we could also output these single files in the workflow (if desired):

https://github.com/mpc-bioinformatics/Next-QC-Flow/blob/feature_hdf5/src/io/combine_metric_hdf5.nf#L39-L53

No data is compressed explicitly any more: opening an HDF5 yields the raw data. Internally, the arrays are gzip-compressed by HDF5. You can see the compression explicitly in some lines. For single values, compression was omitted. E.g. here we compress the array, which happens in the background:

https://github.com/mpc-bioinformatics/Next-QC-Flow/blob/feature_hdf5/bin/extract_data_from_mzml.py#L212-L216
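As a rough illustration of what "happens in the background" means (this is not the code behind the link above), the sketch below writes one gzip-compressed array and one uncompressed single value with h5py and reads the array back unchanged. The dataset names and the dummy data are assumptions.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

intensities = np.zeros(100_000)  # highly compressible dummy data

path = Path(tempfile.mkdtemp()) / "example_run.hdf5"
with h5py.File(path, "w") as h5:
    # Arrays: the gzip filter is applied by HDF5 when writing
    h5.create_dataset("ms1_tic_array", data=intensities, compression="gzip")
    # Single values: stored without compression
    h5.create_dataset("total_num_ms1", data=np.array([1200]))

with h5py.File(path, "r") as h5:
    ds = h5["ms1_tic_array"]
    used_filter = ds.compression                 # filter name stored with the dataset
    single_filter = h5["total_num_ms1"].compression
    roundtrip_ok = bool(np.array_equal(ds[:], intensities))  # decompressed on read
    stored_bytes = ds.id.get_storage_size()      # bytes used on disk
```

The reader never calls a decompression routine; indexing the dataset already returns the raw values.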

The new hdf5 files are not really larger than the csv files; however, I did not compare them directly.

We still have some "compressed data". We have the datatype "S" (--> String) for the PIA and featureXML output, since we encode these files into a base64 string. However, this string can be read directly from the hdf5. In CSV we did the same, so I simply included them here with the same logic. They are not used for qc_visualization and are also omitted in the final csv report table.
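To make the two options under discussion concrete, the sketch below stores the same payload once as a base64 string (datatype "S", as described above) and once as raw bytes in a uint8 array. The uint8 variant is only one conceivable way to put a file "directly" into the hdf5; it is an assumption, not something the branch implements. The dataset names and the payload are made up.

```python
import base64
import tempfile
from pathlib import Path

import h5py
import numpy as np

payload = b"<featureMap><feature id='f1'/></featureMap>"  # stand-in for a featureXML file

path = Path(tempfile.mkdtemp()) / "example_run.hdf5"
with h5py.File(path, "w") as h5:
    # Variant 1: base64 text in a fixed-length string ("S") dataset
    h5.create_dataset("feature_xml_b64", data=np.bytes_(base64.b64encode(payload)))
    # Variant 2 (assumption): raw bytes as uint8, compressed by HDF5 itself
    h5.create_dataset(
        "feature_xml_raw",
        data=np.frombuffer(payload, dtype=np.uint8),
        compression="gzip",
    )

with h5py.File(path, "r") as h5:
    b64_kind = h5["feature_xml_b64"].dtype.kind   # "S" for fixed-length bytes
    from_b64 = base64.b64decode(h5["feature_xml_b64"][()])
    from_raw = h5["feature_xml_raw"][:].tobytes()
```

Both variants return the original file content; the second one skips the decoding step on the reader's side.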

I am also not sure about these metrics. We do have some metrics which only make sense when plotted together with others. I tried to keep the logic out of the hdf5 as much as possible, so e.g. I avoided an array with shape (n, 3), where the dimensions would be (Intensity, M/Z, RT). This would require additional knowledge about the data. E.g. for the extracted Thermo and Bruker headers, we could build one large matrix (n, y) (where y is the number of extracted headers), but this would mean that we need to know its structure. Currently these are single arrays. In these extracted headers, we mostly use the same array (Retention-Time) and plot it against a header (e.g. Temperature, LockMass, etc.).

Regarding Point 3, I am fine with both, but tried to keep the data as simple as possible in the hdf5 :)

Luxxii commented 4 weeks ago

Tested this branch on 102 measured spectra from various machines. I found and fixed some minor errors in the PIA output and SpikeIn output (edge cases of an empty array and NoneType). The QC ran through. The final QC.hdf5 file is ~8 GB, which seems good enough for over 100 files.

From my side it can be merged, after we discussed the final structure inside the HDF5 for some metrics.

KarinSchork commented 2 weeks ago

" @KarinSchork maybe you could use the hdf5 directly?"

Do you mean only loading the parts of the data that are needed for each plot? I could have a look at it.
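A sketch of what loading only the needed parts could look like with h5py and pandas. It assumes the merged layout `/<raw_spectra_file>/<metric>` described earlier in this thread; the helper names, file names and demo values are hypothetical and not taken from the visualization code.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np
import pandas as pd


def load_single_values(qc_path, metrics):
    """DataFrame with one row per raw file, reading only the requested metrics."""
    rows = {}
    with h5py.File(qc_path, "r") as qc:
        for raw_file, group in qc.items():
            # Only these datasets are read from disk, not the whole file
            rows[raw_file] = {m: group[m][0] for m in metrics if m in group}
    return pd.DataFrame.from_dict(rows, orient="index")


def load_array_slice(qc_path, raw_file, metric, start, stop):
    """Read part of one array; h5py fetches just the requested range."""
    with h5py.File(qc_path, "r") as qc:
        return qc[f"/{raw_file}/{metric}"][start:stop]


# Demo file with two made-up raw files
qc_path = Path(tempfile.mkdtemp()) / "qc.hdf5"
with h5py.File(qc_path, "w") as qc:
    for raw_file, n_ms1 in [("run_a", 1200), ("run_b", 950)]:
        group = qc.create_group(raw_file)
        group.create_dataset("total_num_ms1", data=np.array([n_ms1]))
        group.create_dataset("ms1_tic_array", data=np.arange(n_ms1, dtype="f8"))

summary = load_single_values(qc_path, ["total_num_ms1"])
first_tics = load_array_slice(qc_path, "run_a", "ms1_tic_array", 0, 5)
```

Each plot would then request just the metrics it draws, so the full ~8 GB file never has to sit in RAM at once.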

Luxxii commented 5 days ago

Reference to mzQC: https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo

mpc-bioinformatics / McQuaC

Changed CSV Output to HDF5 Output #85