mpc-bioinformatics / McQuaC

Transform the Quality Control workflow from Knime into a workflow in Nextflow

Changed CSV Output to HDF5 Output #85

Open Luxxii opened 4 weeks ago

Luxxii commented 4 weeks ago

This pull request changes all statistics (QC results) to be output as HDF5 instead of CSV.

Major changes:


HDF5-File Structure

An HDF5 file can be seen as a filesystem with folders and files (where the files are arrays). I kept the hierarchy flat. This means that under /total_num_ms1 you get the total number of MS1 spectra, and so on for all the column metrics. Within the HDF5 of a single file, no metrics are placed in a subfolder.

The most complex data structures we have are: (1,) for single values, (n,) for arrays of variable length, and S for files stored as base64 strings (currently the PIA and featureXML files).

Each entry in the HDF5 has metadata information. You can find unit and description there:

Example Screenshots (via hdf5view):

Single file metric (here mzml data extraction): image

(corresponding metadata) image

Complete Single File (here the hdf5 is named "filename.hdf5", containing all statistics): image

Whole QC-Results (this is forwarded to the qc-visualization) (raw_spectra_files are in subfolders): image
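
The flat layout and the per-entry metadata described above can be sketched with h5py roughly as follows. This is only an illustration under assumptions, not the code from this branch: apart from total_num_ms1, the dataset names, values, units and descriptions are made up, the attribute keys "unit" and "description" are assumed, and the file is kept in memory instead of being written to disk.

```python
import h5py
import numpy as np

# In-memory HDF5 file; "example.hdf5" is a placeholder and nothing is written to disk.
with h5py.File("example.hdf5", "w", driver="core", backing_store=False) as f:
    # Single value -> array of shape (1,)
    ms1 = f.create_dataset("total_num_ms1", data=np.array([1234]))
    ms1.attrs["unit"] = "count"  # assumed attribute key and value
    ms1.attrs["description"] = "Total number of MS1 spectra"

    # Variable-length metric -> array of shape (n,); hypothetical metric name
    rt = f.create_dataset("example_retention_times",
                          data=np.array([0.5, 1.0, 1.5, 2.0]))
    rt.attrs["unit"] = "seconds"
    rt.attrs["description"] = "Made-up retention times for illustration"

    # Flat hierarchy: every metric sits directly under the root group.
    metric_names = sorted(f.keys())
    ms1_shape = f["/total_num_ms1"].shape
    ms1_value = int(f["/total_num_ms1"][0])
    ms1_unit = f["/total_num_ms1"].attrs["unit"]
    rt_shape = f["/example_retention_times"].shape
```

Because each metric is its own dataset, a reader can look up one metric by name and get its unit from the attributes without knowing anything about the other entries.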


Issues linked to this:

I am currently running a test on multiple raw files at once and will report whether it was successful.

julianu commented 4 weeks ago

Looks nice already!

I did not test or have a deeper look yet, but from what I understand of your description, the following was discussed in the last meeting:

Luxxii commented 4 weeks ago

@julianu Addressing your points:

  1. This is currently the case. We create a final hdf5 file (ending with *.hdf5) for each raw_file/folder.

Currently these single files are not exported to the output folder (to keep the output folder clean). However, I added a comment in the workflow marking where these single files could also be exported, if desired:

https://github.com/mpc-bioinformatics/Next-QC-Flow/blob/feature_hdf5/src/io/combine_metric_hdf5.nf#L39-L53


  2. From the user's point of view, no data is compressed any more: opening an HDF5 file yields the raw data. Internally, the arrays are gzip-compressed, and you can see the compression explicitly in some lines of the code. For single values, compression was omitted. E.g. here we compress the array, which happens in the background:

https://github.com/mpc-bioinformatics/Next-QC-Flow/blob/feature_hdf5/bin/extract_data_from_mzml.py#L212-L216
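
As a rough illustration of what "compressed in the background" means, the following h5py sketch stores an array with the gzip filter and a single value without it. It is not the code from the linked lines: the dataset name example_intensities and the data are made up, and the in-memory file is only used to keep the sketch self-contained.

```python
import h5py
import numpy as np

# Made-up, highly compressible data standing in for an extracted array
intensities = np.zeros(100_000, dtype=np.float64)

with h5py.File("example.hdf5", "w", driver="core", backing_store=False) as f:
    # Array: the gzip filter is applied by the HDF5 library when writing
    arr = f.create_dataset("example_intensities", data=intensities,
                           compression="gzip")
    # Single value: stored without any compression filter
    single = f.create_dataset("total_num_ms1", data=np.array([1234]))

    arr_compression = arr.compression
    single_compression = single.compression

    # Reading returns the plain values; decompression is transparent.
    roundtrip_ok = bool(np.array_equal(arr[:], intensities))
```

The reader never calls a decompression routine; slicing the dataset is enough.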

The new HDF5 files are not noticeably larger than the CSV files; however, I did not compare them directly.

We still have some "compressed data": we use the datatype "S" (string) for the PIA and featureXML output, since we encode these files into a base64 string. However, this string can be read directly from the HDF5. We did the same in the CSV, so I simply included them here with the same logic. They are not used for qc_visualization and are also omitted from the final CSV report table.
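
The base64 step can be sketched with the Python standard library alone. The file content below is a made-up stand-in for a PIA or featureXML output, and how the branch actually reads and stores the files is not shown here.

```python
import base64

# Stand-in for the bytes of a PIA or featureXML output file (made-up content)
file_bytes = b"<?xml version='1.0'?><featureMap></featureMap>"

# Encoding yields ASCII-only bytes, which fit a fixed-length string ("S") entry.
encoded = base64.b64encode(file_bytes)

# Decoding restores the original file content byte for byte.
decoded = base64.b64decode(encoded)
```

Since base64 output contains only ASCII characters, it can be stored as a plain string entry and turned back into the original file whenever it is needed.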


  3. I am also not sure about these metrics. We do have some metrics that only make sense when plotted together with others. I tried to keep the logic out of the HDF5 as much as possible, so e.g. I avoided an array of shape (n, 3) whose columns would be (Intensity, M/Z, RT), since this would require additional knowledge about the data. E.g. for the extracted Thermo and Bruker headers, we could store one large matrix of shape (n, y) (where y is the number of extracted headers), but then we would need to know its structure. Currently these are single arrays. For these extracted headers, we mostly use the same array (retention time) and plot it against a header (e.g. temperature, LockMass, etc.).

Regarding point 3, I am fine with both, but I tried to keep the data in the HDF5 as simple as possible :)
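
To make the trade-off concrete, here is a minimal sketch of the "single arrays" variant: each extracted header is its own (n,) array, and a plot simply picks two of them by name. All dataset names and values are hypothetical, and xy is an illustrative helper, not a function from this repository.

```python
import h5py
import numpy as np


def xy(h5file, x_name, y_name):
    """Load only the two arrays needed for one plot and check that they align."""
    x = h5file[x_name][:]
    y = h5file[y_name][:]
    if x.shape != y.shape:
        raise ValueError(f"{x_name} and {y_name} differ in length")
    return x, y


with h5py.File("example.hdf5", "w", driver="core", backing_store=False) as f:
    # One (n,) array per extracted header (made-up names and values)
    f.create_dataset("example_header_retention_time", data=np.array([1.0, 2.0, 3.0]))
    f.create_dataset("example_header_temperature", data=np.array([30.1, 30.2, 30.4]))
    f.create_dataset("example_header_lockmass", data=np.array([445.12, 445.12, 445.13]))

    # Plot temperature against retention time: only these two arrays are read.
    rt, temperature = xy(f, "example_header_retention_time",
                         "example_header_temperature")
    points = list(zip(rt.tolist(), temperature.tolist()))
```

With a single (n, y) matrix, the reader would instead have to know which column holds which header, which is the extra knowledge the flat layout avoids.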

Luxxii commented 4 weeks ago

I tested this branch on 102 measured spectra from various machines. I found and fixed some minor errors in the PIA output and SpikeIn output (edge cases of an empty array and NoneType). The QC ran through. The final QC.hdf5 file is ~8 GB, which seems good enough for over 100 files.

From my side it can be merged once we have discussed the final structure inside the HDF5 for some metrics.

KarinSchork commented 2 weeks ago

" @KarinSchork maybe you could use the hdf5 directly?"

Do you mean loading only the parts of the data that are needed for each plot? I could have a look at it.

Luxxii commented 5 days ago

Reference to mzQC: https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo