Open Luxxii opened 4 weeks ago
Looks nice already!
I did not test or have a deeper look yet, but from what I understand of your description, the following was discussed in the last meeting:
@julianu Addressing your points:
(`*.hdf5`) for each raw_file/folder. Currently, these single files are not exported to the output folder (to keep the output folder clean). However, I added a comment where we could also output these single files in the workflow (if desired):
The new hdf5 files are not noticeably larger than the CSV files; however, I did not compare them directly.
We still have some "compressed data": the PIA and featureXML outputs use the datatype "S" (--> string), since we encode these files into a base64 string. However, this string can be read directly from the hdf5. We did the same in the CSV, so I simply included them here with the same logic. They are not used for qc_visualization and are also omitted from the final CSV report table.
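A minimal sketch of how such a base64 payload could be written to and read back from an HDF5 file, assuming h5py and NumPy. The function names and the dataset name are hypothetical placeholders, not the ones used in this branch.

```python
import base64
import os
import tempfile

import h5py
import numpy as np


def store_file_as_base64(h5_path, dataset_name, payload):
    """Write raw file bytes as a fixed-length base64 byte string (dtype "S")."""
    encoded = base64.b64encode(payload)
    with h5py.File(h5_path, "w") as h5:
        # np.bytes_ yields a scalar dataset with a fixed-length "S" dtype.
        h5.create_dataset(dataset_name, data=np.bytes_(encoded))


def load_file_from_base64(h5_path, dataset_name):
    """Read the base64 string back and decode it to the original bytes."""
    with h5py.File(h5_path, "r") as h5:
        encoded = h5[dataset_name][()]  # bytes-like scalar
    return base64.b64decode(encoded)


def roundtrip_demo(payload=b"<featureMap></featureMap>"):
    """Hypothetical demo: store a small payload and return what is read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.hdf5")
        store_file_as_base64(path, "feature_xml", payload)
        return load_file_from_base64(path, "feature_xml")
```

Because base64 output is plain ASCII, it fits a fixed-length byte string without any extra escaping.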
`(n, 3)`, where the dimensions would be `(Intensity, M/Z, RT)`. This would require additional knowledge about the data. E.g., for the extracted Thermo and Bruker headers, we could use one large matrix `(n, y)` (where y is the number of extracted headers), but this would mean that we need to know its structure. Currently, these are single arrays. In these extracted headers, we mostly use the same array (Retention-Time) and plot it against a header (e.g. Temperature, LockMass, etc.).
Regarding Point 3, I am fine with both, but tried to keep the data as simple as possible in the hdf5 :)
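To make the trade-off concrete, here is a small NumPy sketch of both layouts. All values and header names are made up for illustration; the point is only that the matrix variant needs the column order as extra knowledge.

```python
import numpy as np

# Hypothetical extracted header values for n = 4 spectra.
separate = {
    "retention_time": np.array([0.5, 1.0, 1.5, 2.0]),
    "temperature": np.array([36.1, 36.4, 36.2, 36.8]),
    "lock_mass": np.array([445.12, 445.12, 445.13, 445.12]),
}

# Layout A (current): one named (n,) array per header.
rt_a = separate["retention_time"]
temp_a = separate["temperature"]

# Layout B (alternative): one (n, y) matrix, y = number of headers.
# The column order has to be stored or known separately.
columns = ["retention_time", "temperature", "lock_mass"]
matrix = np.column_stack([separate[name] for name in columns])
rt_b = matrix[:, columns.index("retention_time")]
temp_b = matrix[:, columns.index("temperature")]
```

Both layouts give the same pair of arrays for a plot, but layout B only works as long as `columns` stays in sync with the matrix.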
Tested this branch on 102 measured spectra from various machines. I found and fixed some minor errors in the PIA output and SpikeIn output (edge cases of an empty array and NoneType). The QC ran through. The final QC.hdf5 file is ~8 GB, which seems good enough for over 100 files.
From my side it can be merged once we have discussed the final structure inside the HDF5 for some metrics.
"@KarinSchork maybe you could use the hdf5 directly?"
Do you mean only loading the parts of the data that are needed for each plot? I could have a look at it.
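For illustration, loading only what a plot needs could look roughly like the sketch below. It assumes h5py and the `/<raw_spectra_file>/<metric>` layout from the description; the function names, the group `run_01` and the metric `retention_time` are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def load_metrics(h5_path, raw_file, metric_names):
    """Read only the requested datasets of one raw file into memory."""
    with h5py.File(h5_path, "r") as h5:
        group = h5[raw_file]
        # Each [()] reads exactly one dataset; everything else stays on disk.
        return {name: group[name][()] for name in metric_names}


def build_demo_file(h5_path):
    """Hypothetical demo data: one raw file with three metrics."""
    with h5py.File(h5_path, "w") as h5:
        group = h5.create_group("run_01")
        group.create_dataset("total_num_ms1", data=np.array([1200]))
        group.create_dataset("retention_time", data=np.arange(5, dtype=float))
        group.create_dataset("not_needed_here", data=np.zeros(1000))
```

A plot function would then call `load_metrics(path, "run_01", ["retention_time"])` instead of converting the whole file to a dataframe first.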
Reference to mzQC: https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo
This PR changes all statistics (QC results) to be written as hdf5 (instead of csv).
Major changes:
- The `merge_csv` step was replaced by `merge_hdf5`, which does the same: it collects all single statistics (like mzml-statistics, spikein-statistics) and merges them into one file.
- `publishDir` here, since it could be interesting to look into the hdf5 files ("RAW data"). We could discuss whether we want to output hdf5s for each individual spectra file or only a single hdf5 containing all results.
- `QC_visualization` currently only loads the hdf5 and converts it to a pandas dataframe. This is very inefficient and loads everything into RAM. The visualization is mostly untouched. @KarinSchork maybe you could use the hdf5 directly?
- `filename` is omitted; instead we use the real filename of the HDF5, which is `<raw_spectra_file>.hdf5`, or, in the complete qc-hdf5, `/<raw_spectra_file>/<metric>`
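The merge step described above can be sketched as follows. This is not the code from this branch, only an illustration assuming h5py; the function name `merge_hdf5_sketch` and the demo metric are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def merge_hdf5_sketch(single_files, merged_path):
    """Copy every dataset of each single file to /<raw_spectra_file>/<metric>."""
    with h5py.File(merged_path, "w") as merged:
        for path in single_files:
            # The file name (without extension) takes the role of `filename`.
            group_name = os.path.splitext(os.path.basename(path))[0]
            group = merged.create_group(group_name)
            with h5py.File(path, "r") as single:
                for metric in single:
                    # Group.copy also carries over the dataset attributes.
                    single.copy(metric, group)


def write_demo_single(path, value):
    """Hypothetical single-file result with one metric and its metadata."""
    with h5py.File(path, "w") as h5:
        ds = h5.create_dataset("total_num_ms1", data=np.array([value]))
        ds.attrs["unit"] = "none"
        ds.attrs["description"] = "Total number of MS1"
```

Using the file name as the group name keeps the single files flat while still separating raw files in the merged result.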
HDF5-File Structure
The HDF5 can be seen as a filesystem, with folders and files (--> arrays). I kept the hierarchy flat. This means that under `/total_num_ms1` you will get the total number of MS1 (and so on for all the column metrics). NO metrics are added under a subfolder for a single file.
The most complex data structures we have are:
`(1,)` (for single values), `(n,)` for arrays of variable length and `S` for files in base64 (currently PIA and FeatureXML files).
Each entry in the HDF5 has metadata information. You can find `unit` and `description` there:
- `unit`: If applicable, a real unit of the data (like Hz, seconds/minutes or m/z). Otherwise it can contain free text or the string `none`.
- `description`: This contains a small description of this datapoint. We can write a small documentation for each metric/statistic in Python, which keeps everything in one place. If a column was described in a file, I used this description. In others, like the THERMO or BRUKER header extraction, the column name itself was used.
Example Screenshots (via hdf5view):
Single file metric (here mzml data extraction):
(corresponding metadata)
Complete Single File (here the hdf5 is named "filename.hdf5", containing all statistics):
Whole QC-Results (this is forwarded to the qc-visualization) (raw_spectra_files are in subfolders):
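The same structure can also be inspected from Python. The sketch below assumes h5py and NumPy and follows the layout from "HDF5-File Structure" (flat datasets with `unit` and `description` attributes); the helper names, the `retention_time` metric and the description texts are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def write_metric(h5, name, values, unit, description):
    """Store one metric as a flat dataset with unit/description attributes."""
    # np.atleast_1d turns a single value into shape (1,), arrays stay (n,).
    ds = h5.create_dataset(name, data=np.atleast_1d(values))
    ds.attrs["unit"] = unit
    ds.attrs["description"] = description
    return ds


def describe(h5_path):
    """Return {dataset path: (shape, unit, description)} for every dataset."""
    summary = {}

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            summary["/" + name] = (
                obj.shape,
                obj.attrs.get("unit", "none"),
                obj.attrs.get("description", ""),
            )

    with h5py.File(h5_path, "r") as h5:
        h5.visititems(visit)
    return summary
```

Calling `describe` on the merged file lists every `/<raw_spectra_file>/<metric>` entry together with its metadata, similar to what hdf5view shows.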
Issues linked to this:
I am currently running a test on multiple raw files at once and will report whether it was successful.