mpc-bioinformatics / McQuaC

Transform the Quality Control workflow from Knime into a workflow in Nextflow

Change CSV Outputs to a more suitable format (HDF5) #83

Open Luxxii opened 3 months ago

Luxxii commented 3 months ago

We discussed the problem that our CSV files are too complex and not human-readable. We could therefore switch entirely to a different format. We settled on HDF5!

di-hardt commented 4 weeks ago

From our discussion about mzQC compliance

Base_Peak_Intensity_Max --> No DA LOCAL:01 (single value: intensity of the highest peak in the MS1 (/MS2?) map)
Base_Peak_Intensity_Max_Up_To_105 --> No DA LOCAL:02 (single value: intensity of the highest peak in the MS1 (/MS2?) map up to a given retention time)
MS1_TIC_Change_Q... --> MS:40000057 --> note that these are fixed-size in the OBO --> create an issue for variable length, as for MS:40000061
MS1-TIC_Q... --> MS:40000058 --> note that these are fixed-size in the OBO --> create an issue for variable length, as for MS:40000061
MS1_Density_Q... --> MS:40000061
MS1_Freq_Max ---> MS:40000065
MS2_Density_Q... --> MS:40000062
MS2_Freq_Max --> MS:40000066
MS2_Prec_Z_1-5 (and higher charges) --> MS:40000063
MS2_Prec_unknown --> for now under MS:40000063 --> this value does not exist there yet --> create an issue and possibly add it to MS:40000063 (charge 0?)
RT_MS1_Q_000-100 --> MS:40000055 --> also fixed-size --> create an issue for variable length, as for MS:40000061
RT_MS2_Q_000-100 --> MS:40000056 --> also fixed-size --> create an issue for variable length, as for MS:40000061
RT_TIC_Q_000-100 --> MS:40000054 --> also fixed-size --> create an issue for variable length, as for MS:40000061
RT_duration --> MS:40000070 --> the minimum (the RT of the first spectrum) must be added here (perhaps also report MS:40000067, since we have everything needed for it)
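
A minimal sketch of how part of the mapping above could be applied in code. The accession strings are copied verbatim from the list; the function name `to_quality_metrics`, the example values, and the use of the column name as the metric `name` are assumptions for illustration only. The entry layout (`accession`, `name`, `value`) follows the general shape of an mzQC quality metric, not the actual McQuaC implementation.

```python
import json

# Column name -> accession, copied verbatim from the list above.
# Only a few entries are shown; the rest would be added the same way.
COLUMN_TO_ACCESSION = {
    "Base_Peak_Intensity_Max": "LOCAL:01",
    "Base_Peak_Intensity_Max_Up_To_105": "LOCAL:02",
    "MS1_Freq_Max": "MS:40000065",
    "RT_duration": "MS:40000070",
    "accumulated_Ms1_Tic": "MS:40000029",
    "accumulated_Ms2_Tic": "MS:40000030",
}


def to_quality_metrics(row):
    """Convert one result row (column -> value) into mzQC-style metric entries.

    Returns (metrics, unmapped): the metric dicts for mapped columns and the
    names of columns that have no accession yet.
    """
    metrics, unmapped = [], []
    for column, value in row.items():
        accession = COLUMN_TO_ACCESSION.get(column)
        if accession is None:
            unmapped.append(column)
            continue
        # The column name stands in for the official term name (assumption).
        metrics.append({"accession": accession, "name": column, "value": value})
    return metrics, unmapped


# Hypothetical values, for illustration only.
row = {
    "MS1_Freq_Max": 12.5,
    "RT_duration": [0.3, 7200.0],  # [minimum, maximum], as suggested above
    "feature_data": "omitted",
}
metrics, unmapped = to_quality_metrics(row)
print(json.dumps(metrics, indent=2))
print("unmapped:", unmapped)
```

Keeping the mapping in a single dictionary makes it easy to see which columns still lack a term and therefore need an issue.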

SPIKEINS
--> create a table type that can represent this generically --> then file it as an issue
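
A rough sketch of what such a generic table value might look like, assuming a column-oriented layout (column name -> list of values). The helper `rows_to_table` and the spike-in column names are hypothetical; no accession is shown because none has been decided in this discussion.

```python
def rows_to_table(rows):
    """Turn a list of row dicts into a column-oriented table (column -> list).

    All rows must share the same columns; otherwise a ValueError is raised.
    """
    if not rows:
        return {}
    columns = list(rows[0].keys())
    table = {column: [] for column in columns}
    for index, row in enumerate(rows):
        if set(row) != set(columns):
            raise ValueError(f"row {index} has different columns: {sorted(row)}")
        for column in columns:
            table[column].append(row[column])
    return table


# Hypothetical spike-in rows; column names and values are for illustration only.
spike_in_rows = [
    {"spike_in": "A", "rt_delta": 1.5, "max_intensity": 2000.0},
    {"spike_in": "B", "rt_delta": -0.5, "max_intensity": 3500.0},
]
spike_in_table = rows_to_table(spike_in_rows)
print(spike_in_table)
```

The same helper would work for any other table-like output, since it makes no assumption about which columns exist.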

THERMO|BRUKER difficult (perhaps leave as is)

Total_Ion_current_Max --> LOCAL:03 (single values)
Total_Ion_current_Max_up_to105 --> LOCAL:04 (single values)
accumulated_Ms1_Tic --> MS:40000029
accumulated_Ms2_Tic --> MS:40000030
feature_data --> omit
filteres_psms_ppm_error --> LOCAL:05 --> ppm error for each individual identified PSM --> calculate the individual values for MS:40000178 / MS:40000179 (the table is of course still included)
ms1_map_intens/mz/rt --> LOCAL:06 --> raw data from which a plot is made --> we turn it into a table (which is precisely defined in the metadata)
ms1_rt/tic array --> recalculate together with ms2 --> then store under MS:40000029 --> create a query/issue on whether it should also be done individually
ms2_rt/tic/mz array --> see above
num_feature_charge_ --> LOCAL:06 --> create an issue and request a term --> these are called quantification data points there
num_feature_ident_charge_ --> LOCAL:07 --> create an issue and request a term --> these are called quantification data points there
number_of_filtered_peptides --> MS:1003250 --> We take the peptidoforms here (doesn't quite fit either) 
number_of_filtered_psms --> we take MS:1003251 (note that we may have several hits per spectrum for the PSMs, so this is not quite correct) --> create an issue so that a term also exists for PSMs
number_of_proteins --> FDR-filtered --> MS:1003327
number_ungrouped_proteins --> FDR-filtered --> LOCAL:08 --> create an issue, because the term is missing
pia_output_z