lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

Differences between TSV and Parquet results.sage output #118

Closed (RalfG closed 3 months ago)

RalfG commented 5 months ago

Hi Michael,

While adding support for Sage Parquet result file output to psm_utils, I noticed some discrepancies between the TSV and the Parquet output. Apart from the different order of columns, some entries are present in one format but not in the other. In addition, label seems to be named is_decoy in the Parquet file.

Are the differences intentional? For me it would be more convenient if they could be unified.

Best, Ralf

| Parquet                 | TSV                     |
|-------------------------|-------------------------|
| aligned_rt              | aligned_rt              |
| calcmass                | calcmass                |
| charge                  | charge                  |
| delta_best              | delta_best              |
| delta_mobility          | delta_mobility          |
| delta_next              | delta_next              |
| delta_rt_model          | delta_rt_model          |
| expmass                 | expmass                 |
| filename                | filename                |
| fragment_ppm            | fragment_ppm            |
| hyperscore              | hyperscore              |
| ion_mobility            | ion_mobility            |
| is_decoy                |                         |
| isotope_error           | isotope_error           |
|                         | label                   |
| longest_b               | longest_b               |
| longest_y               | longest_y               |
| longest_y_pct           | longest_y_pct           |
| matched_intensity_pct   | matched_intensity_pct   |
| matched_peaks           | matched_peaks           |
| missed_cleavages        | missed_cleavages        |
|                         | ms2_intensity           |
| num_proteins            | num_proteins            |
| peptide                 | peptide                 |
| peptide_len             | peptide_len             |
| peptide_q               | peptide_q               |
| poisson                 | poisson                 |
| posterior_error         | posterior_error         |
| precursor_ppm           | precursor_ppm           |
| predicted_mobility      | predicted_mobility      |
| predicted_rt            | predicted_rt            |
| protein_q               | protein_q               |
| proteins                | proteins                |
| psm_id                  | psm_id                  |
| rank                    | rank                    |
| reporter_ion_intensity  |                         |
| rt                      | rt                      |
| sage_discriminant_score | sage_discriminant_score |
| scannr                  | scannr                  |
| scored_candidates       | scored_candidates       |
|                         | semi_enzymatic          |
| spectrum_q              | spectrum_q              |
| stripped_peptide        |                         |

lazear commented 5 months ago

Hi Ralf,

Parquet is (imo) a significantly better output format, since it allows us to annotate the datatype of each column and to add nested datatypes. For example, reporter_ion_intensity is stored as a list of floats rather than as a stringified representation, so it can be loaded directly as such in Python or R, and we don't need to write a variable number of columns depending on the TMT plex used.
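
To illustrate the difference, here is a minimal Python sketch using only the standard library. It mimics what a consumer of a text format has to do with a list-valued field: the list is flattened to a string on write and has to be parsed back on read. The JSON encoding and the sample values are assumptions for illustration, not Sage's actual TSV representation; with Parquet, the column would arrive as a typed list of floats with no parsing step.

```python
import csv
import io
import json

# Hypothetical PSM with made-up reporter ion intensities (one value per channel).
row = {"psm_id": 1, "reporter_ion_intensity": [1200.5, 980.0, 0.0, 1510.25]}

# Writing to a text format: the list has to be flattened into a string.
# JSON is an assumed encoding here, chosen only for the illustration.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), delimiter="\t")
writer.writeheader()
writer.writerow(
    {**row, "reporter_ion_intensity": json.dumps(row["reporter_ion_intensity"])}
)

# Reading it back: every field is a string until the consumer parses it.
buffer.seek(0)
loaded = next(csv.DictReader(buffer, delimiter="\t"))
raw = loaded["reporter_ion_intensity"]
parsed = json.loads(raw)

print(type(raw).__name__, raw)
print(type(parsed).__name__, parsed)
```

Note that psm_id also comes back as a string, which is the untyped-column problem in miniature.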

I don't really want to guarantee that the two output formats will have identical sets of columns - but the parquet format should ideally be a superset of those present in the TSV file.
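
Whether one output is a superset of the other can be checked by comparing the two header sets. The sketch below is a minimal Python example: the differing columns are taken from the table in this issue, the shared set is abbreviated, and the helper name compare_columns is made up for illustration. In practice the two lists would be read from the file headers.

```python
# A few of the columns shared by both outputs (abbreviated for the example).
SHARED = {"psm_id", "peptide", "proteins", "charge", "hyperscore", "spectrum_q"}

# Columns listed for only one of the two formats in the table in this issue.
PARQUET_ONLY = {"is_decoy", "reporter_ion_intensity", "stripped_peptide"}
TSV_ONLY = {"label", "ms2_intensity", "semi_enzymatic"}

parquet_columns = SHARED | PARQUET_ONLY
tsv_columns = SHARED | TSV_ONLY


def compare_columns(parquet, tsv):
    """Return (only_in_parquet, only_in_tsv, parquet_is_superset)."""
    parquet, tsv = set(parquet), set(tsv)
    only_parquet = sorted(parquet - tsv)
    only_tsv = sorted(tsv - parquet)
    return only_parquet, only_tsv, parquet >= tsv


only_parquet, only_tsv, is_superset = compare_columns(parquet_columns, tsv_columns)
print("only in Parquet:", only_parquet)
print("only in TSV:", only_tsv)
print("Parquet is a superset of TSV:", is_superset)
```

With the columns as listed in the table, the superset check fails because label, ms2_intensity, and semi_enzymatic appear only in the TSV output.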

Since users can request a separate percolator-compatible output file, the label field could be removed and is_decoy added to the TSV output instead.
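
Until the two outputs agree, a parser can reconcile the columns with a small helper. The sketch below assumes the Percolator-style convention in which label is 1 for targets and -1 for decoys, and that is_decoy is a boolean. The function name is_decoy_from_row is hypothetical and not part of Sage or psm_utils.

```python
def is_decoy_from_row(row):
    """Return True for a decoy PSM, given a row from either output format.

    Assumes `is_decoy` is a boolean (Parquet) and that `label` follows the
    Percolator-style convention: 1 = target, -1 = decoy (TSV).
    """
    if "is_decoy" in row:
        return bool(row["is_decoy"])
    if "label" in row:
        # TSV values are read as strings, so convert before comparing.
        return int(row["label"]) == -1
    raise KeyError("row has neither an 'is_decoy' nor a 'label' column")


# Made-up example rows, one per format.
parquet_row = {"psm_id": 1, "is_decoy": False}
tsv_row = {"psm_id": "2", "label": "-1"}

print(is_decoy_from_row(parquet_row))
print(is_decoy_from_row(tsv_row))
```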

RalfG commented 5 months ago

Thanks for the info! Fully agreed that Parquet is the better of the two formats, except maybe in the context of long-term archival. Having the Parquet file as a superset does make sense.