Closed RalfG closed 3 months ago
Hi Ralf,
Parquet is (imo) a significantly better output format since it allows us to annotate the datatypes for each column and add nested datatypes. For example, `reporter_ion_intensity` is actually stored as a list of floats rather than a stringified representation, so it can be loaded directly as such by Python or R, and we don't need to write a variable number of columns depending on the TMTplex used.
The `is_decoy` column is a boolean to ease interpretation, instead of `label == 1` or `label == -1`. Using `label` is really a legacy holdover from when the TSV output could be directly loaded into percolator for rescoring.
`semi_enzymatic` & `ms2_intensity` missing from the parquet file is an oversight!
I don't really want to guarantee that the two output formats will have identical sets of columns, but the parquet format should ideally be a superset of those present in the TSV file.
Since users can request a separate percolator-compatible output file, the `label` field could be removed and `is_decoy` added to the TSV output instead.
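For anyone reading the TSV output in the meantime, here is a minimal sketch of deriving a boolean `is_decoy` from `label`. It assumes the Percolator convention (`1` = target, `-1` = decoy); the `peptide` column and the two sample rows are made up for illustration and are not taken from a real Sage results file.

```python
import csv
import io


def label_to_is_decoy(label):
    """Map a Percolator-style label (1 = target, -1 = decoy) to a boolean."""
    value = int(label)
    if value not in (1, -1):
        raise ValueError(f"unexpected label: {label!r}")
    return value == -1


def add_is_decoy(tsv_text):
    """Parse TSV text and replace the `label` column with a boolean `is_decoy`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        # Drop the legacy column and store the boolean in its place.
        row["is_decoy"] = label_to_is_decoy(row.pop("label"))
        rows.append(row)
    return rows


# Hypothetical two-row TSV: one target and one decoy.
example = "peptide\tlabel\nPEPTIDEK\t1\nKEDITPEP\t-1\n"
rows = add_is_decoy(example)
```

Raising on any value other than `1` or `-1` is a deliberate choice in this sketch, so that an unexpected label is noticed rather than silently treated as a target.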
Thanks for the info! Fully agreed that Parquet is the better of the two formats, except maybe in the context of long-term archival. Having the Parquet file as a superset does make sense.
Hi Michael,
While adding support for Sage Parquet result file output to psm_utils, I noticed some discrepancies between the TSV and the Parquet output. Apart from the different order of columns, some entries are present in one format but not in the other.
`label` seems to be named `is_decoy` in the Parquet file.
Are the differences intentional? For me it would be more convenient if they could be unified.
Best, Ralf
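A rough sketch of how such a column comparison could be done, including the "Parquet should be a superset" check. The column lists below are illustrative only: they reuse names mentioned in this thread plus a hypothetical `peptide` column, and are not the real headers of either file. The `renamed` mapping is likewise an assumption of this sketch, used to treat `label` and `is_decoy` as the same field.

```python
def compare_columns(tsv_columns, parquet_columns, renamed=None):
    """Report columns unique to each format and whether Parquet covers the TSV.

    `renamed` maps a TSV column name to its Parquet equivalent, so a renamed
    column is not reported as missing.
    """
    renamed = renamed or {}
    tsv = {renamed.get(name, name) for name in tsv_columns}
    parquet = set(parquet_columns)
    return {
        "only_in_tsv": sorted(tsv - parquet),
        "only_in_parquet": sorted(parquet - tsv),
        "parquet_is_superset": parquet >= tsv,
    }


# Illustrative column lists, not the actual file headers.
tsv_cols = ["peptide", "label", "semi_enzymatic", "ms2_intensity"]
parquet_cols = ["peptide", "is_decoy", "reporter_ion_intensity"]

report = compare_columns(tsv_cols, parquet_cols, renamed={"label": "is_decoy"})
```

With these example lists, `report["only_in_tsv"]` contains `ms2_intensity` and `semi_enzymatic`, so the superset check is `False` until those columns are added to the Parquet output.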