lazear / mz_parquet

RFC: mzparquet file format, additional considerations #1

Open lazear opened 1 year ago

lazear commented 1 year ago

This is an open discussion for modifications to the file format, and for suggestions as to what set of fields is minimally necessary to cover most analytical and data-processing use cases.

radusuciu commented 1 year ago

One field (or set of fields) that might be good to consider supporting is noise intensity. ThermoRawFileParser optionally outputs it alongside a few other arrays. Here's an example:

<binaryDataArrayList count="5">
  <binaryDataArray encodedLength="416">
    <cvParam cvRef="MS" accession="MS:1000514" value="" name="m/z array" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary><!-- snipped --></binary>
  </binaryDataArray>
  <binaryDataArray encodedLength="476">
    <cvParam cvRef="MS" accession="MS:1000515" value="" name="intensity array" unitAccession="MS:1000131" unitName="number of counts" unitCvRef="MS" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary><!-- snipped --></binary>
  </binaryDataArray>
  <binaryDataArray encodedLength="456">
    <cvParam cvRef="MS" accession="MS:1002745" value="" name="sampled noise baseline array" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary><!-- snipped --></binary>
  </binaryDataArray>
  <binaryDataArray encodedLength="412">
    <cvParam cvRef="MS" accession="MS:1002744" value="" name="sampled noise intensity array" unitAccession="MS:1000131" unitName="number of detector counts" unitCvRef="MS" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary><!-- snipped --></binary>
  </binaryDataArray>
  <binaryDataArray encodedLength="416">
    <cvParam cvRef="MS" accession="MS:1002743" value="" name="sampled noise m/z array" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
    <cvParam cvRef="MS" accession="MS:1000523" value="" name="64-bit float" />
    <cvParam cvRef="MS" accession="MS:1000574" value="" name="zlib compression" />
    <binary><!-- snipped --></binary>
  </binaryDataArray>
</binaryDataArrayList>

lazear commented 1 year ago

Yeah this is a good point - maybe another optional column group at the end, where binary data arrays can be stored?

optional group cv_binary_data (list) {
    repeated group list {
        required group element {
            required byte_array accession (utf8);
            required group value (list) {
                ...
            }
        }
    }
}

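In pyarrow terms, a minimal sketch of that column could look like the following - assuming the value lists hold 64-bit floats, which is an open question (some arrays may need other element types):

```python
import pyarrow as pa

# Sketch only: an optional list column of (accession, values) pairs per
# spectrum, e.g. ("MS:1002744", [...sampled noise intensities...]).
# float64 for the values is an assumption, not a settled choice.
cv_binary_data = pa.list_(pa.struct([
    ("accession", pa.string()),
    ("value", pa.list_(pa.float64())),
]))
```
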
radusuciu commented 1 year ago

I think that makes sense, but maybe you could hew a little closer to the mzML standard (in terms of naming things) to minimize confusion? See: https://peptideatlas.org/tmp/mzML1.1.0.html#binaryDataArray

If I understand your intent correctly you're not interested in supporting the entire mzML standard - just enough that you can cover your bases with sage?

Out of curiosity, I prompted Claude to generate a translation of the mzML standard .xsd to a parquet schema and it seems to have done a pretty decent job at first glance.

```python
import pyarrow as pa

# CVParam / UserParam are factored out so they can be reused throughout
# the schema (all attributes typed as strings here).
cv_param = pa.struct([
    ("name", pa.string()),
    ("value", pa.string()),
    ("unitAccession", pa.string()),
    ("unitName", pa.string()),
    ("unitCvRef", pa.string()),
])
user_param = pa.struct([
    ("name", pa.string()),
    ("type", pa.string()),
    ("value", pa.string()),
    ("unitAccession", pa.string()),
    ("unitName", pa.string()),
    ("unitCvRef", pa.string()),
])

schema = pa.schema([
    ("cvList", pa.list_(pa.struct([
        ("id", pa.string()),
        ("fullName", pa.string()),
        ("version", pa.string()),
        ("URI", pa.string()),
    ]))),
    ("fileDescription", pa.struct([
        ("fileContent", pa.struct([])),  # ParamGroup
        ("sourceFileList", pa.struct([
            ("sourceFile", pa.list_(pa.struct([
                ("id", pa.string()),
                ("name", pa.string()),
                ("location", pa.string()),
                ("fileFormat", pa.struct([])),  # ParamGroup
            ]))),
        ])),
        ("contacts", pa.list_(pa.struct([]))),  # ParamGroups
    ])),
    ("referenceableParamGroupList", pa.struct([
        ("referenceableParamGroup", pa.list_(pa.struct([
            ("id", pa.string()),
            ("cvParam", pa.list_(cv_param)),
            ("userParam", pa.list_(user_param)),
        ]))),
    ])),
    ("sampleList", pa.struct([
        ("sample", pa.list_(pa.struct([
            ("id", pa.string()),
            ("name", pa.string()),
            ("cvParam", pa.list_(cv_param)),
            ("userParam", pa.list_(user_param)),
        ]))),
    ])),
    ("softwareList", pa.struct([
        ("software", pa.list_(pa.struct([
            ("id", pa.string()),
            ("version", pa.string()),
            ("cvParam", pa.list_(cv_param)),
            ("userParam", pa.list_(user_param)),
        ]))),
    ])),
    ("scanSettingsList", pa.struct([
        ("scanSettings", pa.list_(pa.struct([
            ("id", pa.string()),
            ("cvParam", pa.list_(cv_param)),
            ("userParam", pa.list_(user_param)),
            ("sourceFileRefList", pa.struct([  # SourceFileRefListType
                ("sourceFileRef", pa.list_(pa.field("ref", pa.string()))),
            ])),
            ("targetList", pa.struct([  # TargetListType
                ("target", pa.list_(pa.struct([]))),  # ParamGroup
            ])),
        ]))),
    ])),
    ("instrumentConfigurationList", pa.struct([
        ("instrumentConfiguration", pa.list_(pa.struct([
            ("id", pa.string()),
            ("cvParam", pa.list_(cv_param)),
            ("userParam", pa.list_(user_param)),
            ("componentList", pa.struct([  # ComponentListType
                ("source", pa.list_(pa.struct([]))),  # SourceComponentType
                ("analyzer", pa.list_(pa.struct([]))),  # AnalyzerComponentType
                ("detector", pa.list_(pa.struct([]))),  # DetectorComponentType
            ])),
            ("softwareRef", pa.string()),  # SoftwareRefType
        ]))),
    ])),
    ("dataProcessingList", pa.struct([
        ("dataProcessing", pa.list_(pa.struct([
            ("id", pa.string()),
            ("processingMethod", pa.list_(pa.struct([  # ProcessingMethodType
                ("cvParam", pa.list_(cv_param)),
                ("userParam", pa.list_(user_param)),
                ("order", pa.int32()),
                ("softwareRef", pa.string()),
            ]))),
        ]))),
    ])),
    ("run", pa.struct([
        ("id", pa.string()),
        ("defaultInstrumentConfigurationRef", pa.string()),
        ("defaultSourceFileRef", pa.string()),
        ("sampleRef", pa.string()),
        ("startTimeStamp", pa.timestamp("us")),
        ("spectrumList", pa.struct([  # SpectrumListType
            ("spectrum", pa.list_(pa.struct([  # SpectrumType
                ("id", pa.string()),
                ("spotID", pa.string()),
                ("index", pa.int64()),
                ("defaultArrayLength", pa.int32()),
                ("dataProcessingRef", pa.string()),
                ("sourceFileRef", pa.string()),
                ("cvParam", pa.list_(cv_param)),
                ("userParam", pa.list_(user_param)),
                ("scanList", pa.struct([  # ScanListType
                    ("scan", pa.list_(pa.struct([  # ScanType
                        ("cvParam", pa.list_(cv_param)),
                        ("userParam", pa.list_(user_param)),
                        ("spectrumRef", pa.string()),
                        ("sourceFileRef", pa.string()),
                        ("externalSpectrumID", pa.string()),
                        ("instrumentConfigurationRef", pa.string()),
                    ]))),
                ])),
                ("precursorList", pa.struct([  # PrecursorListType
                    ("precursor", pa.list_(pa.struct([  # PrecursorType
                        ("spectrumRef", pa.string()),
                        ("sourceFileRef", pa.string()),
                        ("externalSpectrumID", pa.string()),
                        ("isolationWindow", pa.struct([])),  # ParamGroup
                        ("selectedIonList", pa.struct([])),  # SelectedIonListType
                        ("activation", pa.struct([])),  # ParamGroup
                    ]))),
                ])),
                ("productList", pa.struct([  # ProductListType
                    ("product", pa.list_(pa.struct([  # ProductType
                        ("isolationWindow", pa.struct([])),  # ParamGroup
                    ]))),
                ])),
                ("binaryDataArrayList", pa.struct([  # BinaryDataArrayListType
                    ("binaryDataArray", pa.list_(pa.struct([  # BinaryDataArrayType
                        ("binary", pa.binary()),
                        ("encodedLength", pa.int32()),
                        ("arrayLength", pa.int32()),
                        ("dataProcessingRef", pa.string()),
                        ("cvParam", pa.list_(cv_param)),
                        ("userParam", pa.list_(user_param)),
                    ]))),
                ])),
            ]))),
        ])),
        ("chromatogramList", pa.struct([  # ChromatogramListType
            ("chromatogram", pa.list_(pa.struct([  # ChromatogramType
                ("id", pa.string()),
                ("index", pa.int64()),
                ("defaultArrayLength", pa.int32()),
                ("dataProcessingRef", pa.string()),
                ("cvParam", pa.list_(cv_param)),
                ("userParam", pa.list_(user_param)),
                ("precursor", pa.struct([])),  # PrecursorType
                ("product", pa.struct([])),  # ProductType
                ("binaryDataArrayList", pa.struct([])),  # BinaryDataArrayListType
            ]))),
        ])),
    ])),
])
```

> The key aspects:
>
> - Uses PyArrow lists, structs, timestamps to capture hierarchical XML-like structure
> - Flattens data into columns for query/analyze efficiency
> - Captures all major elements and attributes from original schema
> - Allows nesting CVParams, UserParams throughout
> - Leaves some complex types like ParamGroup empty for now
>
> Let me carefully review the PyArrow schema I defined for any additional issues or limitations compared to the full mzML schema:
>
> - The controlled vocabularies (CVs) are only captured as string references right now. To fully implement CVs, we would want an Enum type in PyArrow mapped to each CV.
> - Some of the complex types like ParamGroup are currently empty structs. To fully implement mzML, we would need to map the possible child elements and attributes for these types.
> - The XML attributes are mapped to PyArrow fields, but attributes like "required" and data types are not captured. We could potentially add this metadata using PyArrow field metadata.
> - Key and keyref uniqueness constraints defined in the XML schema are not enforced in PyArrow. We would need to add custom validation logic to replicate these.
> - XML namespaces are not captured in the PyArrow mapping.
> - Cardinality of elements (minOccurs, maxOccurs) is not enforced in the PyArrow definition.
> - Additional validation rules in the XML schema around IDs and references are not implemented in PyArrow.
> - Attributes like "use" and documentation are not captured in the PyArrow definition.
> - The overall document structure defined in the root mzML element is not preserved.
> - No support for XML attributes like xsi:type to define polymorphic types.
>
> So in summary, the limitations are mainly around:
>
> - Enforcing uniqueness constraints
> - Preserving namespaces
> - Capturing full semantics of attributes
> - Polymorphic types
> - Documentation
>
> But I think the hierarchical structure and core elements are represented. Let me know if you see any other major gaps I should focus on addressing!
lazear commented 1 year ago

I think the primary goal is to just have a file format that is query-able in a relatively ergonomic manner, and can take advantage of existing distributed query engines to enable large-scale operations (>100k's of files). I will likely add support to sage to enable use cases like "search all MS2s from all raw files with a precursor between X and Y", since parquet enables easy slicing and concatenation of dataframes.
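
As an illustration, that kind of query could look as follows with pyarrow datasets - the column names (ms_level, precursor_mz) and path here are hypothetical, not part of the current draft:

```python
import pyarrow.dataset as ds

# Hypothetical flat columns; predicate pushdown lets the engine skip
# row groups whose statistics rule out the filter.
dataset = ds.dataset("s3://bucket/mzparquet/", format="parquet")
ms2 = dataset.to_table(
    filter=(ds.field("ms_level") == 2)
    & (ds.field("precursor_mz") >= 500.0)
    & (ds.field("precursor_mz") <= 505.0)
)
```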

Nesting the whole mzML schema seems overkill for this specific need (especially with respect to ergonomic queries - nesting isn't well supported in the engines I have tried; polars, for example, doesn't even support the map parquet logical type [dictionary]) - but I do think it would be smart to keep the naming consistent with mzML. I think unnesting mz and intensity is probably OK, since these will be present for practically all use cases I can think of (search, visualization), and it makes querying easier and faster.

chhh commented 1 year ago

No units of measurement?

1) Maybe specify the unit in the variable name, e.g. required float scan_start_time_minutes? (Historically we're on the minutes time scale with LC-MS.) Ideally units would be encoded in the type, of course, but that unnecessarily complicates things across languages. Specifying the unit in a separate element is another unnecessary complication that makes parsing/querying harder and more error-prone.

2) optional float inverse_ion_mobility - this one is trickier, I guess. 1/k0 - are these values really consistent across mass spec vendors? We have Bruker, Waters, Agilent with IM-enabled mass specs. When I was writing code dealing with IM-MS data, what I usually wanted was an IM index within the file (the index of the spectrum within the "frame"), not the 1/k0 value.

3) required float ion_injection_time -> ion_injection_time_ms

chhh commented 1 year ago

Definition of isolation window is both insufficient and redundant:

required group element {
    required float selected_ion_mz;
    optional int32 selected_ion_charge;
    optional float selected_ion_intensity;
    optional float isolation_window_target;
    optional float isolation_window_lower;
    optional float isolation_window_upper;
    optional byte_array spectrum_ref (utf8);
}

It assumes there's only one ion within the isolation window. This is usually not true. Also, why is selected_ion_mz required? What do you put there for DIA where by definition there's no specific ion being selected? What is isolation_window_target for? Looks like a remnant of mzML legacy.

This section:

required float selected_ion_mz;
optional int32 selected_ion_charge;
optional float selected_ion_intensity;

should probably become an optional struct (a "message" in these terms?) of its own, and the isolation window definition should likewise be split out, as sketched below.
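
Roughly, as a pyarrow sketch (field names and types are placeholders):

```python
import pyarrow as pa

# Selected ion as its own struct - absent (null) for DIA, and a list
# allows more than one ion per isolation window.
selected_ion = pa.struct([
    ("mz", pa.float32()),
    ("charge", pa.int32()),
    ("intensity", pa.float32()),
])

# Isolation window split out on its own.
isolation_window = pa.struct([
    ("target", pa.float32()),
    ("lower", pa.float32()),
    ("upper", pa.float32()),
])

precursor = pa.struct([
    ("selected_ions", pa.list_(selected_ion)),
    ("isolation_window", isolation_window),
])
```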

Definition of this series of isolation windows also needs a flag to distinguish between MSn events (when isolations are consecutive MS1->MS2->MS3..) and MSX events (as far as I know, only Thermo has that) where events are "parallel", i.e. MS1 ions are co-isolated and a single MS2 spectrum is measured.

optional byte_array spectrum_ref (utf8) - is it really such a good idea to use strings to reference spectra? It is somewhat convenient for debugging initially, but isn't re-numbering all spectra within a file with integers sufficient? Thermo has that numbering by default, for most other vendors a re-numbering scheme needs to be employed.

lazear commented 1 year ago

Excellent points all around. I like the idea of using indexes for spectrum_ref etc, but I think this necessitates the inclusion of an explicit index column as well. Not sure about indexing for IM, or what units to use (not an IM expert) - but I would imagine users might also want the actual values (e.g. for filtering purposes, or fitting prediction models).
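
Concretely, the explicit index plus integer references could be as simple as this sketch (placeholder names, not part of the current draft):

```python
import pyarrow as pa

# A 0-based spectrum index within the file, and integer (rather than
# string) references to other spectra; null when there is no precursor.
spectrum_ref_fields = [
    pa.field("index", pa.int64()),
    pa.field("precursor_spectrum_index", pa.int64(), nullable=True),
]
```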

As you noted, the isolation window/selected ion stuff is basically a direct copy from mzML - I agree that it is redundant. The selected ion stuff can most likely be eliminated, since I think it is almost always equivalent to the isolation window definition in an mzML file (for DIA, the behavior depends on which software writes the mzML; msconvert fills in the selected ion info from the filter string/isolation window, IIRC). As currently implemented, isolation_window_lower/upper are tolerances, not edges of the window, hence the need for the target (e.g. a target of 500.0 with lower/upper of 0.5 describes the window 499.5-500.5).

I'm not sure that there is a flag needed to distinguish MSn/MSX events - e.g. SPS-MS3 can be encoded perfectly fine as a list of 10 precursors (either selected ion, or isolation windows)

chhh commented 1 year ago

Regarding "perfectly fine encoding" - the problem is that it is then on the user to properly interpret the values.

I suggest that instead the format should describe, as unambiguously as possible, the actual thing that happened during acquisition.

What really physically happens is a sequence of repeated m/z ranges with associated accumulation times (selection events, S), followed by repeated fragmentation events (F), and these two sequences can repeat.

A universal fragmentation event description is hard to come up with; they all differ a lot. So I'd resort to just having a short string (not even an enum) for the fragmentation type and a single floating point value for the single main parameter of that fragmentation, without units (it can mean "energy", "reaction time", etc.). These are fine enough details that people rarely care about, and it feels like it's not worth trying to allow too much flexibility here.

In code the sequence can be represented by two separate lists, one for isolations (selections), one for fragmentations (activations). Each fragmentation will have a pointer (integer index) to the isolation event.
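
A rough pyarrow sketch of those two lists (names and types here are placeholders, not a settled proposal):

```python
import pyarrow as pa

# Selection (isolation) events: an m/z range with an accumulation time.
selection = pa.struct([
    ("mz_lower", pa.float64()),
    ("mz_upper", pa.float64()),
    ("accumulation_time_ms", pa.float32()),
])

# Fragmentation (activation) events: a short type string, one main
# parameter (energy, reaction time, ...), and an integer pointer back
# into the selections list.
fragmentation = pa.struct([
    ("type", pa.string()),          # e.g. "HCD", "ECD"
    ("parameter", pa.float32()),
    ("selection_index", pa.int32()),
])

per_spectrum_fields = [
    pa.field("selections", pa.list_(selection)),
    pa.field("fragmentations", pa.list_(fragmentation)),
]
```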

Each selection event can be augmented with a "list of target ions".

It would also be neat to have a top-level element in the file-level metadata with the set of all fragmentation type names used in the file. That is for querying datasets later, e.g. if you are searching for files that have ECD in them.

PS: I'm making these comments having implemented a few in-house LC-MS file formats.

chhh commented 1 year ago

Another major consideration - separation of binary data (spectra) and meta data (file-level and spectrum-level) into separate files. This might save a lot of bandwidth if only meta-info is needed for selecting the files to operate on.

lazear commented 1 year ago

For that level of granularity, separating metadata might be needed, yeah. Parquet files support including metadata at either file or column level, but I'm not sure how well supported this is for most of the common libraries for interacting with parquet (pandas, polars, etc).

Additional considerations include query overhead time - parquet supports pushdown predicates/bloom filters stored in the file. It should be possible to skip any files that do not contain ECD if it is cleverly encoded as a column. One nice use case (in my mind) for parquet is storing all MS data in a data lake (S3) and running queries directly on the files (e.g. AWS Athena, a distributed serverless SQL engine that can stream from S3).

Perhaps two columns:

This would actually get encoded as 6 physical columns, and should enable bloom filter creation for the fragmentation type column - a reader can just inspect the parquet metadata footer and know to immediately skip that file, as in the sketch below.
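
For example, with pyarrow one can read just the footer metadata, or let the dataset layer prune files and row groups via a pushdown filter - the fragmentation_type column and paths here are hypothetical:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Inspect only the footer: no spectral data is read.
meta = pq.read_metadata("run01.mzparquet")
print(meta.num_rows, meta.num_row_groups)

# Predicate pushdown: row groups whose column statistics rule out
# "ECD" are skipped without reading their data pages.
dataset = ds.dataset("s3://bucket/mzparquet/", format="parquet")
ecd = dataset.to_table(filter=ds.field("fragmentation_type") == "ECD")
```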

gessulat commented 1 year ago

First of all thanks for starting this project. Great idea, it looks really interesting!

What would be your thoughts on factoring out the mz / intensity columns into a separate peaks.parquet file, like below?

| spectrum_id | peak_index | mz | intensity |
| --- | --- | --- | --- |
| 1 | 0 | 301.13785 | 930.3335 |
| 1 | 1 | 301.14514 | 10505.5625 |
| 1 | 2 | 301.93048 | 806.79913 |
| ... | ... | ... | ... |
| 1 | 348 | 1339.5123 | 7536.851 |
| 1 | 349 | 1339.9305 | 7821.869 |
| 1 | 350 | 1362.5773 | 1362.5773 |

I am coming more from a relational database background, and factoring it out would be more normalized in an RDB sense. Querying or joining multiple files should still be easy via SQL engines (Athena, Presto, DataFusion, etc.), as in the example below. Benefits of factoring this out include direct range queries over individual peaks and easier integration of separately stored peak-level annotations (both come up again below).
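
For instance, with duckdb the long layout can be queried and joined directly against the parquet files (file and column names here are illustrative):

```python
import duckdb

# A range query over peaks, joined back to per-spectrum metadata;
# duckdb pushes the filter down into the parquet scan.
duckdb.sql("""
    SELECT s.scan_start_time, p.mz, p.intensity
    FROM 'peaks.parquet' p
    JOIN 'spectra.parquet' s USING (spectrum_id)
    WHERE p.mz BETWEEN 301.0 AND 302.0
""").show()
```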

What would be the benefits of keeping the peaks in lists in the mz and intensity columns? I might be totally missing that this is also already easy in the originally proposed format. If that's so, I would be happy to learn how! :)

chhh commented 1 year ago

@gessulat Downsides:

1) Spending an extra 8 bytes on those 2 indexes (which would double the file size if both mz and intensity are 4-byte floats). For timsTOF data this would be particularly bad - the data files are big to start with, and you would probably also want to add an ion mobility index on top.

2) Reading a spectrum (as e.g. a search engine would do for an MS2 spectrum) will require a SELECT instead of one large buffered read from a file/stream. Typical Thermo files have about 100'000 spectra per hour. One would need to measure how long it takes to perform 100'000 SELECTs where each query returns a set of 10-1000 peaks.

3) On the point about annotating ions - you definitely don't want to put that in the "raw LCMS data". You can come up with multiple different annotations for the same file - they should be stored separately.

lazear commented 1 year ago

While I like the idea of multiple files and relational structure (e.g. similar to how Hive/Iceberg tables work, with a JSON document containing metadata and links to the files), I think it will be hard to get buy-in from the community if the goal is to make it an end-user accessible format.

Using a single file has some distinct advantages: parquet files can be sliced or merged and remain valid 'mzparquet' files, and they are easier to transfer and handle for less technical users (no one can accidentally delete just the peak list file, etc.).

In terms of 'long' (column of individual m/z) vs 'wide' (list of m/z), I actually don't think it matters that much from a storage or query perspective - although I am not familiar with the internals of the various query engines. Parquet lays out the data pretty similarly (at least in my understanding) by encoding "repetition levels" that specify which record a value belongs to - essentially a toggle between being part of an array and being an individual value in a column (a 0 starts a new record, and a 1 marks a value belonging to the record opened by the most recent 0).

Physical layout on disk for wide format (list of m/z)

data [ 301.13785, 301.14514, 301.93048, 1339.5123 ]
rep  [ 0        , 1        , 1        , 1         ]

Physical layout on disk for long format (column of individual m/z)

data [ 301.13785, 301.14514, 301.93048, 1339.5123 ]
rep  [ 0        , 0        , 0        , 0         ]

I think more of the tradeoffs come from how the format will be used.

gessulat commented 1 year ago

Thank you for the comment and explanations!

1: The spectrum and file indices should compress fairly well, shouldn't they? If a spectrum has 400 peaks on average, run-length-style encoding should hopefully bring them down to about (1+1)/400 values per peak - the +1 storing the lookahead to the next unique index 400 peaks further on. If that's so, it might be a reasonable trade-off (a quick empirical check is sketched below).
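
One way to check is simply to write both layouts with synthetic data and compare file sizes - a sketch, with made-up dimensions:

```python
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Synthetic data: 1000 spectra x 400 peaks each.
n_spectra, peaks = 1000, 400
ids = np.repeat(np.arange(n_spectra, dtype=np.int64), peaks)
pidx = np.tile(np.arange(peaks, dtype=np.int32), n_spectra)
mz = np.random.uniform(300, 1400, n_spectra * peaks).astype(np.float32)

# Long layout: explicit spectrum_id / peak_index per peak.
pq.write_table(
    pa.table({"spectrum_id": ids, "peak_index": pidx, "mz": mz}),
    "long.parquet",
)

# Wide layout: one list of m/z values per spectrum.
pq.write_table(pa.table({"mz": pa.array(np.split(mz, n_spectra))}), "wide.parquet")

print(os.path.getsize("long.parquet"), os.path.getsize("wide.parquet"))
```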

2: Particularly in search engines, it could be helpful to do range queries for precursor mass or retention time. Even if that's not what you need, you can do selects for a batch of spectra. How would a single select for a large batch of spectra (e.g. with duckdb) be different from a large buffered read?

3: You are right that it makes sense to store annotations separately. I still think factoring out the peaks would make it easier to integrate those annotations from a separate file, since the peaks can then be queried directly - and for that it should not matter whether there is one annotation or several per peak. I think it is beneficial to design raw formats with flexible integration into downstream tasks in mind, so I don't think a hard separation into raw and non-raw formats is helpful.