mmcdermott / EventStreamGPT

Dataset and modelling infrastructure for "event streams": sequences of continuous-time, multivariate events with complex internal dependencies.
https://eventstreamml.readthedocs.io/en/latest/
MIT License

Pipeline for the UNIVARIATE_REGRESSION and MULTIVARIATE_REGRESSION tasks #39

Closed sujeongim closed 1 year ago

sujeongim commented 1 year ago

In the Measurement section, there are two modalities: UNIVARIATE_REGRESSION and MULTIVARIATE_REGRESSION.

So I thought that your model also considers observed values of type INT and FLOAT. However, I couldn't find information regarding two specific aspects:

In other words, I feel like I haven't fully grasped the process involved in the UNIVARIATE_REGRESSION and MULTIVARIATE_REGRESSION tasks.

Could you please guide me to the relevant code or documentation that explains the UNIVARIATE_REGRESSION and MULTIVARIATE_REGRESSION tasks, including both the model's prediction and evaluation processes?

I apologize if I missed any important details and I appreciate your assistance. Thank you in advance.

mmcdermott commented 1 year ago

Hi @sujeongim; thanks for the great question. Answers below, but please feel free to follow up if you have further questions!

Univariate vs. Multivariate

Univariate vs. multivariate regression affects both (a) how data are pre-processed and (b) how data are generated by the model. In particular, if you specify that a measurement has modality UNIVARIATE_REGRESSION, the pipeline will look in the appropriate source table for a numerical column whose name matches the measurement name, and it will learn an outlier detector and normalizer over that entire column. See, for example, the HR and temp measurements and the admit_vitals file in the synthetic tutorial on the dev branch:

https://eventstreamml.readthedocs.io/en/dev/_collections/local_tutorial_notebook.html#admit-vitals-csv
https://eventstreamml.readthedocs.io/en/dev/_collections/local_tutorial_notebook.html#telling-the-pipeline-what-to-do-input-config

In contrast, a multivariate measurement expects a categorical column in the underlying source with the same name as the measurement, plus a corresponding values column (as indicated in the measurement config) containing the numerical data. The data are then normalized, and outlier detectors learned, on a per-key basis.

See the lab_name measurement in the same tutorial and the corresponding labs.csv file: https://eventstreamml.readthedocs.io/en/dev/_collections/local_tutorial_notebook.html#labs-csv
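To make the pre-processing difference concrete, here is a minimal standard-library sketch (not the ESGPT implementation, which operates on Polars frames) of whole-column normalization for a univariate measurement vs. per-key normalization for a multivariate one, using toy data shaped like the tutorial's admit_vitals.csv and labs.csv:

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize(values):
    """Standardize a list of numbers to zero mean, unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Univariate (e.g. 'HR'): a single numeric column, so one normalizer
# is fit over the whole column.
hr = [72.0, 80.0, 95.0, 110.0]
hr_normalized = normalize(hr)

# Multivariate (e.g. 'lab_name' + a values column): one normalizer is
# fit per categorical key, since different labs live on different scales.
labs = [("creatinine", 1.1), ("creatinine", 0.9),
        ("sodium", 140.0), ("sodium", 136.0)]
by_key = defaultdict(list)
for key, value in labs:
    by_key[key].append(value)
labs_normalized = {key: normalize(vals) for key, vals in by_key.items()}
```

The per-key fitting is the important part: normalizing creatinine and sodium with a single shared mean and variance would wash out the creatinine signal entirely.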

This is also discussed here, though it may need more details: https://eventstreamml.readthedocs.io/en/dev/usage.html#measurement-observation-data-modality

On the modelling side, there isn't good documentation yet about how these differ, but in practice they just require slightly different loss functions and more careful handling during generation. If you'd like, I can point you to the relevant locations in the source code.
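As a rough illustration of why the losses differ (this is an assumed sketch, not ESGPT's actual loss code): both modalities reduce to a masked regression loss over events where the measurement was observed, but the multivariate case must additionally index the model's per-key predictions by whichever categorical key was actually observed at each event:

```python
def masked_mse(preds, targets, mask):
    """Mean squared error over positions where mask is True."""
    errs = [(p - t) ** 2 for p, t, m in zip(preds, targets, mask) if m]
    return sum(errs) / len(errs)

# Univariate: one predicted value per event; the mask marks events
# where the measurement (e.g. HR) was actually observed.
uni_loss = masked_mse([0.5, 0.0, 1.2], [0.4, 9.9, 1.0], [True, False, True])

# Multivariate: the model emits a value per key; only the observed
# key's prediction at each event contributes to the loss.
preds_per_key = [{"creatinine": 0.2, "sodium": 1.0},
                 {"creatinine": -0.5, "sodium": 0.1}]
observed = [("sodium", 1.1), ("creatinine", -0.4)]
multi_loss = masked_mse(
    [p[key] for p, (key, _) in zip(preds_per_key, observed)],
    [val for _, val in observed],
    [True, True],
)
```

During generation the same indexing issue appears in reverse: the model must first sample which keys are observed before it can emit values for them, which is the "more careful handling" mentioned above.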

INT vs. FLOAT types.

The NumericDataModalitySubtype enum dictates what kind of numerical value a numerical measurement (either univariate or multivariate) contains. This is decided on a case-by-case basis from the data itself, guided by two configuration options:

- If only a very small fraction of observations (less than config.min_true_float_frequency) take on non-integral values, the pipeline assumes the data should be integral and converts it to that prior to subsequent processing.
- If there are only a small number of unique values, with lots of repetition (fewer than config.min_unique_numerical_observations), the measurement is flagged as categorical, and its numerical values are dropped in favor of expanding (or adding new) categorical keys associated with that measure.

Both options exist because (a) some numerically encoded values (such as ordinal scales, like the SOFA score in intensive-care settings) are actually categorical in nature and should be treated as such, and (b) dropping unnecessary precision can ease the downstream modelling task, though none of our current models take advantage of that in practice. In the current incarnation, integer-valued observations are still represented as floating-point inputs to downstream modelling stages, but the conversion does affect outlier-detection and normalization options, and the values are rounded during pre-processing if they were converted to ints.
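The two heuristics above can be sketched as follows. This is an illustrative toy, not the library's code (the real logic lives in dataset_polars.py); the function name `classify_numeric` is hypothetical, though the two threshold names mirror the config options mentioned above:

```python
def classify_numeric(values,
                     min_true_float_frequency=0.1,
                     min_unique_numerical_observations=5):
    """Decide whether a numeric column is float, integer, or categorical."""
    # Few distinct values with lots of repetition -> treat as categorical
    # keys and drop the numbers (e.g. an ordinal score like SOFA).
    if len(set(values)) < min_unique_numerical_observations:
        return "categorical"
    # Almost everything integral -> round and treat as integer-valued.
    float_frac = sum(1 for v in values if v != int(v)) / len(values)
    if float_frac < min_true_float_frequency:
        return "integer"
    return "float"

sofa_scores = [1.0, 1.0, 2.0, 2.0, 3.0]            # 3 unique values
heart_rates = [70.0, 72.0, 75.0, 80.0, 91.0, 64.0]  # all integral
lab_values = [1.1, 0.9, 2.3, 4.5, 3.2, 0.7]         # truly continuous
```

With the default thresholds here, the three columns land in the three different subtypes, matching the motivating cases described above.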

This happens in the code here: https://github.com/mmcdermott/EventStreamGPT/blob/main/EventStream/data/dataset_polars.py#L765

There is a reasonable docstring there, but it isn't in any of the main documentation yet. I'll expand on that in the next update to the new tutorial in the dev branch.

mmcdermott commented 1 year ago

I'm closing this issue for now, but please feel free to re-open if you have any further questions!