mmcdermott / EventStreamGPT

Dataset and modelling infrastructure for modelling "event streams": sequences of continuous time, multivariate events with complex internal dependencies.
https://eventstreamml.readthedocs.io/en/latest/
MIT License
94 stars · 15 forks

Invalid series dtype: expected `Utf8`, got `datetime[ns]` #106

Closed rvandewater closed 1 month ago

rvandewater commented 2 months ago

Hi,

I am getting the error below when adapting the local data example to my own data. Is there any indication in the documentation of which column data types are required, or of the config requirements? I can also share more information about the input data privately. Thanks!

python ./scripts/build_dataset.py --config-path="$(pwd)/sample_data/" --config-name=custom_dataset "hydra.searchpath=[$(pwd)/configs]"
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/vandewrp/projects/ESGPT/EventStream/data/dataset_base.py", line 271, in build_event_and_measurement_dfs
    new_events = cls._inc_df_col(events, "event_id", running_event_id_max)
  File "/home/vandewrp/projects/ESGPT/EventStream/data/dataset_polars.py", line 394, in _inc_df_col
    return df.with_columns(pl.col(col) + inc_by).collect(streaming=cls.STREAMING)
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/polars/utils/deprecation.py", line 93, in wrapper
    return function(*args, **kwargs)
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1695, in collect
    return wrap_df(ldf.collect())
exceptions.SchemaError: invalid series dtype: expected `Utf8`, got `datetime[ns]`

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vandewrp/projects/ESGPT/./scripts/build_dataset.py", line 372, in <module>
    main()
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/vandewrp/.conda/envs/esgpt/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/vandewrp/projects/ESGPT/./scripts/build_dataset.py", line 364, in main
    ESD = Dataset(config=config, input_schema=dataset_schema)
  File "/home/vandewrp/projects/ESGPT/EventStream/data/dataset_base.py", line 550, in __init__
    events_df, dynamic_measurements_df = self.build_event_and_measurement_dfs(
  File "/home/vandewrp/projects/ESGPT/EventStream/data/dataset_base.py", line 273, in build_event_and_measurement_dfs
    raise ValueError(f"Failed to increment event_id on {event_type}") from e
ValueError: Failed to increment event_id on ADMISSION

Config.yml (removed save/data dir for privacy reasons)

defaults:
  - dataset_base
  - _self_

# So that it can be run multiple times without issue.
do_overwrite: True

cohort_name: "sample"
subject_id_col: "id"
raw_data_dir:
save_dir:

DL_chunk_size: null

inputs:
  subjects:
    input_df: "${raw_data_dir}/base_data.parquet"
  admissions:
    input_df: "${raw_data_dir}/base_data.parquet"
    start_ts_col: "Aufnahme"
    end_ts_col: "Entlassung"
    ts_format: "Y/%m/%d/, %H:%M:%S"
#    event_type: ["OUTPATIENT_VISIT", "ADMISSION", "DISCHARGE"]
  vitals:
    input_df: "${raw_data_dir}/observations.parquet"
    ts_col: "datetime"
    ts_format: "Y/%m/%d/, %H:%M:%S"
#  labs:
#    input_df: "${raw_data_dir}/labs.csv"
#    ts_col: "timestamp"
#    ts_format: "Y/%m/%d/, %H:%M:%S"

measurements:
  static:
    single_label_classification:
      subjects: ["90_day_mortality", "30_day_mortality"]
#  functional_time_dependent:
#    age:
#      functor: AgeFunctor
#      necessary_static_measurements: { "dob": ["timestamp", "Y/%m/%d/, %H:%M:%S"] }
#      kwargs: { dob_col: "dob" }
#  dynamic:
#    multi_label_classification:
#      admissions: ["department"]
#    univariate_regression:
#      vitals: ["hr", "temp"]
#    multivariate_regression:
#      labs: [["lab_name", "lab_value"]]

outlier_detector_config:
  stddev_cutoff: 1.5
min_valid_vocab_element_observations: 5
min_valid_column_observations: 5
min_true_float_frequency: 0.1
min_unique_numerical_observations: 20
min_events_per_subject: 3
agg_by_time_scale: "1h"
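As a side note, the `ts_format` values above look off even for string timestamps: they begin with a literal `Y` rather than the `%Y` directive and contain a stray slash before the comma. A strptime-style format matching a string like `"2023/05/01, 08:30:00"` (an invented example) would be `"%Y/%m/%d, %H:%M:%S"`:

```python
from datetime import datetime

# Invented example timestamp string; the config's "Y/%m/%d/, %H:%M:%S"
# would not parse it, because "Y" there is a literal character, not a directive.
ts = "2023/05/01, 08:30:00"
parsed = datetime.strptime(ts, "%Y/%m/%d, %H:%M:%S")
print(parsed)  # 2023-05-01 08:30:00
```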
rvandewater commented 2 months ago

Okay, my suspicion is that the timestamps for the measurements and admission/discharge are in the datetime[ns] format. Is it possible to specify the config so that I can use these directly, or do I have to convert them to strings before running the script?

juancq commented 1 month ago

If those fields are already in datetime format, don't include the ts_format field in the YAML. The ts_format field is only necessary for dates stored as strings.
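Applied to the config above, that would mean dropping `ts_format` from the inputs whose timestamp columns are native datetimes (a sketch, assuming both parquet files store `datetime[ns]` timestamps):

```yaml
inputs:
  admissions:
    input_df: "${raw_data_dir}/base_data.parquet"
    start_ts_col: "Aufnahme"
    end_ts_col: "Entlassung"
    # ts_format omitted: columns are already datetime[ns]
  vitals:
    input_df: "${raw_data_dir}/observations.parquet"
    ts_col: "datetime"
    # ts_format omitted: column is already datetime[ns]
```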