Summarize some, but not all, variables

iaindillingham commented 3 years ago

Problem: study definitions often have many variables, but researchers do not want to summarize all of them.

Related: The variable_types property is a mapping of variable names to variable types. It's required for csv and csv.gz input files (see also #55). It is not required for other types of input files, such as feather input files. The names/types information within variable_types takes precedence for other types of input files: that is, if a researcher provides a feather input file and variable_types, then the types in the former will be cast (transformed) to the types in the latter. However, variable_types does not filter the variables: all variables in the input file must be present in variable_types, or an error is raised.

Solution: allow researchers to summarize some, but not all, variables. Remove the distinction between csv and csv.gz input files and other types of input files. Replace variable_types with an optional variables property:

If variables is a mapping, then the keys should be the names and the values the types. All input files will be filtered and cast using the provided names/types.
If variables is a list, then the elements should be the names. All input files will be filtered using the provided names. csv and csv.gz files will be cast to the StringDtype. Other types of input files will not be cast.
If variables is omitted, then the input files will not be filtered. csv and csv.gz files will be cast to the StringDtype.

An example of a mapping:

actions:
  # Snip snip
  generate_report:
    run: cohort-report:v?.?.? output/input.csv
    needs: [generate_study_population]
    config:
      variables:
          age: int
          sex: categorical
          bmi: float
          diabetes: binary
    # Snip snip

An example of a list:

actions:
  # Snip snip
  generate_report:
    run: cohort-report:v?.?.? output/input.feather
    needs: [generate_study_population]
    config:
      variables:
          age
          sex
          bmi
          diabetes
    # Snip snip

See also:

55
48
44
34
23

HelenCEBM commented 3 years ago

Sounds great to me. One question - why do all fields have to be cast to string for csv/csv.gz? I thought that when loading a csv into python it infers dtypes where possible...

iaindillingham commented 3 years ago

Good point @HelenCEBM! cohort-report uses Pandas' read_csv to read CSVs, which seems to infer the data types. (I can't find this behaviour documented, though, and the dtype, converters, and parse_dates arguments suggest otherwise.) So, let's not cast all variables to StringDtype, only those variables that are of type object.

Doing so means we can tell cohort-report to handle StringDtype columns as discrete columns (group by/count; bar charts not histograms; etc.). Should a variable of type object be passed in the future, then cohort-report will again raise an assertion error, which we'd want to investigate.

LisaHopcroft commented 3 years ago

This also sounds good to me!

opensafely-actions / cohort-report

Summarize some, but not all, variables #56

55

48

44

34

23