opensafely-actions / cohort-report

Cohort-report generates a report for each variable in an input file
MIT License
0 stars 0 forks source link

Summarize some, but not all, variables #56

Open iaindillingham opened 3 years ago

iaindillingham commented 3 years ago

Problem: study definitions often have many variables, but researchers do not want to summarize all of them.

Related: The variable_types property is a mapping of variable names to variable types. It's required for csv and csv.gz input files (see also #55). It is not required for other types of input files, such as feather input files. The names/types information within variable_types takes precedence for other types of input files: that is, if a researcher provides a feather input file and variable_types, then the types in the former will be cast (transformed) to the types in the latter. However, variable_types does not filter the variables: all variables in the input file must be present in variable_types, or an error is raised.

Solution: allow researchers to summarize some, but not all, variables. Remove the distinction between csv and csv.gz input files and other types of input files. Replace variable_types with an optional variables property:

An example of a mapping:

actions:
  # Snip snip
  generate_report:
    run: cohort-report:v?.?.? output/input.csv
    needs: [generate_study_population]
    config:
      variables:
          age: int
          sex: categorical
          bmi: float
          diabetes: binary
    # Snip snip

An example of a list:

actions:
  # Snip snip
  generate_report:
    run: cohort-report:v?.?.? output/input.feather
    needs: [generate_study_population]
    config:
      variables:
          age
          sex
          bmi
          diabetes
    # Snip snip

See also:

HelenCEBM commented 3 years ago

Sounds great to me. One question - why do all fields have to be cast to string for csv/csv.gz? I thought that when loading a csv into python it infers dtypes where possible...

iaindillingham commented 3 years ago

Good point @HelenCEBM! cohort-report uses Pandas' read_csv to read CSVs, which seems to infer the data types. (I can't find this behaviour documented, though, and the dtype, converters, and parse_dates arguments suggest otherwise.) So, let's not cast all variables to StringDtype, only those variables that are of type object.

Doing so means we can tell cohort-report to handle StringDtype columns as discrete columns (group by/count; bar charts not histograms; etc.). Should a variable of type object be passed in the future, then cohort-report will again raise an assertion error, which we'd want to investigate.

LisaHopcroft commented 3 years ago

This also sounds good to me!