Open iaindillingham opened 3 years ago
Sounds great to me. One question - why do all fields have to be cast to string for csv/csv.gz? I thought that when loading a csv into python it infers dtypes where possible...
Good point @HelenCEBM! cohort-report uses Pandas' `read_csv` to read CSVs, which seems to infer the data types. (I can't find this behaviour documented, though, and the `dtype`, `converters`, and `parse_dates` arguments suggest otherwise.) So, let's not cast all variables to `StringDtype`, only those variables that are of type `object`.
Doing so means we can tell cohort-report to handle `StringDtype` columns as discrete columns (group by/count; bar charts not histograms; etc.). Should a variable of type `object` be passed in the future, then cohort-report will again raise an assertion error, which we'd want to investigate.
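A minimal sketch of that idea, not cohort-report's actual code: read a CSV with default arguments, then cast only the `object` columns to `StringDtype`. The data and column names are hypothetical, and it assumes a pandas version in which `read_csv` reads text columns as `object`.

```python
import io

import pandas as pd

# Hypothetical data; the column names are illustrative only.
csv_text = "patient_id,age,sex\n1,34,F\n2,51,M\n3,47,F\n"

# With default arguments, read_csv infers a dtype for each column.
df = pd.read_csv(io.StringIO(csv_text))

# Cast only the object columns to StringDtype; inferred numeric columns are left alone.
object_columns = df.select_dtypes(include="object").columns
df = df.astype({name: "string" for name in object_columns})

# A StringDtype column can then be summarised as discrete values (group by/count).
counts = df["sex"].value_counts()
```

After the cast, no column should be left with the `object` dtype, so an `object` column showing up later would signal something unexpected.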
This also sounds good to me!
Problem: study definitions often have many variables, but researchers do not want to summarize all of them.
Related: The `variable_types` property is a mapping of variable names to variable types. It's required for csv and csv.gz input files (see also #55). It is not required for other types of input files, such as feather input files. The names/types information within `variable_types` takes precedence for other types of input files: that is, if a researcher provides a feather input file and `variable_types`, then the types in the former will be cast (transformed) to the types in the latter. However, `variable_types` does not filter the variables: all variables in the input file must be present in `variable_types`, or an error is raised.

Solution: allow researchers to summarize some, but not all, variables. Remove the distinction between csv and csv.gz input files and other types of input files. Replace `variable_types` with an optional `variables` property:

- If `variables` is a mapping, then the keys should be the names and the values the types. All input files will be filtered and cast using the provided names/types.
- If `variables` is a list, then the elements should be the names. All input files will be filtered using the provided names. csv and csv.gz files will be cast to the `StringDtype`. Other types of input files will not be cast.
- If `variables` is omitted, then the input files will not be filtered. csv and csv.gz files will be cast to the `StringDtype`.

An example of a mapping:
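As a rough illustration of the mapping case, here is a sketch in Python that filters and casts a data frame. The variable names, the types, and the use of a plain dict are assumptions made for illustration; the real configuration format may differ.

```python
import pandas as pd

# Hypothetical mapping of variable names to types (names and types are illustrative).
variables = {"age": "int64", "sex": "category"}

# Hypothetical input with one variable ("region") that is not in the mapping.
df = pd.DataFrame({"age": ["34", "51"], "sex": ["F", "M"], "region": ["N", "S"]})

# Filter to the named variables, then cast each one to its declared type.
summary_input = df[list(variables)].astype(variables)
```

Here `region` is dropped because it is not a key of the mapping, and the remaining columns take the declared types.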
An example of a list:
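As a rough illustration of the list case, the sketch below filters by name and casts to `StringDtype` only when the input came from a csv or csv.gz file. The helper `select_variables`, its `from_csv` flag, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical list of variable names to summarize.
variables = ["age", "sex"]


def select_variables(df, variables, from_csv):
    """Keep only the named variables; cast to StringDtype for csv/csv.gz input."""
    selected = df[variables]
    if from_csv:
        # csv and csv.gz input: cast every selected column to StringDtype.
        selected = selected.astype("string")
    # Other input types (e.g. feather) keep the types stored in the file.
    return selected


df = pd.DataFrame({"age": [34, 51], "sex": ["F", "M"], "region": ["N", "S"]})
from_csv_result = select_variables(df, variables, from_csv=True)
from_feather_result = select_variables(df, variables, from_csv=False)
```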
See also: #55, #48, #44, #34, #23