Data type inconsistencies between database, processing, and output

CarlinLiao commented 1 year ago

There are at least two instances of this I'd like to point out

histological_structure being stored as string/VARCHAR in the database but coerced and assumed to be int in more recent features, like selection of cells by ID in FeatureMatrixExtractor. The database schema can be updated to canonize the int format.
Channel expression being expressed as 0/1 values for the sake of data expression but interpreted as bool for the purposes of DataFrame selection. Discussed to some extent in this PR. Instead of coercing to bool when we want to use it for DataFrame indexing, they could be converted to bool as soon as possible, and only stay ints when coming out of the postgres query and when saving and loading from compressed bytes file.

I think one or both of these situations would benefit from consistency.

jimmymathews commented 1 year ago

As for (1), the string values for all identifiers in the scstudies schema is a deliberate design choice, motivated by the aim of consistency across the schema for all identifiers. This greatly simplifies schema authoring and inter-table referencing. histological_structure is no exception and the apparent integrality of its identifiers is an implementation-specific artifact of the identifier-issuing portion of the (SPT) data import process. There are several schema alterations and additions that SPT does for performance purposes, since the ADI schema does not prioritize database performance, but I prefer to use such a mechanism only if there is a genuine performance-related purpose.

As for (2), "channel expression being expressed as 0/1" happens in only a very slim intermediate processing step. As I noted elsewhere in comments, the database tables do not use 0/1 values for expression, and boolean values are supported by the database but not used, since the expression values in the schema are not necessarily dichotomous.

Moreover the aim of consistency between the storage format in the database and the feature matrices' values is not highly prioritized, because they have different semantics. In the feature matrix dichotomous values is the paradigm, and in the database storage format this is not so. The inconsistency has a reason. ADI-compliant datasets could use trinary expression values, for example, like "high/low/absent", which would cause SPT's feature matrix functionality to fail, but that is SPT's problem not scstudies' problem.

jimmymathews commented 1 year ago

Closing for now pending a proposal for followup action.

nadeemlab / SPT

Data type inconsistencies between database, processing, and output #216