Closed CarlinLiao closed 1 year ago
As for (1), the use of string values for all identifiers in the scstudies schema is a deliberate design choice, motivated by the aim of consistency across the schema. This greatly simplifies schema authoring and inter-table referencing. histological_structure is no exception, and the apparent integrality of its identifiers is an implementation-specific artifact of the identifier-issuing portion of the (SPT) data import process. SPT does make several schema alterations and additions for performance purposes, since the ADI schema does not prioritize database performance, but I prefer to use such a mechanism only if there is a genuine performance-related purpose.
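To illustrate the point about identifiers, here is a minimal sketch of selecting cells by ID while treating identifiers as opaque strings rather than coercing them to int. The function name `select_cells` and the row layout are hypothetical, made up for this example; this is not SPT's actual FeatureMatrixExtractor code.

```python
from __future__ import annotations

from typing import Iterable


def select_cells(
    rows: Iterable[tuple[str, float]],
    wanted_ids: Iterable[str | int],
) -> list[tuple[str, float]]:
    """Keep rows whose identifier appears in wanted_ids.

    Hypothetical helper: identifiers are compared as strings, so the caller
    may pass ints for convenience, but nothing here assumes that every
    identifier in the data is integer-like.
    """
    wanted = {str(identifier) for identifier in wanted_ids}
    return [row for row in rows if row[0] in wanted]


# Assumed example data: (histological_structure identifier, some feature value).
rows = [("101", 0.5), ("102", 1.25), ("cell-7", 2.0)]
selected = select_cells(rows, [101, "cell-7"])
```

Normalizing the requested IDs to strings at the boundary keeps the selection working if a dataset ever issues identifiers that are not integer-like, which the schema permits.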
As for (2), "channel expression being expressed as 0/1" happens only in a very brief intermediate processing step. As I noted elsewhere in comments, the database tables do not use 0/1 values for expression. Boolean values are supported by the database but not used, since the expression values in the schema are not necessarily dichotomous.
Moreover, consistency between the storage format in the database and the values in the feature matrices is not a high priority, because the two have different semantics. In the feature matrix, dichotomous values are the paradigm; in the database storage format this is not so. The inconsistency has a reason. ADI-compliant datasets could use trinary expression values, for example "high/low/absent", which would cause SPT's feature matrix functionality to fail, but that is SPT's problem, not scstudies' problem.
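As a rough sketch of why the two formats differ, the conversion from stored expression values to a dichotomous feature matrix column might look like the following. The labels "positive" and "negative" and the function name `dichotomize` are assumptions for illustration only; the actual stored values and SPT's real conversion code may differ.

```python
def dichotomize(values, positive="positive", negative="negative"):
    """Map stored expression labels to 0/1 for a feature matrix column.

    The label names are hypothetical. Any value outside the two expected
    labels (for example a trinary "high/low/absent" scheme) raises an error
    instead of being silently coerced.
    """
    mapping = {positive: 1, negative: 0}
    unexpected = sorted(set(values) - set(mapping))
    if unexpected:
        raise ValueError(f"non-dichotomous expression values: {unexpected}")
    return [mapping[value] for value in values]


column = dichotomize(["positive", "negative", "positive"])
```

The 0/1 encoding exists only on the feature matrix side of this step; the stored values stay whatever the dataset declares, which is why a trinary dataset would break the conversion without being invalid under the schema.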
Closing for now pending a proposal for followup action.
There are at least two instances of this I'd like to point out:
histological_structure being stored as string/VARCHAR in the database but coerced and assumed to be int in more recent features, like selection of cells by ID in FeatureMatrixExtractor. The database schema can be updated to canonize the int format.
I think one or both of these situations would benefit from consistency.