I agree with the new order of operations, but we need to address the following bug in this commit's current form.
The Bug
The current form returns a bin_array of size n_genes regardless of the size of the gene subset provided.
It currently fails the following test:
import numpy as np

N_GENES = 1000
T = 100

def get_expression() -> tuple[np.ndarray, np.ndarray]:
    # Synthetic gene names with normally distributed scores.
    genes = np.array([f"g.{g}" for g in np.arange(N_GENES)])
    scores = np.random.normal(size=N_GENES)
    return genes, scores

def test_subsetting():
    for _ in np.arange(T):
        genes, expression = get_expression()
        exp = ExpressionProfile(genes, expression)
        # Keep each gene with probability 0.5.
        subset = genes[np.random.random(genes.size) < 0.5]
        bin_sub = exp.get_gene_subset(subset)
        # The returned bins should match the subset, not the full gene list.
        assert bin_sub.size == subset.size
Solution
This could be fixed in _build_bool_array or _build_bin_array by dropping the entries whose indices were never set. Initializing bool_array with a sentinel value of -1 via np.full instead of np.zeros would make those unset entries identifiable.
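A minimal sketch of the sentinel idea, not the actual pypage internals: `build_subset_mask` and `subset_bins` are hypothetical stand-ins for whatever `_build_bool_array` and `_build_bin_array` do today, and the random `bins` array is only placeholder data. It assumes the mask uses an integer dtype so that -1 can be stored.

```python
import numpy as np

def build_subset_mask(all_genes: np.ndarray, subset: np.ndarray) -> np.ndarray:
    """Hypothetical helper: mark subset genes with 1, leave the rest at -1."""
    # -1 is the "unset" sentinel; np.zeros could not distinguish unset from set-to-0.
    mask = np.full(all_genes.size, -1, dtype=int)
    mask[np.isin(all_genes, subset)] = 1
    return mask

def subset_bins(bin_array: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Hypothetical helper: keep only the bins whose mask entry was set."""
    return bin_array[mask != -1]

genes = np.array([f"g.{g}" for g in range(1000)])
bins = np.random.randint(0, 10, size=genes.size)  # placeholder bin assignments
subset = genes[np.random.random(genes.size) < 0.5]

mask = build_subset_mask(genes, subset)
bin_sub = subset_bins(bins, mask)
assert bin_sub.size == subset.size
```

With this shape the size check from the failing test holds, since only the positions that were explicitly set survive the final indexing step.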
Leaving this open for now; I will circle back once the rest of the merge is complete.
This addresses the conversation from @artemy-bakulin's original commit fe688806bb7dc545b16bc1cf7c00a7360993cdc1. Artemy brings up the point about class imbalance here: https://github.com/noamteyssier/pypage/issues/33#issuecomment-1167958248