This addresses the conversation with @artemy-bakulin on the original commit fe688806bb7dc545b16bc1cf7c00a7360993cdc1. Artemy brings up the point about class imbalance here: https://github.com/noamteyssier/pypage/issues/33#issuecomment-1167958248

I agree with the new order of operations, but we need to address the following bug in this commit's current form.

The Bug

The current form will return a bin_array of size n_genes regardless of the size of the gene subset provided.

It currently fails the following test:

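import numpy as np
from pypage import ExpressionProfile

N_GENES = 1000  # number of genes in the simulated expression profile
T = 100  # number of random trials

def get_expression() -> tuple[np.ndarray, np.ndarray]: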
    genes = np.array([f"g.{g}" for g in np.arange(N_GENES)])
    scores = np.random.normal(size=N_GENES)
    return genes, scores

def test_subsetting():
    for _ in np.arange(T):
        genes, expression = get_expression()

        exp = ExpressionProfile(genes, expression)
        subset = genes[np.random.random(genes.size) < 0.5]

        bin_sub = exp.get_gene_subset(subset)
        assert bin_sub.size == subset.size

Solution

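This could be fixed by adjusting _build_bool_array or _build_bin_array to drop the indices that were never set. One way is to initialize bool_array with np.full and a fill value of -1 instead of np.zeros, so that unset positions can be told apart from real values and subset out.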
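A minimal sketch of the sentinel idea, to make the proposal concrete. This is not the pypage implementation: build_subset_bin_array and its arguments are hypothetical names, and it assumes bin labels are non-negative integers so that -1 can safely mark "unset".

```python
import numpy as np


def build_subset_bin_array(
    genes: np.ndarray, bin_array: np.ndarray, subset: np.ndarray
) -> np.ndarray:
    """Hypothetical helper: return bins only for the genes found in `subset`."""
    # Map each gene name to its position in the full profile.
    index = {gene: i for i, gene in enumerate(genes)}

    # -1 marks positions that were never set. With np.zeros these would be
    # indistinguishable from a real bin label of 0.
    out = np.full(genes.size, -1, dtype=int)
    for gene in subset:
        i = index.get(gene)
        if i is not None:  # ignore genes missing from the profile
            out[i] = bin_array[i]

    # Keep only the positions that were set, so the result has one entry
    # per subset gene present in the profile.
    return out[out != -1]
```

With this shape the result size tracks the subset rather than n_genes, which is what test_subsetting asserts.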

Leaving this open for now; I will circle back once the rest of the merge is complete.

goodarzilab / pypage

Generate Bins on ExpressionProfile after Finding Gene Intersection #34
