goodarzilab / pypage

python implementation of the PAGE algorithm
MIT License

Generate Bins on ExpressionProfile after Finding Gene Intersection #34

Open noamteyssier opened 2 years ago

noamteyssier commented 2 years ago

This addresses the discussion around @artemy-bakulin's original commit fe688806bb7dc545b16bc1cf7c00a7360993cdc1. Artemy brings up the point about class imbalance here: https://github.com/noamteyssier/pypage/issues/33#issuecomment-1167958248

I agree with the new order of operations, but we need to address the following bug in this commit's current form.

The Bug

The current form will return a bin_array of size n_genes regardless of the size of the gene subset provided.

It currently fails the following test:

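import numpy as np
# ExpressionProfile is the pypage class under test and must be imported as well.

N_GENES = 1000  # genes per simulated expression profile
T = 100  # number of randomized trials

def get_expression() -> tuple[np.ndarray, np.ndarray]: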
    genes = np.array([f"g.{g}" for g in np.arange(N_GENES)])
    scores = np.random.normal(size=N_GENES)
    return genes, scores

def test_subsetting():
    for _ in np.arange(T):
        genes, expression = get_expression()

        exp = ExpressionProfile(genes, expression)
        subset = genes[np.random.random(genes.size) < 0.5]

        bin_sub = exp.get_gene_subset(subset)
        assert bin_sub.size == subset.size

Solution

This could be fixed by adjusting _build_bool_array or _build_bin_array to keep only the indices that were actually set, for example by initializing bool_array with a sentinel of -1 through np.full instead of np.zeros, so that unset positions can be filtered out.
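A minimal sketch of the sentinel idea, to make the proposal concrete. This is not pypage's actual code: `subset_then_bin`, its parameters, and the rank-based equal-frequency binning are all hypothetical stand-ins for whatever _build_bool_array and _build_bin_array do internally. The point it illustrates is that a -1 fill lets unset positions be dropped, whereas a zero fill cannot be told apart from index 0.

```python
import numpy as np


def subset_then_bin(genes, scores, subset, n_bins=10):
    """Hypothetical helper: bin only the genes that appear in `subset`."""
    # Map each gene name to its position in the full profile.
    index = {g: i for i, g in enumerate(genes)}
    # -1 marks "unset"; with np.zeros an unset slot would look like index 0.
    positions = np.full(genes.size, -1, dtype=int)
    for g in subset:
        i = index.get(g)
        if i is not None:
            positions[i] = i
    # Keep only the positions that were set, so the output has subset size.
    kept = positions[positions >= 0]
    # Assumed binning scheme: equal-frequency bins from ranks within the subset.
    ranks = scores[kept].argsort().argsort()
    return ranks * n_bins // max(kept.size, 1)


rng = np.random.default_rng(0)
genes = np.array([f"g.{g}" for g in range(1000)])
scores = rng.normal(size=genes.size)
subset = genes[rng.random(genes.size) < 0.5]

bin_sub = subset_then_bin(genes, scores, subset)
assert bin_sub.size == subset.size
```

Because the bins are computed after the intersection, they are balanced over the subset rather than over all n_genes, which is the order of operations discussed above.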

I'm leaving this open for now and will circle back once the rest of the merge is complete.

noamteyssier commented 2 years ago

Added the test explicitly to test_expression.py in 7149ffac2f4857a06a0d9c4b55e64a89b67f7729.