munichpavel / fake-data-for-learning

Sample interesting fake data for machine and human learning
https://munichpavel.github.io/fake-data-for-learning
MIT License

Ancestral sampling and pmf are inconsistent #13

munichpavel closed 4 years ago

munichpavel commented 4 years ago

Failing test:

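import numpy as np
import pandas as pd
# BayesianNodeRV and FakeDataBayesianNetwork are provided by the
# fake-data-for-learning package.


def test_rvs_counts():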
    pt_X0 = np.array([0.5, 0.5])
    X0 = BayesianNodeRV('X0', pt_X0)
    pt_X1cX0 = np.array([
            [0.25, 0.75],
            [0.75, 0.25],
    ])
    X1 = BayesianNodeRV('X1', pt_X1cX0, parent_names=['X0'])

    bn = FakeDataBayesianNetwork(X0, X1)
    samples = bn.rvs(size=10000, seed=42)
    sample_ratios = samples.groupby(['X0', 'X1']).size() / samples.shape[0]

    expected_index = pd.MultiIndex.from_tuples(
        [(0,0), (0,1), (1,0), (1,1)],
        names=['X0', 'X1'])
    expected_ratios = pd.Series(
        [0.125, 0.375, 0.375, 0.125],
        index=expected_index
    )
    # Compare within an absolute tolerance that leaves room for
    # sampling noise at 10,000 draws.
    pd.testing.assert_series_equal(
        sample_ratios, expected_ratios,
        check_exact=False, atol=0.02
    )
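
The expected ratios in the test are the joint pmf P(X0, X1) = P(X0) * P(X1 | X0). As a quick check, here is a minimal sketch that computes them directly with NumPy from the two probability tables. It does not use the package itself, and the name `joint` is only for illustration.

```python
import numpy as np

# Probability tables copied from the failing test
pt_X0 = np.array([0.5, 0.5])
pt_X1cX0 = np.array([
    [0.25, 0.75],  # P(X1 | X0 = 0)
    [0.75, 0.25],  # P(X1 | X0 = 1)
])

# Joint pmf: scale each row of the conditional table by P(X0 = row)
joint = pt_X0[:, np.newaxis] * pt_X1cX0

# Flatten in the order (0,0), (0,1), (1,0), (1,1) used by expected_index
print(joint.ravel())
```

The flattened values match the `expected_ratios` in the test, so the expectation itself is not the source of the failure.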
munichpavel commented 4 years ago

An even more obvious demonstration that ancestral sampling is wrong:

# X0, X1 independent
X0 = BayesianNodeRV('X0', np.array([0.5, 0.5]))
X1 = BayesianNodeRV('X1', np.array([0.5, 0.5]))

bn = FakeDataBayesianNetwork(X0, X1)
samples = bn.rvs(size=10000, seed=42)
sample_ratios = samples.groupby(['X0', 'X1']).size() / samples.shape[0]

expected_index = pd.MultiIndex.from_tuples(
    [(0,0), (0,1), (1,0), (1,1)],
    names=['X0', 'X1'])
expected_ratios = pd.Series(
    [0.25, 0.25, 0.25, 0.25],
    index=expected_index
)
# Compare within an absolute tolerance that leaves room for
# sampling noise at 10,000 draws.
pd.testing.assert_series_equal(
    sample_ratios, expected_ratios,
    check_exact=False, atol=0.02
)

Error message:

E       AssertionError: Series are different
E       
E       Series length are different
E       [left]:  2, MultiIndex([(0, 0),
E                   (1, 1)],
E                  names=['X0', 'X1'])
E       [right]: 4, MultiIndex([(0, 0),
E                   (0, 1),
E                   (1, 0),
E                   (1, 1)],
E                  names=['X0', 'X1'])

So the outcomes (0,1) and (1,0) never appear in ancestral sampling: every sample has X1 equal to X0, even though the two nodes are independent.
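
For comparison, here is a minimal sketch of what ancestral sampling should produce for the two-node network in the first comment. It is plain NumPy, not the package's implementation. The helper `ancestral_sample` and the choice of a single shared `np.random.default_rng` generator are assumptions made for illustration.

```python
import numpy as np

# Probability tables from the failing test in the first comment
pt_X0 = np.array([0.5, 0.5])
pt_X1cX0 = np.array([
    [0.25, 0.75],
    [0.75, 0.25],
])


def ancestral_sample(size, rng):
    """Hypothetical helper: draw X0 first, then X1 given the sampled X0."""
    x0 = rng.choice(2, size=size, p=pt_X0)
    # P(X1 = 1 | X0 = x0) is the second column of the conditional table
    p_x1_is_1 = pt_X1cX0[x0, 1]
    x1 = (rng.random(size) < p_x1_is_1).astype(int)
    return x0, x1


rng = np.random.default_rng(0)  # one generator shared by both nodes
x0, x1 = ancestral_sample(10000, rng)

# Empirical frequency of each outcome (X0, X1)
ratios = np.zeros((2, 2))
np.add.at(ratios, (x0, x1), 1)
ratios /= x0.size
print(ratios)
```

With a sampler like this, all four outcomes show up and the frequencies land close to the joint pmf, which is what the tests above expect from `bn.rvs`.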

munichpavel commented 4 years ago

The problem was with how the seeds were set, as in #12. I decided to drop the seed feature, as it had caused more trouble than it was worth.
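
One way a seed can cause exactly this symptom is if every node re-seeds its own generator with the same value, so that all nodes see the same random stream. The sketch below illustrates that mechanism in plain NumPy. It is an assumption about the failure mode, not the package's actual code, and `sample_node` is a hypothetical helper.

```python
import numpy as np

pmf = np.array([0.5, 0.5])
size = 10000


def sample_node(pmf, size, seed):
    """Hypothetical node sampler that builds a fresh generator from the seed."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(pmf), size=size, p=pmf)


# Same seed for both nodes: the two draws are identical
x0 = sample_node(pmf, size, seed=42)
x1 = sample_node(pmf, size, seed=42)
same_seed_outcomes = set(zip(x0.tolist(), x1.tolist()))

# One generator shared across nodes: the draws are independent
rng = np.random.default_rng(42)
y0 = rng.choice(len(pmf), size=size, p=pmf)
y1 = rng.choice(len(pmf), size=size, p=pmf)
shared_rng_outcomes = set(zip(y0.tolist(), y1.tolist()))

print(sorted(same_seed_outcomes))
print(sorted(shared_rng_outcomes))
```

In the first case only (0, 0) and (1, 1) can occur, which matches the error message above. Seeding once and passing the generator along keeps results reproducible without correlating the nodes.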

Closed at https://github.com/munichpavel/fake-data-for-learning/commit/2dbbf284f7d0bcc076dec16d3e9a1378ecf74363