probcomp / bdbcontrib

BayesDB contributions, including plotting, helper methods, and examples
http://probcomp.csail.mit.edu/bayesdb
Apache License 2.0

in test_one_variable histogram: ValueError: operands could not be broadcast together with shapes (14,) (15,) (14,) #94

Closed riastradh-probcomp closed 8 years ago

riastradh-probcomp commented 8 years ago
______________________________ test_one_variable _______________________________

def test_one_variable():
    (df, bdb) = prepare()
    for var in ['categorical_1', 'few_ints_3', 'floats_3', 'many_ints_4',
        'skewed_numeric_5']:
      cursor = bdb.execute('SELECT %s FROM plottest' % (var,))
      df = cursor_to_df(cursor)
      f = BytesIO()
      do((df, bdb), f, show_contour=False)
      assert has_nontrivial_contents_over_white_background(flush(f))
      cursor = bdb.execute('SELECT %s, categorical_2 FROM plottest' % (var,))
      df = cursor_to_df(cursor)
      f = BytesIO()
>         do((df, bdb), f, colorby='categorical_2', show_contour=False)

tests/test_plot_utils.py:140: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_plot_utils.py:89: in do
show_full=False, **kwargs)
build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py:759: in _pairplot
generator_name=generator_name, colors=colors)
build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py:419: in do_hist
kde=do_kde, ax=ax, color=color)
/tmp/riastradh/20151029/local/lib/python2.7/site-packages/seaborn/distributions.py:212: in distplot
color=hist_color, **hist_kws)
/tmp/riastradh/20151029/local/lib/python2.7/site-packages/matplotlib/axes/_axes.py:5678: in hist
m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = array([ 2.36098094,  0.        ,  0.        ,  0.        ,  0.        ,
  ...  0.        ,  0.31755309,  0.        ,  0.        ,  0.        ,  0.        ])
bins = 14.177446878757825, range = (0.0, 2.3609809375099999), normed = False
weights = None, density = None

def histogram(a, bins=10, range=None, normed=False, weights=None,
          density=None):
    """
    Compute the histogram of a set of data.

    Parameters
    ----------
    a : array_like
        Input data. The histogram is computed over the flattened array.
    bins : int or sequence of scalars, optional
        If `bins` is an int, it defines the number of equal-width
        bins in the given range (10, by default). If `bins` is a sequence,
        it defines the bin edges, including the rightmost edge, allowing
        for non-uniform bin widths.
    range : (float, float), optional
        The lower and upper range of the bins.  If not provided, range
        is simply ``(a.min(), a.max())``.  Values outside the range are
        ignored.
    normed : bool, optional
        This keyword is deprecated in Numpy 1.6 due to confusing/buggy
        behavior. It will be removed in Numpy 2.0. Use the density keyword
        instead.
        If False, the result will contain the number of samples
        in each bin.  If True, the result is the value of the
        probability *density* function at the bin, normalized such that
        the *integral* over the range is 1. Note that this latter behavior is
        known to be buggy with unequal bin widths; use `density` instead.
    weights : array_like, optional
        An array of weights, of the same shape as `a`.  Each value in `a`
        only contributes its associated weight towards the bin count
        (instead of 1).  If `normed` is True, the weights are normalized,
        so that the integral of the density over the range remains 1
    density : bool, optional
        If False, the result will contain the number of samples
        in each bin.  If True, the result is the value of the
        probability *density* function at the bin, normalized such that
        the *integral* over the range is 1. Note that the sum of the
        histogram values will not be equal to 1 unless bins of unity
        width are chosen; it is not a probability *mass* function.
        Overrides the `normed` keyword if given.

    Returns
    -------
    hist : array
        The values of the histogram. See `normed` and `weights` for a
        description of the possible semantics.
    bin_edges : array of dtype float
        Return the bin edges ``(length(hist)+1)``.

    See Also
    --------
    histogramdd, bincount, searchsorted, digitize

    Notes
    -----
    All but the last (righthand-most) bin is half-open.  In other words, if
    `bins` is::

      [1, 2, 3, 4]

    then the first bin is ``[1, 2)`` (including 1, but excluding 2) and the
    second ``[2, 3)``.  The last bin, however, is ``[3, 4]``, which *includes*
    4.

    Examples
    --------
    >>> np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
    (array([0, 2, 1]), array([0, 1, 2, 3]))
    >>> np.histogram(np.arange(4), bins=np.arange(5), density=True)
    (array([ 0.25,  0.25,  0.25,  0.25]), array([0, 1, 2, 3, 4]))
    >>> np.histogram([[1, 2, 1], [1, 0, 1]], bins=[0,1,2,3])
    (array([1, 4, 1]), array([0, 1, 2, 3]))

    >>> a = np.arange(5)
    >>> hist, bin_edges = np.histogram(a, density=True)
    >>> hist
    array([ 0.5,  0. ,  0.5,  0. ,  0. ,  0.5,  0. ,  0.5,  0. ,  0.5])
    >>> hist.sum()
    2.4999999999999996
    >>> np.sum(hist*np.diff(bin_edges))
    1.0

    """

    a = asarray(a)
    if weights is not None:
        weights = asarray(weights)
        if np.any(weights.shape != a.shape):
            raise ValueError(
                'weights should have the same shape as a.')
        weights = weights.ravel()
    a = a.ravel()

    if (range is not None):
        mn, mx = range
        if (mn > mx):
            raise AttributeError(
                'max must be larger than min in range parameter.')

    # Histogram is an integer or a float array depending on the weights.
    if weights is None:
        ntype = np.dtype(np.intp)
    else:
        ntype = weights.dtype

    # We set a block size, as this allows us to iterate over chunks when
    # computing histograms, to minimize memory usage.
    BLOCK = 65536

    if not iterable(bins):
        if np.isscalar(bins) and bins < 1:
            raise ValueError(
                '`bins` should be a positive integer.')
        if range is None:
            if a.size == 0:
                # handle empty arrays. Can't determine range, so use 0-1.
                range = (0, 1)
            else:
                range = (a.min(), a.max())
        mn, mx = [mi + 0.0 for mi in range]
        if mn == mx:
            mn -= 0.5
            mx += 0.5
        # At this point, if the weights are not integer, floating point, or
        # complex, we have to use the slow algorithm.
        if weights is not None and not (np.can_cast(weights.dtype, np.double) or
                                        np.can_cast(weights.dtype, np.complex)):
            bins = linspace(mn, mx, bins + 1, endpoint=True)

    if not iterable(bins):
        # We now convert values of a to bin indices, under the assumption of
        # equal bin widths (which is valid here).

        # Initialize empty histogram
        n = np.zeros(bins, ntype)
        # Pre-compute histogram scaling factor
        norm = bins / (mx - mn)

        # We iterate over blocks here for two reasons: the first is that for
        # large arrays, it is actually faster (for example for a 10^8 array it
        # is 2x as fast) and it results in a memory footprint 3x lower in the
        # limit of large arrays.
        for i in arange(0, len(a), BLOCK):
            tmp_a = a[i:i+BLOCK]
            if weights is None:
                tmp_w = None
            else:
                tmp_w = weights[i:i + BLOCK]

            # Only include values in the right range
            keep = (tmp_a >= mn)
            keep &= (tmp_a <= mx)
            if not np.logical_and.reduce(keep):
                tmp_a = tmp_a[keep]
                if tmp_w is not None:
                    tmp_w = tmp_w[keep]
            tmp_a = tmp_a.astype(float)
            tmp_a -= mn
            tmp_a *= norm

            # Compute the bin indices, and for values that lie exactly on mx we
            # need to subtract one
            indices = tmp_a.astype(np.intp)
            indices[indices == bins] -= 1

            # We now compute the histogram using bincount
            if ntype.kind == 'c':
                n.real += np.bincount(indices, weights=tmp_w.real, minlength=bins)
                n.imag += np.bincount(indices, weights=tmp_w.imag, minlength=bins)
            else:
>               n += np.bincount(indices, weights=tmp_w, minlength=bins).astype(ntype)
E               ValueError: operands could not be broadcast together with shapes (14,) (15,) (14,)

/tmp/riastradh/20151029/local/lib/python2.7/site-packages/numpy/lib/function_base.py:249: ValueError
riastradh-probcomp commented 8 years ago

% pip freeze
Babel==2.1.1
Cython==0.23.4
Jinja2==2.8
MarkupSafe==0.23
Pillow==3.0.0
Pygments==2.0.2
Sphinx==1.3.1
alabaster==0.7.6
argparse==1.2.1
backports.ssl-match-hostname==3.4.0.2
bayeslite==0.1.3rc1
bdbcontrib==0.1.3rc2
certifi==2015.9.6.2
crosscat==0.1.38
decorator==4.0.4
docutils==0.12
funcsigs==0.4
functools32==3.2.3-2
ipykernel==4.1.1
ipython==4.0.0
ipython-genutils==0.1.0
jsonschema==2.5.1
jupyter-client==4.1.1
jupyter-core==4.0.6
markdown2==2.3.0
matplotlib==1.4.3
mistune==0.7.1
mock==1.3.0
nbconvert==4.0.0
nbformat==4.0.1
nose==1.3.7
notebook==4.0.6
numpy==1.10.1
numpydoc==0.5
pandas==0.17.0
path.py==8.1.2
pbr==1.8.1
pexpect==4.0.1
pickleshare==0.5
ptyprocess==0.5
py==1.4.30
pyparsing==2.0.5
pytest==2.8.2
python-dateutil==2.4.2
pytz==2015.7
pyzmq==14.7.0
requests==2.8.1
scikit-learn==0.16.1
scipy==0.16.1
seaborn==0.6.0
simplegeneric==0.8.1
six==1.10.0
sklearn==0.0
sklearn-pandas==0.0.10
snowballstemmer==1.2.0
sphinx-rtd-theme==0.1.9
terminado==0.5
tornado==4.2.1
traitlets==4.0.0
wsgiref==0.1.2

gregory-marton commented 8 years ago

http://stackoverflow.com/questions/16015864/python-valueerror-operands-could-not-be-broadcast-together-with-shapes indicates that this is probably an error in our code that matplotlib 1.4.3 was just silently dealing with, while 1.5.0 actually checks and crashes.

gregory-marton commented 8 years ago
$ ./check.sh --pdb
E                   ValueError: operands could not be broadcast together with shapes (14,) (15,) (14,)

../../../pc/27/lib/python2.7/site-packages/numpy/lib/function_base.py:249: ValueError
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /Users/probcomp/pc/27/lib/python2.7/site-packages/numpy/lib/function_base.py(249)histogram()
-> n += np.bincount(indices, weights=tmp_w, minlength=bins).astype(ntype)
(Pdb) bc = np.bincount(indices, weights=tmp_w, minlength=bins).astype(ntype)
(Pdb) p bc
array([163,  11,   8,   7,   8,   2,   0,   0,   3,   1,   1,   0,   1,
         0,   2])
(Pdb) p n
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
(Pdb) p len(n)
14
(Pdb) p len(bc)
15

So that's the problem: n and bc have mismatched lengths (14 vs. 15), which is exactly what the error complains about.
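
For readers who have not hit this error before, here is a minimal sketch of the same failure outside of numpy.histogram. The array contents are made up for illustration; only the lengths 14 and 15 come from the pdb session above.

```python
import numpy as np

# Stand-ins for the two arrays in the pdb session: only the lengths matter.
n = np.zeros(14, dtype=np.intp)   # accumulator, like `n` in histogram()
bc = np.ones(15, dtype=np.intp)   # per-block counts, like `bc` above

try:
    # An in-place add cannot broadcast a length-15 array into a length-14 one.
    n += bc
except ValueError as exc:
    message = str(exc)
else:
    message = None

print(message)
```

The in-place form matters: the accumulator is also the output array, so NumPy has no way to grow it to fit.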

n was initialized as np.zeros(bins) and this is the first time through the loop.

(Pdb) p i
0
(Pdb) p bins
14.387494569938159
(Pdb) p len(np.zeros(bins))
14

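But np.bincount returns an array of length max(indices) + 1 (or minlength, if that is larger): http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.bincount.html

Here bins is a non-integer float. np.zeros(bins) truncates it to 14, but the check indices[indices == bins] -= 1 can never match a non-integer value, so the sample lying exactly on mx keeps index 14 and bincount returns 15 entries.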

So this is a coding error in numpy.histogram. What versions of numpy are we using? Jenkins is using 1.8.2; I'm using 1.10.1. Is this a regression in numpy? 1.8.2 uses linspace to make the right set of bins itself, so yes, perhaps. Interesting.
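
The mismatch can be replayed by hand with the arithmetic from the histogram source quoted above. This is only a sketch: the bins and range values are copied from the pdb session and the traceback, the data array `a` is hypothetical, and int(bins) stands in for the silent truncation that np.zeros(bins) performed in the NumPy version in use here (newer NumPy releases reject a float shape outright). Rounding the bin count up with np.ceil before calling np.histogram is one possible caller-side workaround, an assumption for illustration rather than the fix that was adopted.

```python
import numpy as np

bins = 14.387494569938159            # non-integer bin count from the pdb session
mn, mx = 0.0, 2.3609809375099999     # range from the traceback
a = np.array([mn, 0.3, 1.0, mx])     # hypothetical data that includes the maximum

# Same steps as the fast path in histogram(), with the truncation made explicit.
n = np.zeros(int(bins), dtype=np.intp)
norm = bins / (mx - mn)
indices = ((a - mn) * norm).astype(np.intp)
indices[indices == bins] -= 1        # never matches: bins is not a whole number
counts = np.bincount(indices, minlength=int(bins))

print(len(n), len(counts))           # the two lengths differ by one

# Possible workaround (assumption): hand np.histogram an integer bin count.
bins_int = int(np.ceil(bins))
hist, edges = np.histogram(a, bins=bins_int, range=(mn, mx))
```

With an integer bin count, the sample at mx is pulled back into the last bin and the histogram has exactly bins_int entries.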

Leaving it here for tonight.

gregory-marton commented 8 years ago

packaging/jenkins now codifies a build environment that we should be using on our laptops too, and that we should recommend to our users. Jenkins uses that build environment, and in that environment, this is no longer an issue.

https://github.com/probcomp/packaging/commit/c851d0ecaa52e250ff6ac398dd029b4fea7bafd1