tommikaikkonen / prettyprinter

Syntax-highlighting, declarative and composable pretty printer for Python 3.5+
https://prettyprinter.readthedocs.io
MIT License

Further improve numpy array prettyprinting #49

Open anntzer opened 5 years ago

anntzer commented 5 years ago

Description

In #47 I added prettyprinting for numpy arrays essentially by converting them to nested lists, but this is unsatisfactory for multidimensional arrays. Indeed, numpy's array repr inserts additional spaces in order to align the elements, greatly improving legibility.

In [1]: np.random.rand(3, 3) * 100
Out[1]:
array([[63.0423951 ,  7.07847322, 66.5850687 ],
       [71.59167357, 70.52075727, 48.3925865 ],
       [43.17660142, 87.91482751, 99.78392189]])

Compare with prettyprinter's current output:

In [4]: np.random.rand(3, 3) * 100
Out[4]:
numpy.ndarray([
    [91.90306563557236, 3.4641894460186617, 87.67287734220052],
    [50.65858623035475, 8.228111580187969, 49.72967683713879],
    [37.013348107747646, 54.53515827530756, 4.736265881274138]
])

(Note the alignment of the second and third columns in numpy's output, which prettyprinter's output lacks. The difference in float precision comes from the np.set_printoptions(precision=...) setting, which is also lost after conversion to nested lists.)

I think a better approach would be to use numpy's array repr, strip out the leading "array(" and trailing ")", and insert the rest of the list "as is", with all spaces, into prettyprinter's machinery (this would, of course, allow one to benefit from prettyprinter's logic when printing arrays nested in other values, etc.). The best I could come up with so far relies on an intermediate wrapper class:

diff --git i/prettyprinter/extras/numpy.py w/prettyprinter/extras/numpy.py
index 1768afe..d8bc179 100644
--- i/prettyprinter/extras/numpy.py
+++ w/prettyprinter/extras/numpy.py
@@ -10,6 +10,11 @@ from ..prettyprinter import (
 )

+class _ArrayWrapper:
+    def __init__(self, array):
+        self.array = array
+
+
 def pretty_ndarray(value, ctx):
     import numpy as np
     # numpy 1.14 added dtype_is_implied.
@@ -19,7 +24,7 @@ def pretty_ndarray(value, ctx):
         # Masked arrays, in particular, require their own logic.
         return repr(value)
     from numpy.core import arrayprint
-    args = (value.tolist(),)
+    args = (_ArrayWrapper(value),)
     kwargs = []
     dtype = value.dtype
     # This logic is extracted from arrayprint._array_repr_implementation.
@@ -31,6 +36,10 @@ def pretty_ndarray(value, ctx):
     return pretty_call_alt(ctx, type(value), args, kwargs)

+def pretty_arraywrapper(value, ctx):
+    return repr(value.array)[6:-1]  # strip out "array(" and ")", could be made more robust
+
+
 def install():
     register_pretty("numpy.bool_")(pretty_bool)

@@ -46,3 +55,5 @@ def install():
         register_pretty("numpy." + name)(pretty_float)

     register_pretty("numpy.ndarray")(pretty_ndarray)
+    register_pretty("prettyprinter.extras.numpy._ArrayWrapper")(
+        pretty_arraywrapper)

but the indentation is wrong:

In [4]: np.random.rand(3, 3) * 100
Out[4]: 
numpy.ndarray(
    [[72.42269795, 40.88814098, 57.71100553],
       [53.82635482, 69.02850785, 18.03692554],
       [28.17979508, 76.66972263, 71.8467079 ]]
)
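The fixed slice [6:-1] in the diff above assumes that the repr starts with exactly "array(" and ends with a bare ")", which breaks as soon as numpy appends something like ", dtype=float32". A slightly more defensive variant is sketched below on plain strings, so it runs without numpy. strip_call_wrapper is a hypothetical helper name, not part of prettyprinter or numpy, and the scan assumes that no "[" or "]" appears inside string elements of the array.

```python
def strip_call_wrapper(text):
    """Return the outermost [...] literal of a repr like 'array([...], dtype=...)'.

    Hypothetical helper for illustration only. Assumes that no '[' or ']'
    characters occur inside string elements of the array.
    """
    start = text.find("[")
    if start == -1:
        raise ValueError("no list literal found in repr")
    depth = 0
    for pos in range(start, len(text)):
        if text[pos] == "[":
            depth += 1
        elif text[pos] == "]":
            depth -= 1
            if depth == 0:
                # Matching close of the outermost bracket: stop here and
                # drop whatever follows (e.g. ", dtype=float32)").
                return text[start:pos + 1]
    raise ValueError("unbalanced brackets in repr")


sample = "array([[1., 2.],\n       [3., 4.]], dtype=float32)"
body = strip_call_wrapper(sample)
print(body)
```

Continuation lines keep their original padding, so this only addresses the wrapper removal, not the indentation problem. A 0-d array such as "array(5)" has no bracket and raises ValueError in this sketch.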

Do you have a better approach to suggest, or any hints on properly inserting a literal string repr into prettyprinter?

Thanks in advance.

tommikaikkonen commented 5 years ago

This is one case that the Wadler-Leijen layout algorithm does not handle well - here's an excerpt from the docs of a prettyprinter library in Haskell that uses the same layout algorithm:

[Image: prettyprinter-algorithm2 (the documentation excerpt; not reproduced here)]

If this were implemented, it would have to be special-cased for values whose printed width we can determine cheaply, without doing the actual rendering, such as the numbers here (len(repr(1.23)) is pretty cheap). That would allow the pretty printer defined in the extra to manually calculate the spaces needed to align the columns.
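As a rough illustration of the width calculation described above, here is a minimal sketch in plain Python. align_columns is a hypothetical name and this is not prettyprinter's or numpy's implementation: it simply right-justifies each repr to its column's width, whereas numpy also normalises the float formatting itself. It assumes a non-empty 2-D nested list of numbers.

```python
def align_columns(rows):
    """Format a 2-D nested list so that every column has a common width.

    Widths come from len(repr(x)) alone, so nothing has to be laid out
    before the padding is known. Assumes `rows` is non-empty.
    """
    cells = [[repr(x) for x in row] for row in rows]
    n_cols = max(len(row) for row in cells)
    widths = [
        max(len(row[i]) for row in cells if i < len(row))
        for i in range(n_cols)
    ]
    lines = []
    for row in cells:
        padded = [cell.rjust(widths[i]) for i, cell in enumerate(row)]
        lines.append("[" + ", ".join(padded) + "]")
    # One leading space on continuation rows lines them up under the inner "[".
    return "[" + ",\n ".join(lines) + "]"


print(align_columns([[1.5, 22.25, 3.0], [10.125, 2.0, 33.5]]))
```

In a real extra, each padded cell would still need to go through the library's layout primitives to keep its coloring; the sketch only covers the arithmetic.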

Taking the repr of the value from numpy could work, but it would also make the repr'd part uncolored, because colored output is achieved in this library by annotating layout primitives with the kind of syntax element they represent (and then applying a theme to decide the final color to output). pretty_call and other functions provided by the library do that automatically, but a repr string carries no information about which parts should be displayed in number colors and which ones as punctuation (commas), etc. This is why I'd like to avoid that workaround in the numpy extra in this lib.
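To make the point about annotations concrete, here is a toy model of the idea. None of these names (THEME, render, the token tuples) are prettyprinter's real internals; they are assumptions made up for this sketch, which only shows why a pre-rendered string comes out uncolored.

```python
# Toy model of annotation-based coloring. These names are invented for
# illustration and do not correspond to prettyprinter's actual API.
THEME = {"number": "\x1b[36m", "punctuation": "\x1b[33m"}
RESET = "\x1b[0m"


def render(tokens, theme):
    """Concatenate (kind, text) tokens, coloring the kinds the theme knows."""
    out = []
    for kind, text in tokens:
        color = theme.get(kind)
        out.append(f"{color}{text}{RESET}" if color else text)
    return "".join(out)


# Built piece by piece: each fragment carries the kind of syntax it represents.
annotated = [
    ("punctuation", "["),
    ("number", "1.5"),
    ("punctuation", ", "),
    ("number", "2.5"),
    ("punctuation", "]"),
]
# A pre-rendered repr string is one opaque fragment with no kind attached.
opaque = [(None, "[1.5, 2.5]")]

print(render(annotated, THEME))
print(render(opaque, THEME))
```

Both calls print the same characters, but only the annotated version contains color escape codes, because the theme has nothing to look up for the opaque fragment.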

But I do think this should be possible to implement for this specific context (matrices of bools/numbers). Rendering matrices will have to ignore the main thing the layout algorithm in this library optimizes for, maintaining a maximum line-length, but for matrices the alignment is more useful and a more important goal. So it would also produce a prettier output. But the implementation does need a bit of knowledge about the internal layout primitives to do properly--I'll keep this feature in mind next time I spend time hacking on this lib :)

In the meantime, it probably makes sense for you to override the matrix prettyprinter in your project to use the workaround you're using now. I believe you can fix the indentation issue by making sure that the first line of the string you return includes the same amount of leading indentation whitespace as the remaining lines.
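One possible reading of that suggestion, sketched on a plain string: put back the width of the removed "array(" prefix as spaces on the first line, so all lines share the same margin, then strip that shared margin. realign_stripped_repr is a hypothetical helper name, and the sketch only normalises the string itself; re-indenting the continuation lines to the current nesting level depends on prettyprinter internals and is not shown.

```python
import textwrap


def realign_stripped_repr(stripped, prefix_width=len("array(")):
    """Give the first line the padding that the removed prefix used to provide.

    `stripped` is a multi-line repr whose 'array(' prefix has already been
    removed, so its continuation lines are still indented as if the prefix
    were present.
    """
    padded = " " * prefix_width + stripped
    # Every line now shares a margin of prefix_width spaces; remove it.
    return textwrap.dedent(padded)


stripped = (
    "[[72.42269795, 40.88814098, 57.71100553],\n"
    "       [53.82635482, 69.02850785, 18.03692554],\n"
    "       [28.17979508, 76.66972263, 71.8467079 ]]"
)
print(realign_stripped_repr(stripped))
```

After this step the continuation lines are indented by a single space relative to the first line, which is the offset needed to line up under the inner bracket.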

anntzer commented 5 years ago

Thanks for the detailed writeup; I got a reasonably working version going in #50. Personally, I'm not too bothered by the lack of coloring for the numbers -- for a large "field" of identically-typed numbers, coloring is less relevant (as opposed to complex, non-uniform constructs, for which coloring is very useful). What I appreciate most about prettyprinter is that it manages to keep the repr properly indented even when the array is part of a larger structure (which requires additional indent levels).