saulpw / visidata

A terminal spreadsheet multitool for discovering and arranging data
http://visidata.org
GNU General Public License v3.0
7.89k stars 280 forks source link

Add option to guess at column types #349

Closed khughitt closed 4 years ago

khughitt commented 5 years ago

Small description

Although I see that visidata intentionally does not guess at column types, I find this functionality (e.g. in pandas/readr/etc.) quite convenient, even if it's not 100% perfect.

Is there any possibility that such functionality could be added, perhaps disabled by default if you wish to preserve the current approach? Although pandas is not a dependency, the logic for such type guessing could still be borrowed from their codebase to save time.

saulpw commented 5 years ago

Hi @khughitt, you could make a command that looped over the columns and guessed the type for each one that is anytype. Like (this is untested but it should basically work):

Sheet.addCommand('gz~', 'guess-types', 'for c in visibleCols: if c.type is anytype: c.type = c.guessType(visibleRows)')

@Column.api
def guessType(col, rows):
    vals = [ col.getValue(r) for r in rows ]
    try:
        if int(vals[0]):
            return int
        ...
    except Exception:
        return anytype

gz~ is probably the wrong keybinding (z~ already sets type to anytype), but this should be a reasonable structure for a type-guessing plugin. Though I realize this still leaves you with the difficult task of figuring out the heuristics for column guessing, which I have been unable to do sufficiently to my satisfaction (and thus why VisiData does not try :). If you are willing to port or otherwise use the pandas code, I think it would be great to have it as a plugin!

khughitt commented 5 years ago

@saulpw Thanks for the suggestion! I checked out the Pandas code for type inference, and it looks like it is written in Cython / wouldn't be so straight-forward to port.

Out of curiosity, have you considered using Pandas in visidata, and if so, did you rule it out simply to keep the dependencies minimal? It seems like it could be really useful to use as an internal data-model for visidata, considering that this is exactly the kind of data it is geared primarily towards and the large amount of effort and testing that have gone into the project. Also seems like it would open up visidata to some interesting plugin opportunities. Are there any scenarios where the extra overhead / dependencies would be prohibitive for visidata use?

saulpw commented 5 years ago

There is a pandas loader, which is what we use to load some of the more obscure formats. @azjps did some good work to use pandas to compute frequency tables on pandas-based sheets.

Yes, I wanted to avoid the Pandas dependency for core VisiData (esp as it's quite large, and not pure python). However the main reason was that, while Pandas can do a lot, it also imposes a bunch of limitations. VisiData rows can be a collection of literally any Python object, and that's where a lot of its internal power and flexibility come from. If we based the internal data model on Pandas, then we couldn't easily make e.g. the Columns Sheet work, or in fact even make something like the Pandas wrapper (which at its core is quite simple).

anjakefala commented 5 years ago

@khughitt I can answer a few questions and share some thoughts!

We have an optional pandas loader. =) A loader is essentially a module which directs how VisiData structures and engages with a particular data source. The code is located here and here.

To load a file format which is supported by pandas, execute vd -f pandas data.foo. This will call pandas.read_foo().

For example:

vd -f pandas data.parquet

loads a parquet file.

It is currently not working on the develop branch, but you can check it out on stable! And if you want to help us develop it further, we would definitely welcome the help! We want to give the pandas loader more love, but have not allocated time yet.


That being said, pandas is an optional dependency and the pandas loader is optional. I will let @saulpw share his personal feelings on what the 'data model' is in VisiData and why it was so important to him to keep mandatory dependencies to the minimal. You can flip through some currently available sections of the book for a hint of his design philosophies.

For me personally as someone who packages VisiData and keeps an eye on its dependencies, pandas has given me quite a bit of heartache, I am relieved that it is not a mandatory dependency. It is absolutely wonderful as a scripting tool, but integrating it into an application is tricky. It is not 1.0, so its API is not stable, and I have been procrastinating figuring out which pandas version range we support for weeks. It also has NumPy as a mandatory dependency and NumPy ... has also given me difficulty. I ran into quite a few 'binary incompatibility' issues with various ranges of NumPy versions.

You can also check out the pandas label in our issues for the specific issues we had with figuring out pandas relationship with VisiData (and please help us out with them if you thoughts)!

khughitt commented 5 years ago

@saulpw @anjakefala Thank you both for taking the time to share your thoughts on visidata and its relation to pandas. I think those are all completely valid reasons to want to keep it as an optional dependency.

I was also unaware of the loader functionality visidata - that's very useful! I think I'll have to spend some time going through the manual / book to get a better appreciation for the full scope and aims of the project.

Let me just say that Visidata is probably my favorite new software that I've come across over the past year or so, and I'm continually impressed with just how much it can do in a few keystrokes!

Rather than commit to anything up front, think what I will try and do is spend a bit more time looking at the code each time I come across an issue in the future. That way I can familiarize myself with the codebase some. Hopefully at some point in the future then, I can start to contribute more.

For now, I'll leave this issue open to keep track of it / so others can comment if they wish to. Eventually (if no one else has already tackled it and it is still something that you would be interested in incorporating), I'll see if I can take a stab at it.

saulpw commented 4 years ago

Hi @khughitt, we're cleaning our issues list and I'm closing all feature requests. I'd love to see this as a plugin though! Email me or come join us on #visidata on IRC freenode.

skeept commented 4 years ago

I also needed this functionality, I am still stuck on visidata 1.5.2.

This is what I came up with:

Sheet.addCommand('gr', 'guess-types', 'for c in visibleCols: c.type = guessTypeImp(c, visibleRows)')

def guessTypeImp(col, rows):
    if col.type is not anytype:
        return col.type

    max_to_check = 20
    curr_type = None
    for val in itertools.islice((col.getValue(r) for r in rows), max_to_check):
        if not str(val):
            continue
        try:
            fv = float(val)
            is_int = float(int(fv)) == fv
            if is_int:
                if curr_type is None:
                    curr_type = int
            else:
                curr_type = float
        except Exception:
            pass

    if curr_type is None:
        return anytype
    else:
        return curr_type

This works well enough for me because I am only trying to guess int and floats correctly I don't care about the other types.

The way I use this is I type gr when I open visidata and after that I type g_ to adjust the with of all colomns. I tried to following

Sheet.addCommand('gr', 'guess-types', 'for c in visibleCols: c.type = guessTypeImp(c, visibleRows)'; resize-cols-max)

to be able to guess the types and do the resizing at the same time but this doesn't work. Any idea how I can make this work as well with just one keystroke?

Thanks!

skeept commented 4 years ago

As I think about it it would be even better to be able to call these every time I open visidata even without having to call the keystroke, or I would like to try that for a while and see if I like that workflow. Is this possible?

Thank you again.

saulpw commented 4 years ago

Hi @skeept, thanks for the code snippet! As to your issue, in order to run a command, you'd use execCommand("resize-cols-max") inside the execstr. You could also inline the execstr from resize-cols-max in your code, but since it's hard to do that with a single string, you probably want to factor that out into another function and have the execstr call that.

Unfortunately, while you can run sheet-less commands at the beginning of a session with --preplay, there's no way to run commands automatically at startup. There is probably some creative and/or hacky solution but I can't think of it off the top of my head. It's probably a good candidate for a wishlist item. Thanks for the suggestion!

skeept commented 4 years ago

Hi @saulpw thanks for the suggestion. Using excecCommand("resize-cols-max") actually seems like a nice thing. But I tried with


Sheet.addCommand('gr', 'guess-types', 'for c in visibleCols: c.type = guessType(c, visibleRows); execCommand("resize-cols-max")')

but got the following error:

NameError: name 'execCommand' is not defined

Any idea how I can solve this?

saulpw commented 4 years ago

This is on 2.x. I think it is exec_command or exec_keystrokes on 1.5.2 (but may not work quite the same way). Also I don't think you can have a one-liner for-loop like that? But if it works it works.

skeept commented 4 years ago

Thanks @saulpw .

exec_keystrokes works. Bellow is the code I ended up with. Look forward for release of 2.x.

# pre-2.0: Sheet.addCommand('gr', 'guess-types', 'guessType(visibleCols, visibleRows); exec_keystrokes("resize-cols-max")')
Sheet.addCommand('gr', 'guess-types', 'guessType(visibleCols, visibleRows); execCommand("resize-cols-max")')

def guessType(cols, rows):
    for col in cols:
        if col.type is not anytype:
            continue

        max_to_check = 20
        current_type = None

        for val in itertools.islice((col.getValue(r) for r in rows), max_to_check):
            if not str(val):
                continue
            try:
                fv = float(val)
                is_int = float(int(fv)) == fv
                if is_int and current_type is None:
                    current_type = int
                else:
                    current_type = float
                    break
            except Exception:
                pass
        col.type = anytype if current_type is None else current_type
cool-RR commented 1 year ago

Here's the latest version of my autotune, which includes, among other things, column-type guessing: https://gist.github.com/cool-RR/47d8ebb7afc636a995fccf889aada8e9

I've got code in my .visidatarc that automatically invokes it on all sheets so I very rarely have to deal with untyped columns.

This feature is so important. If I'm showing VisiData to people and they don't have my .visidatarc, I'd be embarrassed that they don't get their column types guessed automatically.