Closed khughitt closed 4 years ago
Hi @khughitt, you could make a command that looped over the columns and guessed the type for each one that is anytype
. Like (this is untested but it should basically work):
Sheet.addCommand('gz~', 'guess-types', 'for c in visibleCols: if c.type is anytype: c.type = c.guessType(visibleRows)')
@Column.api
def guessType(col, rows):
vals = [ col.getValue(r) for r in rows ]
try:
if int(vals[0]):
return int
...
except Exception:
return anytype
gz~
is probably the wrong keybinding (z~
already sets type to anytype
), but this should be a reasonable structure for a type-guessing plugin. Though I realize this still leaves you with the difficult task of figuring out the heuristics for column guessing, which I have been unable to do sufficiently to my satisfaction (and thus why VisiData does not try :). If you are willing to port or otherwise use the pandas code, I think it would be great to have it as a plugin!
@saulpw Thanks for the suggestion! I checked out the Pandas code for type inference, and it looks like it is written in Cython / wouldn't be so straight-forward to port.
Out of curiosity, have you considered using Pandas in visidata, and if so, did you rule it out simply to keep the dependencies minimal? It seems like it could be really useful to use as an internal data-model for visidata, considering that this is exactly the kind of data it is geared primarily towards and the large amount of effort and testing that have gone into the project. Also seems like it would open up visidata to some interesting plugin opportunities. Are there any scenarios where the extra overhead / dependencies would be prohibitive for visidata use?
There is a pandas loader, which is what we use to load some of the more obscure formats. @azjps did some good work to use pandas to compute frequency tables on pandas-based sheets.
Yes, I wanted to avoid the Pandas dependency for core VisiData (esp as it's quite large, and not pure python). However the main reason was that, while Pandas can do a lot, it also imposes a bunch of limitations. VisiData rows can be a collection of literally any Python object, and that's where a lot of its internal power and flexibility come from. If we based the internal data model on Pandas, then we couldn't easily make e.g. the Columns Sheet work, or in fact even make something like the Pandas wrapper (which at its core is quite simple).
@khughitt I can answer a few questions and share some thoughts!
We have an optional pandas loader. =) A loader is essentially a module which directs how VisiData structures and engages with a particular data source. The code is located here and here.
To load a file format which is supported by pandas, execute vd -f pandas data.foo
. This will call pandas.read_foo()
.
For example:
vd -f pandas data.parquet
loads a parquet file.
It is currently not working on the develop
branch, but you can check it out on stable
! And if you want to help us develop it further, we would definitely welcome the help! We want to give the pandas loader more love, but have not allocated time yet.
That being said, pandas is an optional dependency and the pandas loader is optional. I will let @saulpw share his personal feelings on what the 'data model' is in VisiData and why it was so important to him to keep mandatory dependencies to the minimal. You can flip through some currently available sections of the book for a hint of his design philosophies.
For me personally as someone who packages VisiData and keeps an eye on its dependencies, pandas has given me quite a bit of heartache, I am relieved that it is not a mandatory dependency. It is absolutely wonderful as a scripting tool, but integrating it into an application is tricky. It is not 1.0, so its API is not stable, and I have been procrastinating figuring out which pandas version range we support for weeks. It also has NumPy as a mandatory dependency and NumPy ... has also given me difficulty. I ran into quite a few 'binary incompatibility' issues with various ranges of NumPy versions.
You can also check out the pandas label in our issues for the specific issues we had with figuring out pandas relationship with VisiData (and please help us out with them if you thoughts)!
@saulpw @anjakefala Thank you both for taking the time to share your thoughts on visidata and its relation to pandas. I think those are all completely valid reasons to want to keep it as an optional dependency.
I was also unaware of the loader functionality visidata - that's very useful! I think I'll have to spend some time going through the manual / book to get a better appreciation for the full scope and aims of the project.
Let me just say that Visidata is probably my favorite new software that I've come across over the past year or so, and I'm continually impressed with just how much it can do in a few keystrokes!
Rather than commit to anything up front, think what I will try and do is spend a bit more time looking at the code each time I come across an issue in the future. That way I can familiarize myself with the codebase some. Hopefully at some point in the future then, I can start to contribute more.
For now, I'll leave this issue open to keep track of it / so others can comment if they wish to. Eventually (if no one else has already tackled it and it is still something that you would be interested in incorporating), I'll see if I can take a stab at it.
Hi @khughitt, we're cleaning our issues list and I'm closing all feature requests. I'd love to see this as a plugin though! Email me or come join us on #visidata on IRC freenode.
I also needed this functionality, I am still stuck on visidata 1.5.2.
This is what I came up with:
Sheet.addCommand('gr', 'guess-types', 'for c in visibleCols: c.type = guessTypeImp(c, visibleRows)')
def guessTypeImp(col, rows):
if col.type is not anytype:
return col.type
max_to_check = 20
curr_type = None
for val in itertools.islice((col.getValue(r) for r in rows), max_to_check):
if not str(val):
continue
try:
fv = float(val)
is_int = float(int(fv)) == fv
if is_int:
if curr_type is None:
curr_type = int
else:
curr_type = float
except Exception:
pass
if curr_type is None:
return anytype
else:
return curr_type
This works well enough for me because I am only trying to guess int
and floats
correctly I don't care about the other types.
The way I use this is I type gr
when I open visidata and after that I type g_
to adjust the with of all colomns.
I tried to following
Sheet.addCommand('gr', 'guess-types', 'for c in visibleCols: c.type = guessTypeImp(c, visibleRows)'; resize-cols-max)
to be able to guess the types and do the resizing at the same time but this doesn't work. Any idea how I can make this work as well with just one keystroke?
Thanks!
As I think about it it would be even better to be able to call these every time I open visidata even without having to call the keystroke, or I would like to try that for a while and see if I like that workflow. Is this possible?
Thank you again.
Hi @skeept, thanks for the code snippet! As to your issue, in order to run a command, you'd use execCommand("resize-cols-max")
inside the execstr. You could also inline the execstr from resize-cols-max in your code, but since it's hard to do that with a single string, you probably want to factor that out into another function and have the execstr call that.
Unfortunately, while you can run sheet-less commands at the beginning of a session with --preplay
, there's no way to run commands automatically at startup. There is probably some creative and/or hacky solution but I can't think of it off the top of my head. It's probably a good candidate for a wishlist item. Thanks for the suggestion!
Hi @saulpw thanks for the suggestion. Using excecCommand("resize-cols-max")
actually seems like a nice thing.
But I tried with
Sheet.addCommand('gr', 'guess-types', 'for c in visibleCols: c.type = guessType(c, visibleRows); execCommand("resize-cols-max")')
but got the following error:
NameError: name 'execCommand' is not defined
Any idea how I can solve this?
This is on 2.x. I think it is exec_command
or exec_keystrokes
on 1.5.2 (but may not work quite the same way). Also I don't think you can have a one-liner for-loop like that? But if it works it works.
Thanks @saulpw .
exec_keystrokes
works. Bellow is the code I ended up with. Look forward for release of 2.x
.
# pre-2.0: Sheet.addCommand('gr', 'guess-types', 'guessType(visibleCols, visibleRows); exec_keystrokes("resize-cols-max")')
Sheet.addCommand('gr', 'guess-types', 'guessType(visibleCols, visibleRows); execCommand("resize-cols-max")')
def guessType(cols, rows):
for col in cols:
if col.type is not anytype:
continue
max_to_check = 20
current_type = None
for val in itertools.islice((col.getValue(r) for r in rows), max_to_check):
if not str(val):
continue
try:
fv = float(val)
is_int = float(int(fv)) == fv
if is_int and current_type is None:
current_type = int
else:
current_type = float
break
except Exception:
pass
col.type = anytype if current_type is None else current_type
Here's the latest version of my autotune
, which includes, among other things, column-type guessing: https://gist.github.com/cool-RR/47d8ebb7afc636a995fccf889aada8e9
I've got code in my .visidatarc
that automatically invokes it on all sheets so I very rarely have to deal with untyped columns.
This feature is so important. If I'm showing VisiData to people and they don't have my .visidatarc
, I'd be embarrassed that they don't get their column types guessed automatically.
Small description
Although I see that visidata intentionally does not guess at column types, I find this functionality (e.g. in pandas/readr/etc.) quite convenient, even if it's not 100% perfect.
Is there any possibility that such functionality could be added, perhaps disabled by default if you wish to preserve the current approach? Although pandas is not a dependency, the logic for such type guessing could still be borrowed from their codebase to save time.