saulpw / visidata

A terminal spreadsheet multitool for discovering and arranging data
http://visidata.org
GNU General Public License v3.0

Review how we configure which format to save in #2286

Open reagle opened 8 months ago

reagle commented 8 months ago

Two stories about how I could use more guidance or guard rails when saving work. Presently, I have to look up and refer to the supported formats, and then my choices often don't work.

Usenet

VisiData helped me find Elizabeth Edwards' (famous) participation on Usenet's alt.support.grief; vd can read the Internet Archive's mbox format and make quick work of searching.

Saving the derivative sheet is tricky, though. vd defaults to tsv (even if I give the mbox extension), but there is no mbox save support, so I no longer know what the resulting file format is, and I don't think vd does either when I return to the file. (I can save to csv, which is okay, but the result has some odd character conversions.)

Reddit

I'm analyzing posts on a subreddit, which are in a "zstandard compressed ndjson" file. vd opens it well, but after some manipulations I want to save it so I can return to the data as is, so vds seems like a natural format. And it works! However, I then thought: why not save it compressed? The resulting file, BestofRedditorUpdates_submissions.vds.zst, is smaller but cannot be reopened: "Unsupported operation: Underlying stream is not seekable."

-rw-r--r-- 1 reagle staff  77M Feb  1 16:09 BestofRedditorUpdates_submissions.vds
-rw-r--r-- 1 reagle staff  17M Feb  1 16:09 BestofRedditorUpdates_submissions.vds.zst
-rw-r--r-- 1 reagle staff  14M Feb  1 15:45 BestofRedditorUpdates_submissions.zst

Consequently, relying on the file extension is problematic outside of the simplest cases because:

  1. When saving, the file extension might not trigger a format (e.g., supported on read but not supported on write).
  2. It's not clear if multiple extensions work (i.e., format+compression).
  3. In the moment, I'm not sure what formats are available.

midichef commented 8 months ago

To answer your first question, I went through the source and looked for def open_*() and def save_*(). Here's a list of every extension/filetype that visidata can both read and write:

arrow         gsheets       npy           tsv           xd
arrows        html          org           txt           xls
csv           jrnl          parquet       usv           xlsx
dta           jsonl         png           vdj           xml
fixed         jsonla        rec           vds           zip
geojson       lsv           sqlite        vdx

And here's what it can read, but not write:

airtable      forg          mh            pdf           toml
babyl         frictionless  mmdf          puz           ttf
bytes         gdrive        mnu           pyprof        vcf
conll         git           npz           reddit        vd
conllu        h5            ods           sas7bdat      xlsb
eml           jsonobj       orgdir        scrape        xpt
f5log         maildir       pandas        shp           yml
fdir          mbox          pbf           spss          zulip
fec           mbtiles       pcap          tar

And there seem to be a few it can write but not read: dot and svg. There are also several extensions that are not exactly full-fledged file types; they are table formats from the tabulate library, used for table files (see loaders/texttable.py). These include jira, md, table (and more), which it can write but not read.
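
For anyone who wants to regenerate these lists, the search described above can be approximated with a short script. This is only a sketch, assuming that loaders and savers are plain functions named open_<format> and save_<format>; the helper names scan_formats and classify are made up for illustration, and a name-based scan may pick up unrelated functions or miss formats registered some other way.

```python
import re
from pathlib import Path

# Matches definitions such as "def open_csv(" or "def save_tsv(".
FUNC_RE = re.compile(r"^\s*def (open|save)_(\w+)\s*\(", re.MULTILINE)


def scan_formats(source_dir):
    """Return (readable, writable) sets of format names found under source_dir."""
    readable, writable = set(), set()
    for path in Path(source_dir).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="replace")
        for kind, name in FUNC_RE.findall(text):
            (readable if kind == "open" else writable).add(name)
    return readable, writable


def classify(readable, writable):
    """Group format names the same way as the lists in this comment."""
    return {
        "read+write": sorted(readable & writable),
        "read only": sorted(readable - writable),
        "write only": sorted(writable - readable),
    }

# Example (the path is a placeholder for a local checkout):
# print(classify(*scan_formats("path/to/visidata")))
```

The set arithmetic is the whole trick: the intersection gives read+write formats, and the two differences give the read-only and write-only ones.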

Where is a good place you'd like to see this information? It could go in a table like https://visidata.org/docs/formats/, perhaps in one of the guides? Accessible by a command like open-format-guide?

saulpw commented 8 months ago

It's not clear if multiple extensions work (i.e., format+compression).

This does not work currently, but it's on my wishlist too. I'd be interested in a PR that addressed this.

saulpw commented 8 months ago

the resulting file BestofRedditorUpdates_submissions.vds.zst is smaller, but cannot be reopened: "Unsupported operation: Underlying stream is not seekable."

Can the file be decompressed manually and then opened as .vds? If so, then it's likely a bug in the vds loader (otherwise it's a bug in the vds saver). This is a bug either way though.

Also I would support vdz as an alias for vds+zstd when that's possible.
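
To make the format+compression idea and the vdz alias concrete, here is a rough sketch of how a filename could be split into a format and an optional compression. Nothing here is VisiData's actual code: split_extensions, COMPRESSIONS, and ALIASES are hypothetical names, the set of compression suffixes is an assumption, and the tsv fallback only mirrors the default mentioned earlier in this thread.

```python
from pathlib import Path

# Hypothetical tables for illustration; neither exists in VisiData.
COMPRESSIONS = {"gz", "bz2", "xz", "zst", "zstd"}
ALIASES = {"vdz": ("vds", "zst")}  # single extension -> (format, compression)


def split_extensions(filename, default_format="tsv"):
    """Guess (format, compression) from the trailing extensions of filename."""
    exts = [s.lstrip(".").lower() for s in Path(filename).suffixes]
    compression = None
    # A trailing compression suffix is peeled off first.
    if exts and exts[-1] in COMPRESSIONS:
        compression = exts.pop()
    # Whatever extension remains names the format.
    fmt = exts[-1] if exts else default_format
    # An alias expands to a format plus compression in one step.
    if compression is None and fmt in ALIASES:
        fmt, compression = ALIASES[fmt]
    return fmt, compression


print(split_extensions("BestofRedditorUpdates_submissions.vds.zst"))
print(split_extensions("posts.vdz"))
```

With a split like this, a saver could also refuse a combination it cannot write instead of silently falling back to another format, though whether that is the right behavior is a design question for the PR.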

reagle commented 8 months ago

I'm not sure what the best way to do this is, but some thoughts:

reagle commented 8 months ago

@saulpw I loaded the vds file and saved it as BestofRedditorUpdates_submissions.vds.zstd, which is smaller, but I am unable to decompress it manually.

❯ zstd --decompress BestofRedditorUpdates_submissions.vds.zstd
zstd: BestofRedditorUpdates_submissions.vds already exists; overwrite (y/n) ? y
zstd: BestofRedditorUpdates_submissions.vds.zstd: unsupported format
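
One way to narrow this down is to look at the first bytes of the saved file and compare them with the well-known magic numbers of common compression formats. The sketch below is a generic check, not part of VisiData; sniff_compression is a hypothetical helper, and it only recognizes regular frames (a zstd file that starts with a skippable frame would not match).

```python
# Leading "magic" bytes of common compressed formats, as they appear on disk.
MAGIC = {
    b"\x28\xb5\x2f\xfd": "zstd",   # 0xFD2FB528 stored little-endian
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}


def sniff_compression(path):
    """Return the compression name suggested by the file header, or None."""
    with open(path, "rb") as f:
        head = f.read(6)  # the longest signature above is 6 bytes
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return None

# Example (file name taken from the listing above):
# print(sniff_compression("BestofRedditorUpdates_submissions.vds.zstd"))
```

If this reports something other than zstd, or None, the file was probably not written as a standard zstd stream, which would point at the saver rather than the loader. That is an inference from the header alone, not a confirmed diagnosis.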