harelba / q

q - Run SQL directly on delimited files and multi-file sqlite databases
http://harelba.github.io/q/
GNU General Public License v3.0
10.15k stars · 419 forks

Documentation Request: Please list encodings supported #248

Open contentfree opened 3 years ago

contentfree commented 3 years ago

Please add documentation about the encodings supported. I tried iso-8859 and was given Encoding iso-8859 could not be found. What is the list of encodings?

harelba commented 3 years ago

The encodings supported are all the encodings that Python supports, but I agree, q should be able to show which ones are available.

I'll add it to the next release.

In the meantime, here's a list of the supported encodings:

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp273',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1125',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_t',
 'koi8_u',
 'kz1048',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']
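
For reference, a list like the one above can be derived from the Python standard library itself. Python has no official "enumerate all codecs" API, but the stdlib aliases table covers the built-in codecs, so this sketch is one reasonable approximation:

```python
# Sketch: derive the set of built-in codec names from the stdlib
# aliases table (maps alias -> canonical codec name). This is not
# an official enumeration API, just a practical approximation.
import encodings.aliases

available = sorted(set(encodings.aliases.aliases.values()))
print(len(available), available[:5])
```
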
pkoppstein commented 3 years ago

In the section on encoding, the online q documentation page advises:

Please contact me if you encounter any issues and I'd be glad to help.

Since the issues I ran into are closely related to this particular Issue, I thought it might be better to add a comment rather than create a new issue.

In a nutshell, the following boils down to three requests:

(1) Documentation clarification:

The documentation for -e says:

"Input file encoding. Defaults to UTF-8. set to none for not setting any encoding - faster, but at your own risk..."

So I tried:

q -e none

but got: LookupError('unknown encoding: none')

Similarly, using q --encoding= yields

"Encoding could not be found"

(2) Addition of "ISO-8859" as a supported encoding.

The reason is simply that file reports "ISO-8859" in some cases, but using -e ISO-8859, q complains:

Encoding ISO-8859 could not be found

(3) More apt error message

With my ISO-8859 file, running q without any encoding option yields the not-so-helpful message:

AttributeError("'UnicodeDecodeError' object has no attribute 'message'")

Thanks!


harelba commented 3 years ago

These are interesting issues, @pkoppstein. Thanks for reporting them.

I'll answer one by one:

  1. Which Python version are you using? I believe that the none encoding is a leftover from Python 2 and is no longer supported. I'll add some changes to reflect that, both in the code and in the docs.
  2. It seems that the Python encoding for iso-8859 is iso-8859-1. I'll add some code to alias it; that could be helpful for others as well.
  3. This is obviously a bug. I'll open a new ticket for it and fix it.

As I'm in the middle of a huge change in q, I'll try to add the non-trivial parts to it, but if something proves to be more elaborate, I'll fix it in the next release.
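
The aliasing idea in point 2 can be sketched with the standard codecs machinery. This is a hypothetical illustration (not the actual q patch): register a search function that resolves the bare name iso-8859 to the built-in latin-1 codec.

```python
# Hypothetical sketch: alias "iso-8859" to latin-1 via the codecs
# registry. Not the actual fix in q, just the mechanism.
import codecs

def _iso8859_search(name):
    # Search functions receive a normalized, lowercased name.
    if name.replace('-', '').replace('_', '') == 'iso8859':
        return codecs.lookup('iso-8859-1')
    return None  # defer to the other registered search functions

codecs.register(_iso8859_search)

# After registration, the bare name decodes like latin-1:
print(b'caf\xe9'.decode('iso-8859'))
```
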

pkoppstein commented 3 years ago

Which python version are you using?

$ q -v
q version 2.0.19
Python: 3.9.4 (default, Apr  5 2021, 01:47:16) // [Clang 11.0.0 (clang-1100.0.33.17)]

iso-8859-1

Yes, q accepts iso8859 and iso-8859-1, but since file reports ISO-8859, "Postel's Law" would be applicable :-)

pkoppstein commented 3 years ago

i'm in the middle of a huge change in q

This sounds like very good news, but I am also wondering whether your evidently enormous talents aren't slightly misdirected. My concerns are inter-related but it might be useful to separate them:

1) q is built on feet of clay (Python); as long as that is true, sqlite3 (and perhaps other alternatives) will be significantly faster.

2) sqlite3 seems to do just about everything that q can do, but faster, and of course, with support for database updates, and much more.

3) the q documentation mentions that it is an "in-memory" database, which is a potentially significant concern, especially in this age of "big data".

No doubt your work in progress will address some of these issues, but that raises another point -- "long-term support". Given that sqlite3 is bundled with everything from Android to Windows 10, and that the SQLite format is a "Recommended Format" of the Library of Congress, sqlite's long-term existence seems on reasonably firm foundations.

It seems to me that one important weakness of sqlite3 (the program, not the data format) is that it's written in C. Have you given any thought to rewriting it in Rust? See e.g. https://github.com/joaoh82/rust_sqlite

bitti commented 3 years ago

Isn't that critique rather misdirected? To my understanding, q is built on sqlite3; the python part is "merely" a wrapper. At the same time, q doesn't try to do everything sqlite3 does, because q is about text files, but being based on sqlite3 it will be "in-memory" by nature (even if it were to add a feature to persist to a file). Of course one shouldn't process TB of data with q, but I found it a pretty good post-processing tool for mapreduce results. Use big data tools for big data; use q and friends for smaller things like postprocessing.

pkoppstein commented 3 years ago

@bitti - Your remarks strongly suggest you are not sufficiently familiar with the sqlite3 "command line shell" (https://sqlite.org/cli.html), and in particular how (extremely) easy it is to use to process CSV/TSV files.

Suffice it to say here that if q's implementation were based on the sqlite3 implementation, (a) it would not have the (nicely documented) severe restrictions that it does have; (b) it would not be so much slower than sqlite3 (see below), and (c) it would surely not require a major effort for q to support a disk-based store.

Regarding (b), here are some comparable "u+s" numbers for a simple query against a modestly large TSV file:

q       1.38s
sqlite3 0.85s
awk     0.13s

The awk timing is obviously not based on an actual SQL query.

harelba commented 3 years ago

@pkoppstein Thanks for your comments and kind words. I'll share my view on this here.

q's goal from my perspective is to provide a "product" layer above the amazing sqlite database. As such, it tries to provide ease-of-use, seamless integration with other command line tools and concepts, and freeing the user from dealing with data loading and schema management. It is not meant to replace distributed query engines for big data, which are awesome, but usually require a lot of setup and tuning in order to provide consistent value.

There are obviously many cases where using sqlite directly might be more helpful and useful (as is using other command line tools). However, in many cases, dealing with data loading and data types is cumbersome and does not provide a fully-seamless experience.

As to my current efforts, I don't want to spoil the surprise, but they are strategic and directly related to bridging the q/sqlite gap and the concerns you are mentioning. The effort is mostly done, and will probably be out in 2-3 weeks, and I believe it could be a game-changer in many aspects.

pkoppstein commented 3 years ago

@harelba - I'm glad you did not misunderstand my remarks, though I am now doubly concerned as it seems you too might not fully appreciate the brilliance of sqlite3. Let me therefore illustrate how easy it is to use the sqlite3 shell to execute an arbitrary SQL command supported by sqlite3, against a CSV file with headers (say, "v1,v2,...."):

$ sqlite3 <<< '
.mode csv
.import myfile.csv test
select count(*) from test where v1 = 'MD2003L031'
'

It would obviously be trivial to write a simple (bash) wrapper to enable one to run an arbitrary query at the command line in the manner of q, essentially by prepending the necessary sqlite3 directives. E.g., to avoid any memory issues, you could just add an .open directive before .import.
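
The same wrapper idea can be sketched in Python with the stdlib sqlite3 module: load a headered CSV into an in-memory database, then run an arbitrary query against it. The function name query_csv, the table name "t", and the file/column names are illustrative, not part of q or sqlite3.

```python
# Sketch of the "prepend the loading directives" wrapper idea:
# import a headered CSV into in-memory SQLite, then run one query.
# query_csv and the default table name "t" are illustrative.
import csv
import sqlite3

def query_csv(csv_path, sql, table="t"):
    con = sqlite3.connect(":memory:")
    with open(csv_path, newline="") as f:
        rows = csv.reader(f)
        header = next(rows)                      # first row holds column names
        cols = ", ".join('"%s"' % c for c in header)
        marks = ", ".join("?" for _ in header)
        con.execute('CREATE TABLE "%s" (%s)' % (table, cols))
        con.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), rows)
    return con.execute(sql).fetchall()
```

To sidestep memory limits, sqlite3.connect could point at a file instead of ":memory:", mirroring the .open suggestion above.
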

Perhaps something like q's support for multiple encodings simultaneously gives it a niche, but I would nevertheless venture that building such added convenience on top of sqlite3 would be vastly simpler, significantly faster, and far more maintainable.

In short - Am I missing something?

harelba commented 3 years ago

As I mentioned above, I believe sqlite is an amazing database, and that includes its CLI. I haven't made a comprehensive comparison between them, but I believe each has its pros and cons, based on the use-case at hand.

Regardless, as my current effort is directly related to that subject, I would be very glad to discuss this further and get your view on this once I release the new version.