harelba / q

q - Run SQL directly on delimited files and multi-file sqlite databases
http://harelba.github.io/q/
GNU General Public License v3.0
10.19k stars 421 forks source link

Add regex replace custom function #270

Open JamezQ opened 3 years ago

JamezQ commented 3 years ago

This feature allows q to quickly get a new column based on a regex match of another column. And even execute a GROUP BY using this new column.

Example: Find the disk usage by extension for the current directory:

du -sk *|grep "\."|q -b -t 'SELECT resub(c2,".*\.","") as ex,SUM(c1) FROM - GROUP BY ex ORDER BY SUM(c1)'

This is quite nice compared to other answers in: https://unix.stackexchange.com/questions/308846/how-to-find-total-filesize-grouped-by-extension (Slightly modified for only looking in the current dir)

find . -type f -depth 1 | egrep -o "\.[a-zA-Z0-9]+$" | sort -u|xargs -I '%' find . -type f -name "*%" -depth 1 -exec du -ch {} + -exec echo % \;|egrep "^\.[a-zA-Z0-9]+$|total$"|uniq|paste - -

find . -name '?*.*' -type f -depth 1 -print0 |                                                                                                                                                  
  perl -0ne '
    if (@s = stat$_){
      ($ext = $_) =~ s/.*\.//s;
      $s{$ext} += $s[12];
      $n{$ext}++;
    }
    END {
      for (sort{$s{$a} <=> $s{$b}} keys %s) {
        printf "%15d %4d %s\n",  $s{$_}<<9, $n{$_}, $_;
      }
    }'

Of course, it is possible to use q without this feature quite succinctly as well:

du -sk *|grep "\."|sed 's/\t.*\.\(.*\)$/ \1/g'|q -b 'SELECT c2,SUM(c1) FROM - GROUP BY c2 ORDER BY SUM(c1)'

However, as the capture groups become more complex, getting multiple captures from the same column, etc, I believe resub becomes much more useful.

As another example, I have already used this modified q to analyze a log file containing user agents, and creating a summary of the most used OS/DEVICE combination. (Simply group by both columns and count) This involved two extractions from the same field to acquire the OS and device. (Using capture groups feature of re.sub)

Finally, resub gets to utilize the existing row and column parsing q has, this can be complicated if using sed (commas in quoted fields).

JamezQ commented 3 years ago

This passes make test.

harelba commented 2 years ago

Hi,

this is a good addition, although I'm aiming to add a full set of regexp functions, including optimizing the pattern compilation. Using re.sub or any other re method directly compiles the pattern on every call to the function, which is extremely slow.

Regarding filename handling, I believe that it's important enough to have its own functions. regexp parsing of filenames can be error prone, and lead to inconsistencies, such as what happens if there is no extension for a filename.

Now that q 3.1.0-beta is merged into master, it's possible to create new PRs. I've created a PR (#282) for adding multiple requested user functions, and added several filename-parsing functions. Achieving the per-ext total sizes would then be something like that:

$ du -ks * | q -b -t "select file_ext(c2),sum(c1) from - group by 1 order by 2"
.bzl        4
.ini        4
.markdown   8
.py         8
.sh         12
.txt        12
.egg-info   24
.md         28
.alal       29948
.qsql       44068
            288636

I'll add the set of regexp functions on that PR as well.