lmorg / murex

A smarter shell and scripting environment with advanced features designed for usability, safety and productivity (eg smarter DevOps tooling)
https://murex.rocks
GNU General Public License v2.0

String manipulation builtins #670

Open orefalo opened 1 year ago

orefalo commented 1 year ago

Describe the problem: Murex needs built-in string manipulation functions.

Why? Because no two Unix systems are the same, and built-ins are not only fast and convenient but also ensure compatibility.

Documentation: For reference, here are the string primitives from fish https://fishshell.com/docs/current/cmds/string.html

lmorg commented 1 year ago

Were there any specific ones you've missed? I ask because some of these exist already:

string collect [-a | --allow-empty] [-N | --no-trim-newlines] [STRING ...]

That's Murex's default behaviour

string escape [-n | --no-quoted] [--style=] [STRING ...]
string unescape [--style=] [STRING ...]
» builtins -> regexp m/^!?esc/
[
    "!escape",
    "!eschtml",
    "!escurl",
    "escape",
    "esccli",
    "eschtml",
    "escurl"
]

The escaping builtins are the ones without the bang prefix, whereas the unescaping builtins are the ones with the bang.

string join [-q | --quiet] [-n | --no-empty] SEP [STRING ...]
string join0 [-q | --quiet] [STRING ...]

This doesn't currently exist but it's a good suggestion

string length [-q | --quiet] [STRING ...]
» %[one two three]
[
    "one",
    "two",
    "three"
]

» %[one two three] -> count
3

(There's also an alias len, for backwards compatibility with much older versions of Murex)

string lower [-q | --quiet] [STRING ...]
string upper [-q | --quiet] [STRING ...]

These don't currently exist but it's a good suggestion.

string match [-a | --all] [-e | --entire] [-i | --ignore-case]
             [-g | --groups-only] [-r | --regex] [-n | --index]
             [-q | --quiet] [-v | --invert]
             PATTERN [STRING ...]
string pad [-r | --right] [(-c | --char) CHAR] [(-w | --width) INTEGER]
           [STRING ...]
string repeat [(-n | --count) COUNT] [(-m | --max) MAX] [-N | --no-newline]
             [-q | --quiet] [STRING ...]

These don't currently exist but they're good suggestions too.

string replace [-a | --all] [-f | --filter] [-i | --ignore-case]
               [-r | --regex] [-q | --quiet] PATTERN REPLACE [STRING ...]
string shorten [(-c | --char) CHARS] [(-m | --max) INTEGER]
               [-N | --no-newline] [-l | --left] [-q | --quiet] [STRING ...]

right kind of does this. It doesn't add the ellipsis nor check for wide characters though. Maybe there is a case for a flag that would check character width instead of number of characters?

string split [(-f | --fields) FIELDS] [(-m | --max) MAX] [-n | --no-empty]
             [-q | --quiet] [-r | --right] SEP [STRING ...]
string split0 [(-f | --fields) FIELDS] [(-m | --max) MAX] [-n | --no-empty]
              [-q | --quiet] [-r | --right] [STRING ...]

jsplit does this. It's not as feature rich as this but it supports regexp. Murex's type system also negates some of the need for manual splitting. regexp with the f flag could work here too.

string sub [(-s | --start) START] [(-e | --end) END] [(-l | --length) LENGTH]
           [-q | --quiet] [STRING ...]

left and right are supposed to solve this. However if you want something midway through a string then you have to pipe one into the other...which is a tad verbose :(
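
The piping workaround can be modelled roughly as follows. This is a Python sketch rather than Murex code; it assumes that left N keeps the first N characters and right N keeps the last N, and the helper names `left`, `right` and `substring` are hypothetical.

```python
def left(text: str, n: int) -> str:
    """Keep the first n characters (assumed behaviour of a 'left' builtin)."""
    return text[:n] if n > 0 else ""


def right(text: str, n: int) -> str:
    """Keep the last n characters (assumed behaviour of a 'right' builtin)."""
    return text[-n:] if n > 0 else ""


def substring(text: str, start: int, length: int) -> str:
    """Extract a mid-string slice by chaining left into right."""
    # Clamp the end so the result is still correct when the requested
    # range runs past the end of the string.
    end = min(start + length, len(text))
    return right(left(text, end), end - start)


print(substring("hello world", 3, 4))
```

The two-step chain is what makes the approach verbose: the caller has to work out the end offset for the first step and the length for the second.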

I had thought about creating another data-type called bytes which would basically be a byte array. That way you could index and range over the bytes with []. But it wasn't something I was certain of implementing because it might lead some people to think it was a performant way of handling strings (like in C-like languages) whereas it could actually be a lot slower due to the way Murex generally expects higher-level abstractions. It's something to consider still though.

string trim [-l | --left] [-r | --right] [(-c | --chars) CHARS]
            [-q | --quiet] [STRING ...]

This doesn't exist verbatim but regexp 's/^\s+/' and regexp 's/\s+$/' would work in the meantime. I had given some thought in the past about when to trim things and when not to so it's a little weird I never thought to add this myself.
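
For readers less familiar with the two patterns, the following Python sketch shows what each substitution removes. It uses Python's `re` module purely as an illustration; Murex's regexp builtin may differ in details, and the function names `trim_left`, `trim_right` and `trim` are hypothetical.

```python
import re


def trim_left(text: str) -> str:
    """Remove leading whitespace, like substituting ^\\s+ with nothing."""
    return re.sub(r"^\s+", "", text)


def trim_right(text: str) -> str:
    """Remove trailing whitespace, like substituting \\s+$ with nothing."""
    return re.sub(r"\s+$", "", text)


def trim(text: str) -> str:
    """Apply both substitutions, one after the other."""
    return trim_right(trim_left(text))


print(repr(trim_left("  abc  ")))
print(repr(trim_right("  abc  ")))
print(repr(trim("  abc  ")))
```

Trimming both ends takes two passes with these patterns, which is part of the case for a dedicated builtin.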


I also need to give some thought about how to make these builtins more discoverable, and thus also how newer builtins should be named. At present there are dozens of commands in the root namespace and it's not obvious what is available (unlike Fish, which has a string builtin with a lot of functionality grouped inside it).

orefalo commented 1 year ago

ah thank you. must have missed it.

orefalo commented 1 year ago

Reopening this issue - I am interested in match or regexps

$ vm_stat
Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                              110207.
Pages active:                            856642.
Pages inactive:                          817294.
Pages speculative:                        39658.
Pages throttled:                              0.
Pages wired down:                        201646.
Pages purgeable:                          38864.
"Translation faults":                 584509074.
Pages copy-on-write:                   17797467.
Pages zero filled:                    356362396.
Pages reactivated:                      3156317.
Pages purged:                           1813704.
File-backed pages:                       574258.
Anonymous pages:                        1139336.
Pages stored in compressor:              165434.
Pages occupied by compressor:             16784.
Decompressions:                          697114.
Compressions:                          13578033.
Pageins:                               16590981.
Pageouts:                                 63961.
Swapins:                                 304621.
Swapouts:                               5507225.

$ vm_stat | grep -o -E '[0-9]+'
16384
113692
854433
815431
39476
0
202317
42705
584370609
17793166
356274061
3156084
1813659
574031
1135309
165436
16784
697112
13578033
16590820
63961
304621
5507225

Now I am trying the same as above with the regexp built-in:


murex-utils » vm_stat | regexp 'm/[0-9]+/'
Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                              108686.
Pages active:                            858720.
Pages inactive:                          817011.
Pages speculative:                        40710.
Pages throttled:                              0.
Pages wired down:                        200165.
Pages purgeable:                          37305.
"Translation faults":                 584860711.
Pages copy-on-write:                   17815427.
Pages zero filled:                    356571019.
Pages reactivated:                      3156647.
Pages purged:                           1814793.
File-backed pages:                       575200.
Anonymous pages:                        1141241.
Pages stored in compressor:              165404.
Pages occupied by compressor:             16788.
Decompressions:                          697144.
Compressions:                          13578033.
Pageins:                               16591801.
Pageouts:                                 63961.
Swapins:                                 304629.
Swapouts:                               5507225.
murex-utils » vm_stat | regexp 'f/[0-9]+/'
murex-utils » vm_stat | regexp 's/[0-9]+/'
Mach Virtual Memory Statistics: (page size of  bytes)
Pages free:                              .
Pages active:                            .
Pages inactive:                          .
Pages speculative:                        .
Pages throttled:                              .
Pages wired down:                        .
Pages purgeable:                          .
"Translation faults":                 .
Pages copy-on-write:                   .
Pages zero filled:                    .
Pages reactivated:                      .
Pages purged:                           .
File-backed pages:                       .
Anonymous pages:                        .
Pages stored in compressor:              .
Pages occupied by compressor:             .
Decompressions:                          .
Compressions:                          .
Pageins:                               .
Pageouts:                                 .
Swapins:                                 .
Swapouts:                               .
murex-utils »

What am I doing wrong? Why would the `m` return the full line and not just the matches?

lmorg commented 1 year ago

m// returns the lines that match; f// returns the found strings.

You might need to wrap your regex in parentheses for f// to work (I can't recall if I solved that requirement or not).

So for your case, you would need f rather than m (f is like grep -o).
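
The three modes discussed here can be modelled with a short Python sketch. It is an analogy only, not Murex's implementation: the function names are hypothetical, the sample is an abridged copy of the vm_stat output above with the column padding collapsed, and `find_strings` models grep -o by returning the whole match rather than a capture group.

```python
import re

# Abridged vm_stat output from the thread (column padding collapsed).
sample = (
    "Mach Virtual Memory Statistics: (page size of 16384 bytes)\n"
    "Pages free: 110207.\n"
    "Pages throttled: 0."
)


def match_lines(pattern: str, text: str) -> list[str]:
    """m-like: keep whole lines that contain a match (similar to grep)."""
    return [line for line in text.splitlines() if re.search(pattern, line)]


def find_strings(pattern: str, text: str) -> list[str]:
    """f-like: return only the matched text (similar to grep -o)."""
    return [
        m.group(0)
        for line in text.splitlines()
        for m in re.finditer(pattern, line)
    ]


def substitute(pattern: str, replacement: str, text: str) -> list[str]:
    """s-like: replace every match on each line."""
    return [re.sub(pattern, replacement, line) for line in text.splitlines()]


print(match_lines(r"[0-9]+", sample))     # every line contains a number
print(find_strings(r"[0-9]+", sample))    # only the numbers
print(substitute(r"[0-9]+", "", sample))  # the lines with numbers removed
```

Because every line of the vm_stat output contains at least one number, the line-matching mode returns the input unchanged, which is what the m// run above shows.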

orefalo commented 1 year ago

$ vm_stat | regexp 'f/([0-9])+/'
4
7
0
9
1
0
4
2
0
8
5
8
7
5
5
2
2
3
3
8
1
2
5

orefalo commented 1 year ago

I tried pretty much all options - will look in the code a little later

orefalo commented 1 year ago

got it!

vm_stat -> regexp   'f/([0-9]+)/'
16384
105049
862046
817713
43910
0
196640
33152
594547466
18024595
364657929
3163876
1835712
567255
1156414
164454
16844
698081
13578033
16611886
63961
304736
5507225
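
The difference between the two attempts comes down to where the quantifier sits relative to the capture group. The Python sketch below reproduces the effect with `re.findall`, which returns the group's contents when a pattern has exactly one group. It assumes, as the outputs above suggest, that f// returns the captured group; it says nothing about how Murex implements this internally.

```python
import re

line = "Pages free:                              110207."

# Quantifier outside the group: the group is re-captured on every repetition,
# so only the last digit of the number survives.
last_digit_only = re.findall(r"([0-9])+", line)

# Quantifier inside the group: the group spans the whole run of digits.
whole_number = re.findall(r"([0-9]+)", line)

print(last_digit_only)  # ['7']
print(whole_number)     # ['110207']
```

This matches the thread: the first attempt printed one digit per number (the final digit of each), while moving the `+` inside the parentheses printed the full values.
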

lmorg commented 4 weeks ago

re-opening this because some of these suggestions do deserve proper consideration for inclusion into Murex

lmorg commented 3 weeks ago

ac56e69 added list.join / mjoin