mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License
1.5k stars 65 forks source link

Capitalize initials of all names in a string #48

Closed derekmahar closed 2 years ago

derekmahar commented 3 years ago

Given a string that contains multiple lower case names (words), how might I capitalize only the initial letter of these names?

I think that the existing set of string functions in csvq cannot perform this operation. However, this can be done by applying a PCRE2 regular expression substitution which converts all initials to upper case.

Unfortunately, according to Regular Expressions 101, Golang does not support the PCRE2 substitution operator \u, so I presume that the new regular expression functions REGEXP_REPLACE in csvq v1.14.1-pr.1 likewise also cannot perform this operation. If this is the case, would you consider implementing a specific string function or generic set of string functions to capitalize all words in a string?

derekmahar commented 3 years ago

I verified that REGEXP_REPLACE in csvq v1.14.1-pr.1 does not support substitution operator \u:

derek@derek-TB350-BTC:~/Downloads/csvq-v1.14.1-pr.1-linux-amd64$ ./csvq "SELECT REGEXP_REPLACE('abc abc', '(\w+)', '\u\$1')"
+----------------------------------------------+
| REGEXP_REPLACE('abc abc', '(\\w+)', '\\u$1') |
+----------------------------------------------+
| \uabc \uabc                                  |
+----------------------------------------------+
mithrandie commented 3 years ago

I'm sorry if you feel this is rude, but I have a question. Are the features you suggested what you actually needed in some situations? Or are you aiming for the perfect tool?

Csvq is not originally designed to do very complex processing, and there are usually other good solutions for such things. There is no need to try to do everything with one tool when there are many other great tools in the world.

By the way, you can get the results with the current features of csvq. For example, you can create a user-defined function to get the result you want.

Declare the user-defined function in a file named "capitalize.sql".

DECLARE CAPITALIZE FUNCTION (@str) AS BEGIN
    DECLARE @result := '';

    DECLARE @list := REGEXP_FIND_ALL(@str, '\w+');
    DECLARE @i := 0;

    WHILE TRUE
    DO
        DECLARE @item := JSON_VALUE('[' || @i || '][0]', @list);
        IF @item IS NULL THEN
            BREAK;
        END IF;

        IF 0 < LEN(@item) THEN
            IF 0 < LEN(@result) THEN
                @result := @result || ' ';
            END IF;

            @result := @result || UPPER(SUBSTRING(@item FROM 1 FOR 1));
        END IF;

        IF 1 < LEN(@item) THEN
            @result := @result || SUBSTRING(@item FROM 2);
        END IF;

        @i := @i + 1;
    END WHILE;

    RETURN @result;
END;

And run a sql with the SOURCE command.

$ csvq "SOURCE `capitalize.sql`; SELECT CAPITALIZE('capitalize initials of all names')"
+------------------------------------------------+
| CAPITALIZE('capitalize initials of all names') |
+------------------------------------------------+
| Capitalize Initials Of All Names               |
+------------------------------------------------+
derekmahar commented 3 years ago

Thank you for this solution. I don't think your response was rude at all.

I do have a specific purpose. I want to capitalize the initials of names in a column that contains full names in mixed case. ;-) I think this is not such a complicated problem and so I prefer to use csvq than resort to another tool or a lower level programming language like Rust, Java, or heaven forbid, Python.

I've yet to use user defined functions in csvq and I had forgotten that csvq supported them. I have read several sections of the csvq documentation many times, but this time I somehow overlooked the section on user defined functions. I sometimes confuse Miller and csvq's features and so mistakenly thought that Miller had user defined functions, but csvq did not.

derekmahar commented 3 years ago

By the way, for what it's worth, as far as I could tell, SQLite string functions can't do this, either, and SQLite doesn't support user defined functions except those in the form of loadable modules implemented in C.

derekmahar commented 3 years ago

Is JSON_VALUE the only function in csvq which can operate on an array of strings?

mithrandie commented 3 years ago

You can also use the function LIST_ELEM.

$ csvq "SELECT LIST_ELEM('word1 word2', ' ', 1)"
+----------------------------------+
| LIST_ELEM('word1 word2', ' ', 1) |
+----------------------------------+
| word2                            |
+----------------------------------+

I'd like to implement composite types of Array and Map, and syntax and functions to handle them, but it may take a while.

derekmahar commented 3 years ago

I'll be patient, then.

mithrandie commented 3 years ago

Thanks for your response. If you use user-defined functions frequently, you can define them in the $HOME/.csvqrc so that you don't need to run the SOURCE command every time. (See: https://mithrandie.github.io/csvq/reference/command.html#configurations)

The function to capitalize will be provided in the future, but regular expressions will remain as they are now.

derekmahar commented 3 years ago

Thank you. I was just thinking about how to load user defined functions.

I'm satisfied with the user defined CAPITALIZE function.

derekmahar commented 3 years ago

I implemented user defined function CAPITALIZE_WORDS that uses function LIST_ELEM instead of JSON_VALUE:

$ cat csvqrc
DECLARE CAPITALIZE_WORDS FUNCTION (@words) AS
BEGIN
  DECLARE @result := CAPITALIZE(GET_WORD(@words, 1));
  DECLARE @word := '';
  DECLARE @word_count := 2;

  WHILE (@word := GET_WORD(@words, @word_count)) IS NOT NULL
  DO
    @result := @result || ' ' || CAPITALIZE(@word);
    @word_count := @word_count + 1;
  END WHILE;

  RETURN @result;
END;

DECLARE CAPITALIZE FUNCTION (@word) AS
BEGIN
  IF @word = '' THEN
    RETURN '';
  ELSE
    RETURN UPPER(SUBSTR(@word, 0, 1))
      || LOWER(IFNULL(SUBSTR(@word, 1), ''));
  END IF;
END;

DECLARE GET_WORD FUNCTION (@words, @index) AS
BEGIN
  RETURN LIST_ELEM(@words, ' ', @index - 1);
END;
$ csvq "SELECT '\'' || CAPITALIZE_WORDS('  abc  def  ') || '\'';"
+--------------------------------------------------+
| '\'' || CAPITALIZE_WORDS('  abc  def  ') || '\'' |
+--------------------------------------------------+
| '  Abc  Def  '                                   |
+--------------------------------------------------+
$ csvq "SELECT '\'' || CAPITALIZE_WORDS('abc def') || '\'';"
+---------------------------------------------+
| '\'' || CAPITALIZE_WORDS('abc def') || '\'' |
+---------------------------------------------+
| 'Abc Def'                                   |
+---------------------------------------------+
$ csvq "SELECT '\'' || CAPITALIZE_WORDS('abc') || '\'';"
+-----------------------------------------+
| '\'' || CAPITALIZE_WORDS('abc') || '\'' |
+-----------------------------------------+
| 'Abc'                                   |
+-----------------------------------------+
$ csvq "SELECT '\'' || CAPITALIZE_WORDS('') || '\'';"
+--------------------------------------+
| '\'' || CAPITALIZE_WORDS('') || '\'' |
+--------------------------------------+
| ''                                   |
+--------------------------------------+
$ csvq "SELECT '\'' || CAPITALIZE_WORDS(NULL) || '\'';"
+----------------------------------------+
| '\'' || CAPITALIZE_WORDS(NULL) || '\'' |
+----------------------------------------+
|                  NULL                  |
+----------------------------------------+
mithrandie commented 2 years ago

TITLE_CASE function has been included in version 1.17.0.