mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License
1.5k stars 65 forks

When receiving from standard input multiple CSV input files having common headers, automatically concatenate the data rows. #18

Closed derekmahar closed 4 years ago

derekmahar commented 4 years ago

When receiving multiple CSV files from standard input that have the same column headers, csvq should automatically concatenate these files into a single CSV such that the query result is the union of all the input data rows. This would simplify the command required to import multiple CSV files, from one that requires a complicated for-loop to strip all header rows but the first, to a simpler command that requires only a single cat invocation.

If csvq encountered a row in the input stream that differed from the first row (the header) in some, but not all, columns, csvq would fail and report an error. Any row that differed from the header in all columns and was type-compatible with the second row in the stream, csvq would treat as a data row.
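In the meantime, the proposed concatenation can be approximated outside csvq. A sketch, assuming POSIX awk and hypothetical file names file1.csv and file2.csv: FNR resets to 1 at the start of each input file, so repeated header rows can be dropped before piping to csvq:

```shell
# Keep only the first file's header row: FNR is the per-file line
# number and NR the global one, so FNR==1 matches each header row
# and NR!=1 spares the very first line of the whole stream.
awk 'FNR == 1 && NR != 1 { next } { print }' file1.csv file2.csv |
  csvq "SELECT * FROM STDIN"
```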

mithrandie commented 4 years ago

There is probably no way to reliably recognize the boundaries of the original files in a stream that has been concatenated and passed to standard input.

If you know the contents of the files, you can get the same result a little more easily with the following methods.

  1. Remove the header row by using WHERE clause.

    $ cat file1.txt file2.txt | csvq "SELECT * FROM STDIN WHERE NOT col1 == 'col1'"
    +------+------+------+
    | col1 | col2 | col3 |
    +------+------+------+
    | 1    | A    | foo  |
    | 2    | B    | bar  |
    +------+------+------+
  2. Use sed command to remove the header row.

    $ sed "1d" file2.txt | cat file1.txt - | csvq "SELECT * FROM STDIN"
    +------+------+------+
    | col1 | col2 | col3 |
    +------+------+------+
    | 1    | A    | foo  |
    | 2    | B    | bar  |
    +------+------+------+

Such a feature would be useful if it could be implemented, but the required implementation depends on the situation, so I think it is better to choose among the alternative methods case by case.

derekmahar commented 4 years ago

Instead of, or in addition to, standard input, would you consider having csvq accept multiple CSV files as command-line arguments?

ondohotola commented 4 years ago

Why complicate matters?

It works as a pipe, so use the other unix tools in the pipe line :-)-O

I have hacked together some complicated bash scripts, put them into bash functions, and load them on log-in.

Works very nicely.

el


-- Dr. Eberhard W. Lisse, Obstetrician & Gynaecologist, el@lisse.NA, Telephone: +264 81 124 6733 (cell), PO Box 8421, Bachbrecht 10007, Namibia

mithrandie commented 4 years ago

No, I'm not considering implementing such a command option. If you want to read multiple files, rather than standard input, you can use joins or set operators, such as UNION, in queries.

derekmahar commented 4 years ago

Why complicate matters? It works as a pipe, so use the other unix tools in the pipe line :-)-O I have hacked me some complicated bash scripts, put them into bash functions and load them on log-in. Works very nicely. el

I agree with you in principle, but I am migrating some scripts from Miller to csvq and Miller cat supports a list of input file name arguments:

$ printf "col1,col2,col3\n1,A,foo\n" > file1.csv
$ printf "col1,col2,col3\n2,B,bar\n" > file2.csv
$ cat file1.csv file2.csv
col1,col2,col3
1,A,foo
col1,col2,col3
2,B,bar
$ mlr --csv cat file1.csv file2.csv
col1,col2,col3
1,A,foo
2,B,bar

In my case, I have a directory that contains many CSV input files whose file names contain spaces and that all contain the same column header. If csvq could accept file name arguments, I could simplify the script

  {
    SAVE_IFS=$IFS
    IFS=$(echo -en "\n\b")
    declare -a statements
    statements=( "$@" )

    if [ ${#statements[@]} -gt 0 ]
    then
      cat "${statements[0]}"
    fi

    for statement in "${statements[@]:1}"
    do
      cat "$statement" | sed 1d
    done
    IFS=$SAVE_IFS
  } | csvq --format CSV --without-header --source select-query.sql

to the single command

csvq --format CSV --without-header --source select-query.sql "$@"

Note that the first script must change IFS because, with its default value, the Bash for command uses IFS to split a list of words on spaces, and the file names in the statements array contain spaces.
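A minimal sketch of the word splitting in question (bash assumed): an unquoted expansion is split on the default IFS, so a file name containing a space becomes two words unless IFS is restricted to newlines:

```shell
name="file 1.csv"
for word in $name; do echo "[$word]"; done   # default IFS: [file] [1.csv]
IFS=$'\n'                                    # split only on newlines
for word in $name; do echo "[$word]"; done   # one word: [file 1.csv]
unset IFS                                    # restore default splitting
```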

Absent the csvq facility to read CSV files specified as arguments, I will try to explore ways to simplify the first script or just use Miller to concatenate the CSV files into a single CSV file that contains only a header on the first line.

derekmahar commented 4 years ago

After some experimentation, I found a simpler solution that uses cat, sed, Bash arrays, and special "empty" source and sink file /dev/null:

$ printf "col1,col2,col3\n1,A,foo\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ { files=(/tmp/file*.csv); cat "${files[@]:0:1}" /dev/null; sed "1d" "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1    | A    | foo  |
| 2    | B    | bar  |
+------+------+------+
$ { files=(); cat "${files[@]:0:1}" /dev/null; sed "1d" "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
Empty Fields
$

Notes:

  1. Commands cat and sed both include special "empty" file argument /dev/null to prevent them from blocking on an empty file list.
  2. ${files[@]:0:1} is an array slice that resolves to the first file name in the array (the head) and produces the empty list when array files is empty. ${files[0]} would also resolve to the first file name in the array, but produces an empty string when files is empty, causing the command cat "" /dev/null to report the error cat: '': No such file or directory.
  3. ${files[@]:1} is an array slice that resolves to all of the file names in the array except the first (the tail) and also produces the empty list when array files is empty.
  4. When ${files[@]} is surrounded by quotation marks (i.e. "${files[@]}"), the expansion does not split each file name on white-space boundaries. Compare

    $ { files=(/tmp/file*.csv); for file in ${files[@]}; do echo "[$file]"; done; }
    [/tmp/file]
    [1.csv]
    [/tmp/file]
    [2.csv]

    which does not surround ${files[@]} with quotation marks, to

    $ { files=(/tmp/file*.csv); for file in "${files[@]}"; do echo "[$file]"; done; }
    [/tmp/file 1.csv]
    [/tmp/file 2.csv]

    which does surround ${files[@]} with quotation marks.

    This also demonstrates that my initial script would have worked correctly with file names that contain spaces even without changing IFS.

This solution is also significantly faster than the for-loop in the original script because it invokes cat and sed only once instead of once per CSV input file.
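The difference between the slice and index forms in notes 2 and 3 can be checked directly (bash assumed): with an empty array, the slice expands to zero words while the index expands to one empty word:

```shell
files=()
set -- "${files[@]:0:1}"      # slice of an empty array expands to nothing
echo "slice: $# argument(s)"  # slice: 0 argument(s)
set -- "${files[0]}"          # index of an empty array: one empty string
echo "index: $# argument(s)"  # index: 1 argument(s)
```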

mithrandie commented 4 years ago

Okay, I understand what you want to do. However, cat there is a subcommand of Miller, not the standard utility on unix systems. Miller and csvq are designed around different ideas, and the usages of the two tools differ widely, so implementing this feature in csvq might make csvq harder to understand. After all, I think the data should be formatted before it is passed to csvq.

The solution using bash array is very simple and nice, I like it!👍

derekmahar commented 4 years ago

Unfortunately, it turns out that my solution doesn't work as intended for more than two files. sed applies its script to the concatenation of all of its file name arguments, not to each file in sequence as I had thought. So, if you apply my solution to three or more files, sed strips the header from only the second file:

$ printf "col1,col2,col3\n1,A,foo\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { files=(/tmp/file*.csv); cat "${files[@]:0:1}" /dev/null; sed "1d" "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1    | A    | foo  |
| 2    | B    | bar  |
| col1 | col2 | col3 |
| 3    | C    | baz  |
+------+------+------+ 

Back to the drawing board! (In the meantime, I'll just use mlr cat to strip all of the headers except the first.)

ondohotola commented 4 years ago

Try for loop?

el


derekmahar commented 4 years ago

Try for loop? el

This would work and is precisely what my original script did, but it's slower because the for-loop invokes sed for each file name.

mithrandie commented 4 years ago

Maybe you can use xargs.

{ files=(/tmp/file*.csv); cat "${files[@]:0:1}"; echo "${files[@]:1}" | xargs -n1 sed "1d"; }
ondohotola commented 4 years ago

Does it have to?

Did you try head -1 or something similar?

And how much time are we talking about, anyway? I.e. is the speed difference really worth redoing all of this?

greetings, el


derekmahar commented 4 years ago

I can strip the header (first line) from each file if I replace sed "1d" with tail -qn +2:

$ printf "col1,col2,col3\n1,A,foo1\n1,A,foo2\n1,A,foo3\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { files=(/tmp/file*.csv); cat "${files[@]:0:1}" /dev/null; tail -qn +2 "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1    | A    | foo1 |
| 1    | A    | foo2 |
| 1    | A    | foo3 |
| 2    | B    | bar  |
| 3    | C    | baz  |
+------+------+------+
derekmahar commented 4 years ago

Does it have to? Did you try head -1 or something similar? And how much time are we talking about, anyway? I.e. is the speed difference really worth redoing all of this? greetings, el

tail -nq +2 does what I need. Yes, it makes a difference because I have many CSV input files.

derekmahar commented 4 years ago

Maybe you can use xargs.

xargs will work, but it also invokes sed once for each input file.
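That per-file invocation is easy to observe by substituting a plain echo for sed (a sketch; GNU xargs assumed):

```shell
# xargs -n1 passes one argument per invocation, so the command
# runs once for each input line.
printf '%s\n' one two three | xargs -n1 echo run:
# run: one
# run: two
# run: three
```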

derekmahar commented 4 years ago

I just learned that sed has an option --separate or -s to treat each input file separately rather than as a single stream:

$ printf "col1,col2,col3\n1,A,foo1\n1,A,foo2\n1,A,foo3\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { files=(/tmp/file*.csv); cat "${files[@]:0:1}" /dev/null; sed -s "1d" "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1    | A    | foo1 |
| 1    | A    | foo2 |
| 1    | A    | foo3 |
| 2    | B    | bar  |
| 3    | C    | baz  |
+------+------+------+
ondohotola commented 4 years ago

You are having a shell problem not a CSVQ one :-)-O

I do something like this.

FILES="$*"
if [ $(ls -1 $FILES 2>/dev/null | wc -l) -gt 0 ]
then
    ( /bin/echo -n "type,sequence,unique,file," \
    && /bin/echo -n "identifier,medaid," \
    && /bin/echo -n "notused1,memberno," \
    && /bin/echo "assistantname" ) > e.csv
    sed '1d' $FILES >> e.csv
    csvq "SELECT type,sequence,unique,file FROM e"
    rm -f e.csv
else
    echo "Can't find $FILES"
fi

The multiple echo lines are just for readability :-)-O


ondohotola commented 4 years ago

Ah, need to play with sed -s myself :-)-O

el


derekmahar commented 4 years ago

You are having a shell problem not a CSVQ one :-)-O

Yes, and I've fixed it.

derekmahar commented 4 years ago

For a very large number of files, the following script produces the same result, is about as fast, but won't exceed any command line length limits:

$ printf "col1,col2,col3\n1,A,foo1\n1,A,foo2\n1,A,foo3\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { find /tmp/file*.csv -print0 | head -zn 1 | xargs -0 cat; find /tmp/file*.csv -print0 | tail -zn +2 | xargs -0 tail -qn +2; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1    | A    | foo1 |
| 1    | A    | foo2 |
| 1    | A    | foo3 |
| 2    | B    | bar  |
| 3    | C    | baz  |
+------+------+------+

Notes:

  1. find /tmp/file*.csv -print0 selects all file names that match the pattern as a null delimited list.
  2. head -zn 1 selects only the first file name from a null delimited list.
  3. tail -zn +2 skips the first file name in a null delimited list.
  4. xargs -0 <command> splits a null delimited file name list into fragments, and for each fragment, executes <command>, passing the fragment as the list of file name arguments.
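The null-delimited head/tail split can be sanity-checked on its own, without csvq (GNU coreutils assumed for the -z options):

```shell
# Build a null-delimited list of three names, then split off the
# head (first entry) and the tail (everything after the first).
printf 'a\0b\0c\0' | head -zn 1  | tr '\0' '\n'   # a
printf 'a\0b\0c\0' | tail -zn +2 | tr '\0' '\n'   # b, then c
```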
derekmahar commented 4 years ago

This version invokes find only once, stores the list of file names in a temporary file, and supplies this list of file names separately to head -zn 1 | xargs -0 cat and tail -zn +2 | xargs -0 tail -qn +2, ensuring that they each operate on the same list:

$ printf "col1,col2,col3\n1,A,foo1\n1,A,foo2\n1,A,foo3\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { files=`mktemp`; find /tmp/file*.csv -print0 > $files; cat $files | head -zn 1 | xargs -0 cat; cat $files | tail -zn +2 | xargs -0 tail -qn +2; rm $files; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1    | A    | foo1 |
| 1    | A    | foo2 |
| 1    | A    | foo3 |
| 2    | B    | bar  |
| 3    | C    | baz  |
+------+------+------+