Closed derekmahar closed 4 years ago
There is probably no way to reliably detect the boundaries between the original files in the stream that is concatenated and passed to standard input.
If you know the contents of the files, you can get the same result a little more easily with the following methods.
Files
file1.txt
col1,col2,col3
1,A,foo
file2.txt
col1,col2,col3
2,B,bar
Remove the header rows by using a WHERE clause.
$ cat file1.txt file2.txt | csvq "SELECT * FROM STDIN WHERE NOT col1 == 'col1'"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1 | A | foo |
| 2 | B | bar |
+------+------+------+
Use the sed command to remove the header row.
$ sed "1d" file2.txt | cat file1.txt - | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1 | A | foo |
| 2 | B | bar |
+------+------+------+
It will be useful if such an implementation is possible, but the required implementation depends on respective situations, and I think that it is better to use different alternative methods depending on the situation.
Instead of, or in addition to, standard input, would you consider having csvq accept multiple CSV files as command-line arguments?
Why complicate matters?
It works as a pipe, so use the other unix tools in the pipe line :-)-O
I have hacked together some complicated bash scripts, put them into bash functions, and load them on log-in.
Works very nicely.
el
-- Dr. Eberhard W. Lisse | Obstetrician & Gynaecologist | el@lisse.NA | Telephone: +264 81 124 6733 (cell) | PO Box 8421, Bachbrecht 10007, Namibia
No, I'm not considering implementing such a command option. If you want to read multiple files, rather than standard input, then you can use joins or set operators, such as UNION, in queries.
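A sketch of what that set-operator approach might look like, using the example files from above (the backquoted table notation and file names are illustrative, not an exact recipe):

```sql
-- merge the two files by querying each as a table and taking the union;
-- plain UNION also removes duplicate rows, per standard SQL semantics
SELECT * FROM `file1.csv`
UNION
SELECT * FROM `file2.csv`
```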
I agree with you in principle, but I am migrating some scripts from Miller to csvq, and Miller's cat supports a list of input file name arguments:
$ printf "col1,col2,col3\n1,A,foo\n" > file1.csv
$ printf "col1,col2,col3\n2,B,bar\n" > file2.csv
$ cat file1.csv file2.csv
col1,col2,col3
1,A,foo
col1,col2,col3
2,B,bar
$ mlr --csv cat file1.csv file2.csv
col1,col2,col3
1,A,foo
2,B,bar
In my case, I have a directory that contains many CSV input files whose file names contain spaces and that all contain the same column header. If csvq could accept file name arguments, I could simplify the script
{
SAVE_IFS=$IFS
IFS=$(echo -en "\n\b")
declare -a statements
statements=( "$@" )
if [ ${#statements[@]} -gt 0 ]
then
cat "${statements[0]}"
fi
for statement in "${statements[@]:1}"
do
cat "$statement" | sed 1d
done
IFS=$SAVE_IFS
} | csvq --format CSV --without-header --source select-query.sql
to the single command
csvq --format CSV --without-header --source select-query.sql "$@"
Note that the first script must change IFS because, with its default value, Bash's for loop uses IFS to break the list of words on spaces, and the file names in array statements contain spaces.
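A minimal sketch of that splitting behavior, assuming bash (the file name is illustrative):

```shell
name="file 1.csv"

# with the default IFS, unquoted expansion splits the name on the space
set -- $name
echo $#   # 2

# with IFS restricted to a newline, the space survives the expansion
IFS=$'\n'
set -- $name
echo $#   # 1
```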
Absent a csvq facility to read CSV files specified as arguments, I will explore ways to simplify the first script, or just use Miller to concatenate the CSV files into a single CSV file that contains only one header on the first line.
After some experimentation, I found a simpler solution that uses cat, sed, Bash arrays, and the special "empty" source and sink file /dev/null:
$ printf "col1,col2,col3\n1,A,foo\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ { files=(/tmp/file*.csv); cat "${files[@]:0:1}" /dev/null; sed "1d" "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1 | A | foo |
| 2 | B | bar |
+------+------+------+
$ { files=(); cat "${files[@]:0:1}" /dev/null; sed "1d" "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
Empty Fields
$
Notes:
- cat and sed both include the special "empty" file argument /dev/null to prevent them from blocking on an empty file list.
- ${files[@]:0:1} is an array slice that resolves to the first file name in the array (the head) and produces the empty list when array files is empty. ${files[0]} would also resolve to the first file name in the array, but produces an empty string when files is empty, causing the command cat "" /dev/null to report the error cat: '': No such file or directory.
- ${files[@]:1} is an array slice that resolves to all of the file names in the array except the first (the tail) and also produces the empty list when array files is empty.
- When ${files[@]} is surrounded by quotation marks (i.e. "${files[@]}"), Bash does not split each file name on white space boundaries. Compare
$ { files=(/tmp/file*.csv); for file in ${files[@]}; do echo "[$file]"; done; }
[/tmp/file]
[1.csv]
[/tmp/file]
[2.csv]
which does not surround ${files[@]} with quotation marks, to
$ { files=(/tmp/file*.csv); for file in "${files[@]}"; do echo "[$file]"; done; }
[/tmp/file 1.csv]
[/tmp/file 2.csv]
which does surround ${files[@]} with quotation marks.
This also demonstrates that my initial script would have worked correctly with file names that contain spaces even without changing IFS.
This solution is also significantly faster than the for-loop in the original script because it invokes cat and sed only once instead of once per CSV input file.
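A quick sketch of the empty-array cases described in the notes above, assuming bash:

```shell
files=()

# the quoted slice expands to zero words, so cat sees only /dev/null
cat "${files[@]:0:1}" /dev/null && echo "slice: ok"   # prints "slice: ok"

# plain indexing expands to an empty string, which cat treats as a
# (nonexistent) file name and reports an error
cat "${files[0]}" /dev/null 2>/dev/null || echo "index: error"   # prints "index: error"
```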
Okay, I understand what you want to do.
However, cat here is a subcommand of Miller, not a standard utility on unix systems.
Miller and csvq are designed based on different ideas, and the usages of the two tools are widely different, so implementing the feature in csvq may make it difficult to understand how to use csvq.
In the end, I think the data should be formatted before passing it to csvq.
The solution using a bash array is very simple and nice. I like it! 👍
Unfortunately, it turns out that my solution doesn't work as intended for more than two files. sed applies its script to the concatenation of all of its file name arguments, not to each file in sequence as I had thought. So, if you apply my solution to three or more files, sed strips the header from only the second file:
$ printf "col1,col2,col3\n1,A,foo\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { files=(/tmp/file*.csv); cat "${files[@]:0:1}" /dev/null; sed "1d" "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1 | A | foo |
| 2 | B | bar |
| col1 | col2 | col3 |
| 3 | C | baz |
+------+------+------+
Back to the drawing board! (In the meantime, I'll just use mlr cat to strip all of the headers except the first.)
Try for loop?
el
Try for loop? el
This would work and is precisely what my original script did, but it's slower because the for-loop invokes sed once for each file name.
Maybe you can use xargs.
{ files=(/tmp/file*.csv); cat "${files[@]:0:1}"; echo "${files[@]:1}" | xargs -n1 sed "1d"; }
Does it have to?
Did you try head -1 or something similar?
And how much time are we talking about, anyway? I.e. is the speed difference really worth redoing all of this?
greetings, el
I can strip the header (first line) from each file if I replace sed "1d" with tail -qn +2:
$ printf "col1,col2,col3\n1,A,foo1\n1,A,foo2\n1,A,foo3\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { files=(/tmp/file*.csv); cat "${files[@]:0:1}" /dev/null; tail -qn +2 "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1 | A | foo1 |
| 1 | A | foo2 |
| 1 | A | foo3 |
| 2 | B | bar |
| 3 | C | baz |
+------+------+------+
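For reference, the per-file behavior of tail -qn +2 can be seen without csvq (a sketch with throwaway files):

```shell
printf 'header\na\n' > /tmp/t1.txt
printf 'header\nb\n' > /tmp/t2.txt

# -n +2 starts output at line 2 of each file; -q suppresses the
# "==> file <==" banners tail normally prints for multiple files
tail -qn +2 /tmp/t1.txt /tmp/t2.txt
# prints:
# a
# b
```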
Does it have to? Did you try head -1 or something similar? And how much time are we talking about, anyway? I.e. is the speed difference really worth redoing all of this? greetings, el
tail -qn +2 does what I need. Yes, it makes a difference because I have many CSV input files.
Maybe you can use xargs.
xargs will work, but it also invokes sed once for each input file.
I just learned that sed has an option --separate or -s to treat each input file separately rather than as a single stream:
$ printf "col1,col2,col3\n1,A,foo1\n1,A,foo2\n1,A,foo3\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { files=(/tmp/file*.csv); cat "${files[@]:0:1}" /dev/null; sed -s "1d" "${files[@]:1}" /dev/null; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1 | A | foo1 |
| 1 | A | foo2 |
| 1 | A | foo3 |
| 2 | B | bar |
| 3 | C | baz |
+------+------+------+
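In isolation, the effect of -s is easy to verify (a sketch; --separate is a GNU sed option and may not exist in other sed implementations):

```shell
printf 'header\na\n' > /tmp/s1.txt
printf 'header\nb\n' > /tmp/s2.txt

# without -s, line numbers run across the concatenated input,
# so "1d" deletes only the first file's header
sed '1d' /tmp/s1.txt /tmp/s2.txt
# prints: a, header, b

# with -s, each file is numbered separately, so every header is deleted
sed -s '1d' /tmp/s1.txt /tmp/s2.txt
# prints: a, b
```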
You are having a shell problem not a CSVQ one :-)-O
I do something like this.
FILES="$*"
if [ $(ls -1 $FILES 2>/dev/null | wc -l) -gt 0 ]
then
( /bin/echo -n "type,sequence,unique,file," \
&& /bin/echo -n "identifier,medaid," \
&& /bin/echo -n "notused1,memberno," \
&& /bin/echo "assistantname" ) > e.csv
sed '1d' $FILES >> e.csv
csvq "SELECT type,sequence,unique,file FROM e"
rm -f e.csv
else
echo "Can't find $FILES"
fi
The multiple echo lines are just for readability :-)-O
Ah, need to play with sed -s myself :-)-O
el
You are having a shell problem not a CSVQ one :-)-O
Yes, and I've fixed it.
For a very large number of files, the following script produces the same result, is about as fast, but won't exceed any command line length limits:
$ printf "col1,col2,col3\n1,A,foo1\n1,A,foo2\n1,A,foo3\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { find /tmp/file*.csv -print0 | head -zn 1 | xargs -0 cat; find /tmp/file*.csv -print0 | tail -zn +2 | xargs -0 tail -qn +2; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1 | A | foo1 |
| 1 | A | foo2 |
| 1 | A | foo3 |
| 2 | B | bar |
| 3 | C | baz |
+------+------+------+
Notes:
- find /tmp/file*.csv -print0 selects all file names that match the pattern as a null-delimited list.
- head -zn 1 selects only the first file name from a null-delimited list.
- tail -zn +2 skips the first file name in a null-delimited list.
- xargs -0 <command> splits a null-delimited file name list into fragments and, for each fragment, executes <command>, passing the fragment as the list of file name arguments.
This version invokes find only once, stores the list of file names in a temporary file, and supplies this list of file names separately to head -zn 1 | xargs -0 cat and tail -zn +2 | xargs -0 tail -qn +2, ensuring that they each operate on the same list:
$ printf "col1,col2,col3\n1,A,foo1\n1,A,foo2\n1,A,foo3\n" > /tmp/"file 1".csv
$ printf "col1,col2,col3\n2,B,bar\n" > /tmp/"file 2".csv
$ printf "col1,col2,col3\n3,C,baz\n" > /tmp/"file 3".csv
$ { files=`mktemp`; find /tmp/file*.csv -print0 > $files; cat $files | head -zn 1 | xargs -0 cat; cat $files | tail -zn +2 | xargs -0 tail -qn +2; rm $files; } | csvq "SELECT * FROM STDIN"
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1 | A | foo1 |
| 1 | A | foo2 |
| 1 | A | foo3 |
| 2 | B | bar |
| 3 | C | baz |
+------+------+------+
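The null-delimited plumbing can be checked on its own (a sketch, assuming GNU head and tail with the -z option and xargs with -0):

```shell
# a fake null-delimited list of three names
printf 'a\0b\0c\0' > /tmp/list

# head -zn 1 keeps only the first NUL-terminated record
head -zn 1 /tmp/list | xargs -0 echo    # prints: a

# tail -zn +2 skips the first record and keeps the rest
tail -zn +2 /tmp/list | xargs -0 echo   # prints: b c
```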
When receiving from standard input multiple CSV files having the same column headers, csvq should automatically concatenate these files into a single CSV such that the query result is the union of all the input data rows. This would simplify the command required to import multiple CSV files from one which requires a complicated for-loop that strips all header rows but the first to a simpler command which requires only a single cat command.
In the event that csvq encounters a row in the input stream that differs from the first row (the header) in at least one column, but not all columns, csvq would fail and report an error. csvq would treat as a data row any row which differs from the header in all columns and which is type-compatible with the second row in the stream.