Closed ggrothendieck closed 2 weeks ago
I had to tweak your code a bit to get it to work on my end:
# enum adds a 0-based "index" column
qsv enum us-states-example.csv |
# derive a zero-padded partition key: 4 rows per output file
qsv luau map partidx "string.format('%02d', index // 4)" |
# drop the helper index column
qsv select '!/index/' |
# write one file per partidx value into the current directory
qsv partition --drop partidx . -
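As a hedged sketch (the shell variable names here are illustrative, not part of the thread), the same pipeline can be generalized so the rows-per-file count and the zero-pad width are parameters; the pad width just has to be large enough that the output files sort correctly:

```shell
rows_per_file=4   # rows per output file
pad=3             # zero-pad width; must cover the expected number of files
qsv enum us-states-example.csv |
  qsv luau map partidx "string.format('%0${pad}d', index // ${rows_per_file})" |
  qsv select '!/index/' |
  qsv partition --drop partidx . -
```

With pad=3 the output files sort correctly for up to 1000 partitions.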
Anyway, have you tried combining partition with split?
Here's a small shell script I pulled together using the two commands:
partsplit.sh
#!/bin/bash
partition_column="$1"
partition_results_dir="$2"
partition_input_csv="$3"
qsv partition "$partition_column" "$partition_results_dir" "$partition_input_csv"
# Loop through partitioned files in the results directory
for file in "$partition_results_dir"/*; do
  if [ -f "$file" ]; then
    # split each partitioned CSV into smaller 10-row CSVs
    qsv split "$file"-splitdir --size 10 "$file"
  fi
done
Running it against this file:
https://raw.githubusercontent.com/dathere/qsv-lookup-tables/main/lookup-tables/BLS_SeriesID.csv
I split each partitioned CSV into 10-row CSVs in their own split directories...
cd /tmp
mkdir test
cd test
# copy partsplit.sh to /tmp/test and download BLS_SeriesID.csv to /tmp/test
chmod +x partsplit.sh
./partsplit.sh area_type_code partsplit-resultsdir BLS_SeriesID.csv
Wrote 6 chunk/s to '/private/tmp/test/partsplit-resultsdir/A.csv-splitdir'. Rows/chunk: 10 Num records: 52
Wrote 40 chunk/s to '/private/tmp/test/partsplit-resultsdir/B.csv-splitdir'. Rows/chunk: 10 Num records: 396
Wrote 4 chunk/s to '/private/tmp/test/partsplit-resultsdir/C.csv-splitdir'. Rows/chunk: 10 Num records: 38
Wrote 57 chunk/s to '/private/tmp/test/partsplit-resultsdir/D.csv-splitdir'. Rows/chunk: 10 Num records: 562
Wrote 18 chunk/s to '/private/tmp/test/partsplit-resultsdir/E.csv-splitdir'. Rows/chunk: 10 Num records: 178
Wrote 323 chunk/s to '/private/tmp/test/partsplit-resultsdir/F.csv-splitdir'. Rows/chunk: 10 Num records: 3224
Wrote 194 chunk/s to '/private/tmp/test/partsplit-resultsdir/G.csv-splitdir'. Rows/chunk: 10 Num records: 1940
Wrote 142 chunk/s to '/private/tmp/test/partsplit-resultsdir/H.csv-splitdir'. Rows/chunk: 10 Num records: 1419
Wrote 33 chunk/s to '/private/tmp/test/partsplit-resultsdir/I.csv-splitdir'. Rows/chunk: 10 Num records: 321
Wrote 8 chunk/s to '/private/tmp/test/partsplit-resultsdir/J.csv-splitdir'. Rows/chunk: 10 Num records: 71
Wrote 16 chunk/s to '/private/tmp/test/partsplit-resultsdir/K.csv-splitdir'. Rows/chunk: 10 Num records: 160
Wrote 1 chunk/s to '/private/tmp/test/partsplit-resultsdir/L.csv-splitdir'. Rows/chunk: 10 Num records: 7
Wrote 1 chunk/s to '/private/tmp/test/partsplit-resultsdir/M.csv-splitdir'. Rows/chunk: 10 Num records: 4
Wrote 1 chunk/s to '/private/tmp/test/partsplit-resultsdir/N.csv-splitdir'. Rows/chunk: 10 Num records: 9
I am on Windows, so your script won't work for me. A flag on partition would be ideal, but failing that, I don't think I would want something as complex as writing loops or being so dependent on shell scripting.
One additional problem this brings out (although I have noticed it before) is that if one is trying to do this across different platforms, shells, etc., the use of qsv may have to be adapted to each.
Yep - my daily driver is macOS, and we deploy on Ubuntu Linux 22.04 LTS, so it really gets exercised in those environments.
But it should work on WSL or Cygwin as is.
I asked Gemini to port the script to PowerShell, should you want to try it out (I haven't tested it though, as I don't have ready access to Windows):
./partsplit.ps1 <column_name> <results_dir> <input_csv>
partsplit.ps1
$partition_column = $args[0]
$partition_results_dir = $args[1]
$partition_input_csv = $args[2]
# Assume 'qsv' is in your PATH or provide the full path for it
qsv.exe partition $partition_column $partition_results_dir $partition_input_csv
# Loop through partitioned files
Get-ChildItem -Path "$partition_results_dir/*" -File | ForEach-Object {
    $filePath = $_.FullName
    # Split each file into 10-row files
    qsv.exe split "$filePath-splitdir" --size 10 $filePath
}
And as for running across different platforms consistently using a GUI environment - that's one of the main reasons why we built qsv-pro.
Give it a try - we can extend the license key after the trial too...
PowerShell would not work for me because, by default, Windows will not run any PowerShell scripts. That means I can't give it to a noob without them getting into Windows administration, which is a whole new level of complexity. I also can't use non-free software.
GNU split actually does what I was describing, although it does not deal with CSV headers.
Usage: /usr/bin/split [OPTION]... [FILE [PREFIX]]
Output pieces of FILE to PREFIXaa, PREFIXab, ...;
default size is 1000 lines, and default PREFIX is 'x'.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of records per output file
-d use numeric suffixes starting at 0, not alphabetic
--numeric-suffixes[=FROM] same as -d, but allow setting the start value
-x use hex suffixes starting at 0, not alphabetic
--hex-suffixes[=FROM] same as -x, but allow setting the start value
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines/records per output file
-n, --number=CHUNKS generate CHUNKS output files; see explanation below
-t, --separator=SEP use SEP instead of newline as the record separator;
'\0' (zero) specifies the NUL character
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
Binary prefixes can be used, too: KiB=K, MiB=M, and so on.
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines/records
l/K/N output Kth of N to stdout without splitting lines/records
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Report any translation bugs to <https://translationproject.org/team/>
Full documentation <https://www.gnu.org/software/coreutils/split>
or available locally via: info '(coreutils) split invocation'
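For what it's worth, the --filter option listed above can work around the header limitation by re-attaching the CSV header to each chunk. Here is a sketch (the input file name and the chunk_ prefix are illustrative, not from this thread):

```shell
# Split a CSV into 10-row chunks, repeating the header in every chunk.
# split runs the --filter command once per chunk, with the chunk's
# name in $FILE; the exported header is visible in that subshell.
header=$(head -n 1 input.csv)
export header
tail -n +2 input.csv |
  split -l 10 -d --additional-suffix=.csv \
    --filter='{ printf "%s\n" "$header"; cat; } > "$FILE"' - chunk_
```

This writes chunk_00.csv, chunk_01.csv, ... with the header line on top of each.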
qsv split is actually inspired by GNU split :)
qsv split is CSV-aware, so each chunk will have its own header by default.
Another solution without using a shell script is to use qsv luau filter and then pipe the filtered output to qsv split...
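For instance, a sketch along those lines, reusing the BLS_SeriesID.csv example from above (the filter value "F" and the output directory name are illustrative):

```shell
# Keep only rows whose area_type_code is "F", then write the filtered
# rows out as 10-row CSV chunks; each chunk gets its own header.
qsv luau filter 'area_type_code == "F"' BLS_SeriesID.csv |
  qsv split F-splitdir --size 10 -
```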
Converting this issue to a discussion, which, IMHO, is more appropriate...
It would be nice if qsv partition could partition into n rows per output file (except for the last file). I think this would be a frequent use of qsv when preprocessing files to be read in by a program in some other language.
We can already do that like this, provided we set the format to have a sufficient number of zeros so that the files sort correctly, but it would be nice if this were easier via a flag on partition. We will use this 50-row file as an example and partition it into 4-row files.
https://raw.githubusercontent.com/dathere/qsv-lookup-tables/main/lookup-tables/us-states-example.csv
This is the code