jqnatividad / qsv

CSVs sliced, diced & analyzed.
The Unlicense
2.26k stars 63 forks source link

partition file into files with n rows each (except for last file) #1788

Closed ggrothendieck closed 2 weeks ago

ggrothendieck commented 1 month ago

It would be nice if qsv partition could partition into n rows per output file (except for last file). I think this would be a frequent use of qsv when using it to preprocess files to be read in by a program in some other language.

We can already do that like this provided we set the format to have sufficient number of zeros so that the files sort correctly but it would be nice if this were easier via a flag on partition. We will use this 50 row file as an example and partition it into 4 row files.

https://raw.githubusercontent.com/dathere/qsv-lookup-tables/main/lookup-tables/us-states-example.csv

This is the code

qsv enum us-states-example.csv | ^
qsv luau map index "string.format('%02d', index // 4)" | ^
qsv select !index | ^
qsv partition --drop index . -
jqnatividad commented 4 weeks ago

I had to tweak your code a bit to get it to work on my end:

qsv enum us-states-example.csv |      
qsv luau map partidx "string.format('%02d', index // 4)" |
qsv select '!/index/' | 
qsv partition --drop partidx . -

Anyway, have you tried combining partition with split?

Here's a small shell script I pulled together using the two commands:

partsplit.sh

#!/bin/bash

partition_column="$1"
partition_results_dir="$2"
partition_input_csv="$3"

qsv partition "$partition_column" "$partition_results_dir" "$partition_input_csv"

# Loop through partitioned files in the results directory
for file in "$partition_results_dir"/*; do
  if [ -f "$file" ]; then
    # split each partitioned CSV into smaller 10-row CSVs
    qsv split "$file"-splitdir --size 10 "$file"
  fi
done

Running it against this file:

https://raw.githubusercontent.com/dathere/qsv-lookup-tables/main/lookup-tables/BLS_SeriesID.csv

I split each partitioned CSV into 10-row CSVs in their own split directories...

cd /tmp
mkdir test
cd test
# copy partsplit.sh to /tmp/test and download BLS_SeriesID.csv to /tmp/test
chmod +x partsplit.sh
./partsplit.sh area_type_code partsplit-resultsdir BLS_SeriesID.csv
Wrote 6 chunk/s to '/private/tmp/test/partsplit-resultsdir/A.csv-splitdir'. Rows/chunk: 10 Num records: 52
Wrote 40 chunk/s to '/private/tmp/test/partsplit-resultsdir/B.csv-splitdir'. Rows/chunk: 10 Num records: 396
Wrote 4 chunk/s to '/private/tmp/test/partsplit-resultsdir/C.csv-splitdir'. Rows/chunk: 10 Num records: 38
Wrote 57 chunk/s to '/private/tmp/test/partsplit-resultsdir/D.csv-splitdir'. Rows/chunk: 10 Num records: 562
Wrote 18 chunk/s to '/private/tmp/test/partsplit-resultsdir/E.csv-splitdir'. Rows/chunk: 10 Num records: 178
Wrote 323 chunk/s to '/private/tmp/test/partsplit-resultsdir/F.csv-splitdir'. Rows/chunk: 10 Num records: 3224
Wrote 194 chunk/s to '/private/tmp/test/partsplit-resultsdir/G.csv-splitdir'. Rows/chunk: 10 Num records: 1940
Wrote 142 chunk/s to '/private/tmp/test/partsplit-resultsdir/H.csv-splitdir'. Rows/chunk: 10 Num records: 1419
Wrote 33 chunk/s to '/private/tmp/test/partsplit-resultsdir/I.csv-splitdir'. Rows/chunk: 10 Num records: 321
Wrote 8 chunk/s to '/private/tmp/test/partsplit-resultsdir/J.csv-splitdir'. Rows/chunk: 10 Num records: 71
Wrote 16 chunk/s to '/private/tmp/test/partsplit-resultsdir/K.csv-splitdir'. Rows/chunk: 10 Num records: 160
Wrote 1 chunk/s to '/private/tmp/test/partsplit-resultsdir/L.csv-splitdir'. Rows/chunk: 10 Num records: 7
Wrote 1 chunk/s to '/private/tmp/test/partsplit-resultsdir/M.csv-splitdir'. Rows/chunk: 10 Num records: 4
Wrote 1 chunk/s to '/private/tmp/test/partsplit-resultsdir/N.csv-splitdir'. Rows/chunk: 10 Num records: 9
ggrothendieck commented 4 weeks ago

I am on Windows so your script won't work for me. A flag on partition would be ideal but if not I don't think I would want something as complex as writing loops or being so dependent on shell scripting.

One additional problem that this brings out, although I have noticed it before, is that if one is trying to do this across different platforms, shells, etc. the use of qsv may have to be adapted to each.

jqnatividad commented 4 weeks ago

Yep - my daily driver is macOS, and we deploy on Ubuntu Linux 22.04 LTS, so it really gets exercised in those environments.

But it should work on WSL or Cygwin as is.

I asked Gemini to port the script to Powershell, should you want to try it out (haven't tested though it as I don't have ready access to Windows):

./partsplit.ps1 <column_name> <results_dir> <input_csv>

partsplit.ps1

$partition_column = $args[0]
$partition_results_dir = $args[1]
$partition_input_csv = $args[2]

# Assume 'qsv' is in your PATH or provide the full path for it
qsv.exe partition $partition_column $partition_results_dir $partition_input_csv

# Loop through partitioned files
Get-ChildItem -Path "$partition_results_dir/*" -File | ForEach-Object {
    $filePath = $_.FullName

    # Split each file into 10-row files
    qsv.exe split "$filePath-splitdir" --size 10 $filePath 
}

And as for running across different platforms consistently using a GUI environment - that's one of the main reasons why we built qsv-pro.

Give it a try - we can extend the license key after the trial too...

ggrothendieck commented 4 weeks ago

Powershell would not work for me because the default for Windows is that it will not run any Powershell scripts. That means I can't give it to a noob without them getting into Windows administration which is a whole new level of complexity. I also can't use non-free software.

ggrothendieck commented 4 weeks ago

GNU split actually does what I was describing although it does not deal with csv headers.

Usage: /usr/bin/split [OPTION]... [FILE [PREFIX]]
Output pieces of FILE to PREFIXaa, PREFIXab, ...;
default size is 1000 lines, and default PREFIX is 'x'.

With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   generate suffixes of length N (default 2)
      --additional-suffix=SUFFIX  append an additional SUFFIX to file names
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of records per output file
  -d                      use numeric suffixes starting at 0, not alphabetic
      --numeric-suffixes[=FROM]  same as -d, but allow setting the start value
  -x                      use hex suffixes starting at 0, not alphabetic
      --hex-suffixes[=FROM]  same as -x, but allow setting the start value
  -e, --elide-empty-files  do not generate empty output files with '-n'
      --filter=COMMAND    write to shell COMMAND; file name is $FILE
  -l, --lines=NUMBER      put NUMBER lines/records per output file
  -n, --number=CHUNKS     generate CHUNKS output files; see explanation below
  -t, --separator=SEP     use SEP instead of newline as the record separator;
                            '\0' (zero) specifies the NUL character
  -u, --unbuffered        immediately copy input to output with '-n r/...'
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
Binary prefixes can be used, too: KiB=K, MiB=M, and so on.

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Report any translation bugs to <https://translationproject.org/team/>
Full documentation <https://www.gnu.org/software/coreutils/split>
or available locally via: info '(coreutils) split invocation'
jqnatividad commented 2 weeks ago

qsv split is actually inspired by GNU split :)

qsv split is CSV-aware, so each chunk will have its own header by default.

Another solution without using a shell script is using qsv luau filter and then piping the filtered output to qsv split...

jqnatividad commented 2 weeks ago

Converting this issue to a discussion, which IMHO, is more appropriate...