medialab / xan

The CSV magician
The Unlicense

xan parallel #311

Closed Yomguithereal closed 1 week ago

Yomguithereal commented 1 month ago

This is mostly useful when aggregating. The main question then is whether the aggregation requires a HashMap or not.

We can rely on tempfiles if absolutely needed.

Anything like filtering and preprocessing can be abstracted by the command if one of its modes is just to output to stdout.

The command could take its input paths 1. line by line from stdin, 2. the same but from a CSV file, 3. as variadic arguments, 4. from a glob string.

We could display a multi progress bar using indicatif.

We can either 1. agg (stats, count), 2. groupby (frequency) or 3. sink output.
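A minimal sketch of the difference between the first two modes, using only the standard library. In-memory strings stand in for files, and the names `parallel_count` and `parallel_frequency` are hypothetical, not part of xan. A plain count only needs an integer per worker, while a frequency table needs one HashMap per worker that gets merged at the end.

```rust
use std::collections::HashMap;
use std::thread;

/// Count the lines of each input in its own thread, then sum the partial counts.
/// No HashMap is needed here: each worker's state is a single integer.
fn parallel_count(inputs: &[String]) -> usize {
    thread::scope(|scope| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|text| scope.spawn(move || text.lines().count()))
            .collect();

        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum::<usize>()
    })
}

/// Build one frequency table per input in its own thread, then merge them all.
fn parallel_frequency(inputs: &[String]) -> HashMap<String, u64> {
    thread::scope(|scope| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|text| {
                scope.spawn(move || {
                    let mut local: HashMap<String, u64> = HashMap::new();
                    for line in text.lines() {
                        *local.entry(line.to_string()).or_insert(0) += 1;
                    }
                    local
                })
            })
            .collect();

        // Merge step: add each worker's counts into a single table.
        let mut merged: HashMap<String, u64> = HashMap::new();
        for handle in handles {
            for (key, count) in handle.join().unwrap() {
                *merged.entry(key).or_insert(0) += count;
            }
        }
        merged
    })
}
```

The merge step is where memory can grow in the HashMap case, since every worker holds its own table until it is joined.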

In the case of output, we need some rolling buffer for each file, flushed to a RwLock writer when it overflows its allowed size.
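A rough sketch of such a rolling buffer, assuming each worker owns one and that flushing whole records under the lock is enough to keep rows from different files from interleaving mid-row. The `RollingBuffer` type and its `capacity` parameter are hypothetical. Since only writes are involved, a `Mutex` would work just as well as the `RwLock` used here.

```rust
use std::io::{self, Write};
use std::sync::{Arc, RwLock};

/// Per-worker buffer flushed to a shared writer once it reaches `capacity` bytes.
struct RollingBuffer<W: Write> {
    buffer: Vec<u8>,
    capacity: usize,
    shared: Arc<RwLock<W>>,
}

impl<W: Write> RollingBuffer<W> {
    fn new(shared: Arc<RwLock<W>>, capacity: usize) -> Self {
        Self {
            buffer: Vec::with_capacity(capacity),
            capacity,
            shared,
        }
    }

    /// Append one complete record, flushing if the buffer is now full.
    fn push_record(&mut self, record: &[u8]) -> io::Result<()> {
        self.buffer.extend_from_slice(record);
        if self.buffer.len() >= self.capacity {
            self.flush()?;
        }
        Ok(())
    }

    /// Write the buffered records to the shared writer while holding the lock.
    fn flush(&mut self) -> io::Result<()> {
        if self.buffer.is_empty() {
            return Ok(());
        }
        let mut writer = self.shared.write().expect("writer lock poisoned");
        writer.write_all(&self.buffer)?;
        drop(writer);
        self.buffer.clear();
        Ok(())
    }
}
```

Each worker has to call `flush` once more when its file is exhausted, otherwise the tail of the buffer is lost.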

Is some top/vocab useful? Or is this the same as piping the output of the parallel command?

We also need to assess whether this could be a modality of agg/count/stats/groupby/freq instead.

To allow for complex preprocessing we can probably rely on some $SHELL -c "command" invocation.
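Something along these lines, as a sketch for Unix-like systems only. The function name `spawn_preprocessing` is hypothetical, and falling back to `sh` when `$SHELL` is unset is an assumption, not something decided here.

```rust
use std::env;
use std::io;
use std::process::{Child, Command, Stdio};

/// Run `command` through `$SHELL -c`, with its stdout piped back to the caller.
fn spawn_preprocessing(command: &str) -> io::Result<Child> {
    // Assumption: fall back to `sh` if $SHELL is not set.
    let shell = env::var("SHELL").unwrap_or_else(|_| "sh".to_string());

    Command::new(shell)
        .arg("-c")
        .arg(command)
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .spawn()
}
```

The caller can then read the child's stdout as if it were the preprocessed file, for instance through `Child::wait_with_output` or by taking `child.stdout`.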

Yomguithereal commented 1 month ago

Be sure to correctly manage the child processes when exiting; check kill/wait:

https://stackoverflow.com/questions/30538004/how-do-i-ensure-that-a-spawned-child-process-is-killed-if-my-app-panics

https://users.rust-lang.org/t/how-to-make-sure-that-child-process-terminates-when-panicking/99189/4
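One common pattern for this is a guard whose `Drop` kills and then waits on the child. A minimal sketch, with hypothetical names `ChildGuard` and `spawn_guarded`. Note that `Drop` runs on normal scope exit and during an unwinding panic, but not with `panic = "abort"`, on `std::process::exit`, or when the parent itself is killed by a signal.

```rust
use std::io;
use std::process::{Child, Command};

/// Owns a child process and makes sure it is killed and reaped when dropped.
struct ChildGuard(Child);

impl Drop for ChildGuard {
    fn drop(&mut self) {
        // Errors are ignored on purpose: the child may already have exited.
        let _ = self.0.kill();
        // Waiting reaps the process so it does not linger as a zombie.
        let _ = self.0.wait();
    }
}

/// Spawn the command and wrap the resulting child in a guard.
fn spawn_guarded(command: &mut Command) -> io::Result<ChildGuard> {
    command.spawn().map(ChildGuard)
}
```

Keeping every spawned child behind such a guard means an early return with `?` cleans up the same way as a normal exit.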

Think of a simplified shlex version relying only on xan and working on Windows.
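As a starting point, a very small quote-aware splitter could look like the sketch below. The grammar is an assumption, since none is specified here: single and double quotes group words, and backslashes are deliberately left untouched so that Windows paths survive. The function name `split_args` is hypothetical.

```rust
/// Split a command line into arguments, honoring single and double quotes.
/// Returns `None` if a quote is left unterminated.
fn split_args(input: &str) -> Option<Vec<String>> {
    let mut args = Vec::new();
    let mut current = String::new();
    let mut in_token = false;
    let mut quote: Option<char> = None;

    for c in input.chars() {
        match quote {
            // Closing quote: leave quoted mode, keep building the same token.
            Some(q) if c == q => quote = None,
            // Inside quotes, every character is literal (backslashes included).
            Some(_) => current.push(c),
            // Opening quote: start a token even if it ends up empty.
            None if c == '"' || c == '\'' => {
                quote = Some(c);
                in_token = true;
            }
            // Unquoted whitespace ends the current token, if there is one.
            None if c.is_whitespace() => {
                if in_token {
                    args.push(std::mem::take(&mut current));
                    in_token = false;
                }
            }
            None => {
                current.push(c);
                in_token = true;
            }
        }
    }

    if quote.is_some() {
        return None;
    }
    if in_token {
        args.push(current);
    }
    Some(args)
}
```

For example, `split_args("xan search -s name 'John Doe'")` yields the five arguments `xan`, `search`, `-s`, `name` and `John Doe`, which could then be dispatched to xan directly without going through a shell.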