paracrawl / b64filter

Program for operating on files with one Base64-encoded document per line

Bad I/O interaction with perl #1

Closed: wwaites closed this issue 4 years ago

wwaites commented 4 years ago

For some input, e.g.

csd3:/rds/project/t2_vol4/rds-t2-cs119/internet_archive/wide00006-shards/mt/78/1/sentences.gz

the program deadlocks when given

echo 'while ($var = <STDIN>) { print($var) }' > cat.perl
gzip -dc sentences.gz | b64filter -q 1 perl cat.perl > /dev/null

however

gzip -dc sentences.gz | b64filter -q 1 cat > /dev/null

works just fine
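
One plausible reading of the symptom (a sketch of the presumed pipeline shape, not b64filter's actual code): b64filter appears to write each decoded document to the child's stdin and record its line count on a bounded queue of size -q, while a reader pops a count and collects that many lines of the child's output. With -q 1 the writer soon blocks on the full queue, the reader is stuck waiting for output the child has not flushed, and the child, which block-buffers its stdout when it is a pipe rather than a TTY (as perl does), is waiting for more input. A minimal Go sketch of that shape, with hypothetical names:

package main

// Sketch of the presumed pipeline shape (hypothetical names, error handling
// mostly elided; not the real b64filter implementation). A writer feeds
// lines to the child and records a per-document line count on a bounded
// channel of size -q; a reader pops a count and collects that many lines of
// child output.

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: sketch CMD [ARGS...]")
		os.Exit(1)
	}

	counts := make(chan int, 1) // -q 1: at most one pending document count

	child := exec.Command(os.Args[1], os.Args[2:]...)
	childIn, _ := child.StdinPipe()
	childOut, _ := child.StdoutPipe()
	if err := child.Start(); err != nil {
		panic(err)
	}

	// Writer: one "document" per input line here, standing in for the real
	// base64-decoded documents. Blocks when the counts channel is full.
	go func() {
		in := bufio.NewScanner(os.Stdin)
		for in.Scan() {
			fmt.Fprintln(childIn, in.Text())
			counts <- 1
		}
		childIn.Close()
		close(counts)
	}()

	// Reader: for each pending count, read that many lines back from the
	// child. If the child buffers its output, these reads stall while the
	// writer is already blocked on the full channel: deadlock.
	out := bufio.NewScanner(childOut)
	for n := range counts {
		for i := 0; i < n; i++ {
			if !out.Scan() {
				return
			}
			fmt.Println(out.Text())
		}
	}
	child.Wait()
}

Piping a large input through this with perl cat.perl as the child should stall in the same way, while cat as the child (which does not buffer its output like that) should run to completion.
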

wwaites commented 4 years ago

Temporary workaround: make the queue size larger than the number of lines in the input file, e.g. -q 1024. This is not ideal because it reads everything into RAM.

wwaites commented 4 years ago

Definitely related to output buffering. This perl script works:

while ($var = <STDIN>) { print($var); STDOUT->flush; }
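
(For what it's worth, enabling autoflush at the top of the script, e.g. with $| = 1;, should have the same effect: the point is that the child must flush its output to the pipe rather than block-buffer it.)
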
kpu commented 4 years ago

Dude, you need an unbounded queue of line counts. It won't take up much RAM relative to the lines being processed. Long pipelines can produce strange deadlocks otherwise.

kpu commented 4 years ago

https://github.com/kpu/preprocess/blob/master/preprocess/cache_main.cc#L135
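
The linked file is C++; as a rough Go illustration of the suggestion (a sketch of the general pattern, not a transcription of that code), an unbounded FIFO of line counts could look like the following. Push never blocks, so the writer side can always record a count, and only the integers are retained while documents are in flight inside the child:

package main

import "sync"

// countQueue is an unbounded FIFO of line counts (hypothetical name). Push
// never blocks; Pop blocks until a count is available or the queue has been
// closed and drained.
type countQueue struct {
	mu     sync.Mutex
	cond   *sync.Cond
	counts []int
	closed bool
}

func newCountQueue() *countQueue {
	q := &countQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *countQueue) Push(n int) {
	q.mu.Lock()
	q.counts = append(q.counts, n)
	q.mu.Unlock()
	q.cond.Signal()
}

func (q *countQueue) Close() {
	q.mu.Lock()
	q.closed = true
	q.mu.Unlock()
	q.cond.Broadcast()
}

func (q *countQueue) Pop() (int, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.counts) == 0 && !q.closed {
		q.cond.Wait()
	}
	if len(q.counts) == 0 {
		return 0, false
	}
	n := q.counts[0]
	q.counts = q.counts[1:]
	return n, true
}

func main() {
	// Toy usage: the writer side pushes counts as documents go to the child;
	// the reader side pops them to know how many output lines to collect.
	q := newCountQueue()
	go func() {
		for _, n := range []int{3, 1, 4} {
			q.Push(n)
		}
		q.Close()
	}()
	for {
		n, ok := q.Pop()
		if !ok {
			break
		}
		_ = n // read n lines of child output here
	}
}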

wwaites commented 4 years ago

Not convinced. The queue won't use too much RAM, but if the queue is growing too quickly, it is because the downstream side cannot consume fast enough. The output has to sit somewhere while it is waiting to be processed, so the effect is to keep not only the counts but all of the documents in RAM.

kpu commented 4 years ago

You are aware that Marian batches sentences to be translated, even on CPU, and internally sorts sentences by length to build batches of the same length? There are very good performance reasons why this queue will be long, and you're going to deadlock on them.

wwaites commented 4 years ago

I was not aware that Marian did that. Still, this is an argument for a queue of some finite size. Surely Marian won't blindly consume all of stdin before constructing its batches. For that usage you would have to set an appropriate queue size (the -q argument).

I guess it is possible for people to write programs that consume an unbounded amount of RAM, but that doesn't seem like a very good idea. Actually we can consume arbitrarily large amounts of RAM already, as there's nothing stopping a base64 line / document from being arbitrarily large. But I think that's a bug, not a feature.

kpu commented 4 years ago

The total amount that can be buffered is very hard to analyze with a long pipeline. Even in Marian, with a configurable number of lines, one doesn't know the sizes of the stdin/stdout buffers (which should be turned on), and they're probably denominated in bytes.

Somebody will inevitably stuff GNU parallel in between.

It's arguably the inner program's responsibility to have a finite buffer. If somebody wraps this around tac, shame on them.

About the worst thing your code can do is deadlock, burning time in a way that's hard to detect or analyze. At least die.

More to the point, by requiring a buffer size to be set, you are requiring it to be oversized, which is most likely wasteful.