Support process substitution to read keys from std::in

jermp commented 7 months ago

As suggested by @zacchiro, we should support process substitution to read keys from std::in.

Once done, we can specify, e.g., -i < (cat file) or -i < (zcat file.gz), etc.

@roberto-trani suggested to use the same trick as used in grep (https://man7.org/linux/man-pages/man1/grep.1.html) where

     -f FILE, --file=FILE

          Obtain patterns from FILE, one per line.  If this option
          is used multiple times or is combined with the -e
          (--regexp) option, search for all patterns given.  The
          empty file contains zero patterns, and therefore matches
          nothing.  If FILE is - , read patterns from standard
          input.

The only caveat is that: reading from std::cin prevents from iterating twice on the stream, hence we can only build the function but not check its correctness nor banchmark lookup (because this requires iterating again over the keys from the stream).

roberto-trani commented 7 months ago

Hi all, I created the branch 21-support-process-substitution-to-read-keys-from-stdin to support read from standard input.

Bash pipes and redirects are now allowed using the option -i -. At the same time, process substitution works in internal memory but not in external memory because it is not possible to open the non-regular file using mmap.

zacchiro commented 7 months ago

Thanks Roberto, this is amazing!

More generally, I'm a bit confused about the usage of the expression "external memory" that pthash makes.

There is "external memory" in the sense that the work memory you need for mph building is also on disk; and there is "external memory" in the sense that you use mmap() on the original input file rather than sequential/streaming reading.

You can have either of these or both, as it is now correctly captured by the separate CLI flags --external (for the former concern) and --mmap (for the latter). I was initially confused by this while reading pthash doc, maybe it's worth a clarification (but I have no concrete suggestions at the moment, sorry).

jermp commented 7 months ago

Hi @zacchiro, just merged into master. See now the examples in the README here: https://github.com/jermp/pthash#reading-keys-from-standard-input.

Thank you again Roberto!

roberto-trani commented 7 months ago

Hi @zacchiro ,

What you say about external memory is right, and I thank you for your feedback.

I removed the --mmap parameter in this version (but it wasn't in the documentation, right?), as when reading from huge regular files, mmap is more efficient than other options. Instead, when using non regular files, piping is the right way to go. This decision simplified the development and the code.

We came up with this "agreement" because process substitution is a feature of bash, zsh and ksh (and possibly others, I don't know) that isn't POSIX and shouldn't be used in portable code. Therefore, the only edge case when using --external is with process substitution, but in that case you can use pipe (e.g., zcat file.gz | ./buidl .... -i -).

zacchiro commented 7 months ago

Hi @zacchiro, just merged into master. See now the examples in the README here: https://github.com/jermp/pthash#reading-keys-from-standard-input.

Thank you again Roberto!

Awesome, thanks to both of you!

I confirm it works for me with --external -i - ; I've already tried a full run of it on my "small" dataset (no issues) and I'm currently running it on my big dataset.

zacchiro commented 7 months ago

Upon benchmarking, it looks like cat foo.txt | ./build -i - is about 3x slower than the equivalent ./build -i foo.txt, which makes it unusable on very large datasets (e.g., ~1.6 TiB like in my "big" dataset case). Note that here there is no fault of the decompression overhead, as I'm comparing plain file versus cat and not zcat.

Interestingly enough, this was not the case in the previous implementation even when not using --mmap, when it briefly existed. So I don't think this slowdown is due to an mmap-read versus non-mmap-read, but rather to some stream handling that is different between -i - and -i filename.

roberto-trani commented 7 months ago

Hi @zacchiro ,

Thank you again for your feedback. I did a similar benchmark on 50M strings using the internal memory construction, where the two versions have the same implementation. I will report the whole time (measured using time) and the total construction time.

The outcomes of my tests are:

./build -i filename (regular file) employed 25 seconds in total and 20 seconds for construction
.build -i <(cat filename) (process substitution) employed 25 seconds in total and 20 seconds for construction
cat filename | .build -i - (pipe) employed 45 seconds in total and 20 seconds for construction

It means that most of the time is spent in inter-process communication when using pipe, because there are no implementation differences from our side in the three tested cases (there would be only in external memory where -i filename employs a mmap).

roberto-trani commented 7 months ago

I tried the same kind of tests in external memory and I found that:

on my laptop, there is no significant difference between mmap and istream on 50M strings;
using istream, I achieved the same results I had in internal memory (with -i filename and -i <(cat filename) equally fast and pipe much slower than previous approaches).

The problem of istream (and process substitution) is that it will not be possible to seek at the beginning of the "file" to perform check and lookup tests after the construction.

zacchiro commented 7 months ago

I agree with your analysis about IPC being the bottleneck here.

So the ideal combination for my use case would be process substitution + external memory, (because my dataset is too big for process substitution + internal memory). But the combination in question is currently not supported because if the input file is not "-" the build tool will try to reread it again, right? Or am I missing something?

Cheers -- Sent from my mobile phone. Please excuse my brevity and top-posting.

jermp commented 7 months ago

So the ideal combination for my use case would be process substitution + external memory,

Why not trying first zstdcat file.zstd | ./build (...)? If IO or IPC is the problem here...there is not much we can do about it, except for supporting iteration over zstd compressed files. But this is not something PTHash nor its build tool should support natively. Ideally, users write their own iterator and pass it to the build method of the PTHash class.

But the combination in question is currently not supported because if the input file is not "-" the build tool will try to reread it again, right?

We only read it again for check and lookup. On my Mac, I'm not able to use mmap with process substitution.

zacchiro commented 7 months ago

I fear we are talking past each other at this point, due to the high amount of details/variables floating around. But just to recap:

Why not trying first zstdcat file.zstd | ./build (...)?

I have tried this. It works. But the reading phase is too slow. It's like 3-4 hours vs 1-1.5 hours in my benchmark.

We only read it again for check and lookup. On my Mac, I'm not able to use mmap with process substitution.

That is "normal", because process substitution gives you a non regular file (a FIFO), which cannot be mmap-ed. What I still don't understand is why ./build cannot read that file without mmapping when --external is given. If the stream is effectively ready only once, that combination (process substitution + --external + no-mmap) would tick all the boxes of my use case.

jermp commented 7 months ago

That is "normal", because process substitution gives you a non regular file (a FIFO), which cannot be mmap-ed.

Yes, exactly.

What I still don't understand is why ./build cannot read that file without mmapping when --external is given.

You're right, there is no reason we should stick to mmap. Indeed if you use my past commit where the option --mmap existed and you remove the copy constructor/assignment from the class lines_iterator_wrapper, it should work for your case. I'll check this!

jermp / pthash

Support process substitution to read keys from std::in #21