mpetri opened 4 years ago
To put some context into this:
I was creating a collection, and processing the documents (the part that outputs `[Batch 1234] Processed documents`) took the same amount of time as the "remapping IDs" part of the parsing.
Another idea: You could even concatenate in parallel as you know where the data for the different batches will end up in the big file.
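For what it's worth, a minimal sketch of that idea, with entirely hypothetical names (none of this is existing PISA code): a prefix sum over the batch sizes gives each batch's start offset, and every batch can then be copied into the pre-sized output independently.

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Sketch: concatenate per-batch files into one output file in parallel.
// Offsets are known up front, so the writes never overlap.
void concatenate_batches(const std::vector<std::string>& batch_files,
                         const std::string& output) {
    // A prefix sum over the batch sizes gives each batch's start offset.
    std::vector<std::uintmax_t> offsets(batch_files.size() + 1, 0);
    for (std::size_t i = 0; i < batch_files.size(); ++i) {
        offsets[i + 1] = offsets[i] + std::filesystem::file_size(batch_files[i]);
    }
    // Pre-size the output so every thread can write at its own offset.
    { std::ofstream create(output, std::ios::binary); }
    std::filesystem::resize_file(output, offsets.back());

    std::vector<std::thread> workers;  // one worker per batch, for simplicity
    for (std::size_t i = 0; i < batch_files.size(); ++i) {
        workers.emplace_back([&, i] {
            std::ifstream in(batch_files[i], std::ios::binary);
            // Each thread gets its own handle, so seek positions don't race.
            std::fstream out(output, std::ios::binary | std::ios::in | std::ios::out);
            out.seekp(static_cast<std::streamoff>(offsets[i]));
            out << in.rdbuf();
        });
    }
    for (auto& worker : workers) { worker.join(); }
}
```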
I agree, there is much to be improved. I could only spare a moment to look at it for now, but from what I can tell:
As for `map` vs. `unordered_map`: either works; maybe the `map` was a remnant of an earlier version, and `unordered_map` should do just fine. A little side note: we had a discussion (or more like started one) about using an actually efficient hash map, since `unordered_map` is garbage, to put it simply. But we never arrived at a preferred replacement, so that's still an open question.

Question 1: How long did the concatenation take? Since these are bigger chunks, I'm expecting it not to be anything close to remapping.

Now, regarding the disk space: deleting things right after merging is easy, but I'm not sure it helps you a lot if we merge directly into one big file, because you'll peak at double the size of the index anyway; you'd just get rid of some of it sooner. That's also why I'm asking about the concatenation: if it's not too expensive, we might want to leave it be.

Question 2: How big is the collection you're trying to index?
There are definitely several things we can do here, some easier than others. Say, replacing the `map` should be easy, same with eager deletion of the batches (maybe we should leave an option to disable that with an additional flag?). Parallel remapping can be done easily enough but should be checked for performance when using large numbers of threads.
The more complex stuff that requires some thought would be buffering in memory when remapping and flushing every now and then. I'm expecting this to speed things up, but I'm not sure exactly how the OS flushes changes to the memory-mapped file, so I can't be sure.
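To make the buffering idea concrete, here is a rough sketch (hypothetical names, not the actual `parse_collection` code) that remaps IDs through an in-memory buffer and writes it out whenever it fills, sidestepping the memory-mapped file entirely:

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <vector>

// Hypothetical sketch of the buffered-remap idea: translate each term ID
// through a precomputed old-ID -> new-ID mapping and write the results
// through an explicit in-memory buffer instead of relying on the OS to
// flush a memory-mapped file.
void remap_ids(std::istream& in, std::ostream& out,
               const std::vector<std::uint32_t>& mapping,
               std::size_t buffer_capacity = 1 << 20) {
    std::vector<std::uint32_t> buffer;
    buffer.reserve(buffer_capacity);
    std::uint32_t id = 0;
    while (in.read(reinterpret_cast<char*>(&id), sizeof(id))) {
        buffer.push_back(mapping[id]);
        if (buffer.size() == buffer_capacity) {  // flush every now and then
            out.write(reinterpret_cast<const char*>(buffer.data()),
                      static_cast<std::streamsize>(buffer.size() * sizeof(std::uint32_t)));
            buffer.clear();
        }
    }
    // Flush whatever remains in the buffer.
    out.write(reinterpret_cast<const char*>(buffer.data()),
              static_cast<std::streamsize>(buffer.size() * sizeof(std::uint32_t)));
}
```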
Lastly, we're stretched a little bit thin at the moment, so not sure when someone can pick these up. However, we always welcome pull requests. We will also be grateful for any insights about what's slow, what's fast and what other potentially pertinent things you encounter during the build.
The collection was ~400M docs in plaintext format, processed using 60 threads. The rough times for the different steps are:
- 4.3 hours - Processed documents/records
- 2 mins - Merging titles
- 2 mins - Creating document lexicon
- 0 secs - Merging URLs
- 1.5 hours - Collecting terms
- 2 mins - Writing terms
- 10 mins - Mapping terms
- 4.2 hours - Remapping IDs
- 4.6 hours - Concatenating batches
Okay, that's helpful, thanks. We definitely have to improve on the later stages.
I would like to work on this, but parallelizing these things in C++ is quite a bit more awkward than in languages like Rust.
The main issue was that the whole process was neither CPU- nor I/O-bound. There was plenty of I/O capacity, and for most parts the CPU was also barely used. That's why I opened this issue.
As a side question: would it be faster/better to do Lucene -> CIFF -> PISA?
> I would like to work on this, but parallelizing these things in C++ is quite a bit more awkward than in languages like Rust.
It's actually not as bad with TBB: you can use parallel loops, thread pools, and so on. There are some examples throughout the codebase.
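For instance, a minimal TBB parallel loop over the batches might look like this (a sketch only; `Batch` and `process_batch` are hypothetical stand-ins, not PISA functions):

```cpp
#include <cstddef>
#include <vector>

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

struct Batch {};               // stand-in for a parsed batch of documents
void process_batch(Batch&);    // hypothetical per-batch remapping work

// Remap all batches in parallel with a TBB parallel loop.
void remap_all_batches(std::vector<Batch>& batches) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, batches.size()),
        [&](const tbb::blocked_range<std::size_t>& range) {
            for (std::size_t i = range.begin(); i != range.end(); ++i) {
                process_batch(batches[i]);
            }
        });
}
```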
> The main issue was that the whole process was neither CPU- nor I/O-bound. There was plenty of I/O capacity, and for most parts the CPU was also barely used. That's why I opened this issue.
Running the merging part in one thread will cause that for sure: only one thread is doing the work, and it spends a lot of time waiting for I/O. We could think of doing it asynchronously, or simply pass the computed data to a writer thread. But I think there's more we can do.
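A minimal sketch of that writer-thread idea, using TBB's bounded queue (everything here is illustrative, assuming chunks of merged bytes; none of it is existing PISA code):

```cpp
#include <optional>
#include <ostream>
#include <thread>
#include <vector>

#include <tbb/concurrent_queue.h>

// Workers push finished chunks into a bounded queue and a single writer
// drains it, so the computation no longer stalls on disk I/O.
void merge_with_writer_thread(std::ostream& out) {
    // An empty optional is the "no more data" signal.
    tbb::concurrent_bounded_queue<std::optional<std::vector<char>>> queue;
    queue.set_capacity(16);  // bounds memory use; tune as needed

    std::thread writer([&] {
        std::optional<std::vector<char>> chunk;
        for (queue.pop(chunk); chunk.has_value(); queue.pop(chunk)) {
            out.write(chunk->data(), static_cast<std::streamsize>(chunk->size()));
        }
    });

    // Producer side: compute each merged piece and hand it off to the writer.
    for (int piece = 0; piece < 4; ++piece) {
        std::vector<char> chunk(1024, 'x');  // placeholder for real merged data
        queue.push(std::make_optional(std::move(chunk)));
    }

    queue.push(std::nullopt);  // tell the writer we are done
    writer.join();
}
```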
> As a side question: would it be faster/better to do Lucene -> CIFF -> PISA?
I don't think anybody has ever checked that; I certainly haven't. For one, I have no idea how fast Lucene is. I also don't have a good guess at how fast CIFF to PISA will be; maybe @JMMackenzie will have some guesses. The conversion is not multi-threaded, though it is written in Rust, so you might want to take a look. But I'm not sure if protobuf allows random access or implements a parallel iterator or something. If not, then you'd have to read and serve data to workers to write it back to another format, so I'm not sure how much there is to gain.
> As a side question: would it be faster/better to do Lucene -> CIFF -> PISA?
Oops, scrubbing what I had here, as I had measured "PISA to CIFF" and not the other way around. I am sure I had timings somewhere, but I can't find them right now.
Looking at Anserini/Lucene (this paper from Jimmy Lin: https://arxiv.org/pdf/1910.11028.pdf), it seems like Lucene can handle ClueWeb12-B13 in around 2 hours (with full positions and collection storage); it's probably under an hour for a regular index.
I estimate the Lucene -> CIFF part takes around an hour, but unfortunately I can't remember exactly.
Jimmy Lin has kindly managed to dig up some information from his personal Mac Pro (SSD):
```
$ time target/appassembler/bin/ExportAnseriniLuceneIndex -output cw12b-complete-20200309.ciff.gz \
    -index lucene-index-ciff.cw12b.20200309 -description "Anserini v0.7.2, ClueWeb12-B13 regression"

real    167m21.523s
user    283m30.397s
sys     4m52.060s
```
So that's the time taken to export Lucene to CIFF (it runs single-threaded).
Unless there are scientifically interesting questions you want to explore, I would advocate just giving up on indexing and letting Anserini/Lucene do it for you via CIFF. Why waste engineering cycles on this?
My entire argument in https://arxiv.org/pdf/1910.11028.pdf is that we've pushed indexing to basically its limits with current in-memory inversion/flush to disk/merge approach. Unless we dramatically rethink the pipeline, we've reached a plateau. Just buy more SSDs to go faster.
Yep, I think this might be the best approach.
Also, we're one issue away from the entire indexing pipeline in Anserini being pip-installable: https://github.com/castorini/pyserini/issues/77
Something like:
```
$ pip install pyserini
...
$ python -m pyserini.index --collection cw12b --path /path/to/collection ...
```
Bam!
If you want, we can move the CIFF exporter into pyserini too, and everything will be pip-able. You could then script it easily from your end.
It would definitely be nice to be able to index and export from Python that easily. We could then include it as an option in our documentation.
I'm mostly looking for ease of use and speed. I will try pyserini -> ciff -> pisa and report some statistics when I'm done.
Maybe the docs should reflect that there are other options to create an index.
Yes, we should mention CIFF in the docs.
Just to follow up briefly, I ran a few quick-n-dirty Gov2 experiments.
Indexing Gov2 via Anserini takes ~20 mins with 30 threads. Taking this Lucene index to CIFF takes about ~45-60 mins. Taking CIFF to PISA's canonical format takes ~15 mins.
End-to-end, going from the raw corpus to files that PISA can deal with (but not yet query over) takes ~2 hours at most. This was also on a loaded machine, so it would probably be faster depending on your resources. The bottleneck is the single-threaded "Lucene to CIFF" part, which populates the protobuf from the Lucene index. There's not much you can do to speed that up, but as a once-off operation it's probably fine.
@JMMackenzie thanks for taking the time to check that out. It sounds like it's altogether faster than with PISA alone. One side note is that you won't end up with a full forward index, just the inverted index and the term and document lexicons, but that might not be important in the majority of cases.
I just wanted to confirm that lucene -> ciff -> pisa is much faster than the `parse_collection` tool provided by PISA.
I think this issue can be closed, but maybe Lucene + CIFF should be the default way of creating a PISA index?
Thanks Matthias, I guess we should at least document the tooling so users have a choice, including how to get queries formatted via Anserini (etc).
Thanks for the confirmation. Let's keep it open for now. We definitely want to improve some things around parsing, especially since some of them shouldn't be too hard. The CIFF route is good to have, but native parsing capabilities have their own advantages, so we want to see what can be improved.
That said, I don't think CIFF is mentioned anywhere in the docs, so we should definitely mention it.
**Describe the solution you'd like**

The `parse_collection` command seems to have several potential areas of improvement:

- "Remapping IDs" could be done in parallel.
- "Concatenating batches" and "Remapping IDs" could be combined into one step. Currently, remapping IDs rewrites the content on disk only to write it again in the "Concatenating batches" step (see the sketch after this list).
- The `run()` method could use an `unordered_map` instead of a `map`, or is there a reason for this? We are remapping later anyway.
- After remapping/concatenating a batch, the original file could be deleted right away instead of only at the end.

Some of these steps would prohibit resuming the merging process, but they would reduce parsing time and halve the space usage.
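As a rough illustration of the second point, a combined remap-and-concatenate pass could look something like this (hypothetical names and a simplified uint32 ID stream, not the actual on-disk format):

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch of the combined step: remap each batch's IDs and
// append the result straight to the final file, then delete the batch,
// instead of rewriting batches on disk and concatenating them afterwards.
void remap_and_concatenate(const std::vector<std::string>& batch_files,
                           const std::vector<std::uint32_t>& mapping,
                           const std::string& output) {
    std::ofstream out(output, std::ios::binary);
    for (const auto& file : batch_files) {
        std::ifstream in(file, std::ios::binary);
        std::uint32_t id = 0;
        while (in.read(reinterpret_cast<char*>(&id), sizeof(id))) {
            std::uint32_t remapped = mapping[id];
            out.write(reinterpret_cast<const char*>(&remapped), sizeof(remapped));
        }
        in.close();
        std::filesystem::remove(file);  // eager delete: halves peak disk usage
    }
}
```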