mpetri opened 4 years ago
To put some context into this:
I was creating a collection, and processing the documents (the part that outputs `[Batch 1234] Processed documents`) took the same amount of time as the "remapping IDs" part of the parsing.
Another idea: You could even concatenate in parallel as you know where the data for the different batches will end up in the big file.
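For what it's worth, a minimal sketch of that idea, with entirely hypothetical names (none of this is existing PISA code): a prefix sum over the batch sizes gives each batch's start offset, and every batch can then be copied into the pre-sized output independently.

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Sketch: concatenate per-batch files into one output file in parallel.
// Offsets are known up front, so the writes never overlap.
void concatenate_batches(const std::vector<std::string>& batch_files,
                         const std::string& output) {
    // A prefix sum over the batch sizes gives each batch's start offset.
    std::vector<std::uintmax_t> offsets(batch_files.size() + 1, 0);
    for (std::size_t i = 0; i < batch_files.size(); ++i) {
        offsets[i + 1] = offsets[i] + std::filesystem::file_size(batch_files[i]);
    }
    // Pre-size the output so every thread can write at its own offset.
    { std::ofstream create(output, std::ios::binary); }
    std::filesystem::resize_file(output, offsets.back());

    std::vector<std::thread> workers;  // one worker per batch, for simplicity
    for (std::size_t i = 0; i < batch_files.size(); ++i) {
        workers.emplace_back([&, i] {
            std::ifstream in(batch_files[i], std::ios::binary);
            // Each thread gets its own handle, so seek positions don't race.
            std::fstream out(output, std::ios::binary | std::ios::in | std::ios::out);
            out.seekp(static_cast<std::streamoff>(offsets[i]));
            out << in.rdbuf();
        });
    }
    for (auto& worker : workers) { worker.join(); }
}
```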
I agree, there is much to be improved. I could only spare a moment to look at it for now, but from what I can tell:
As for `map` vs. `unordered_map`: either works; maybe the `map` was a remnant of an earlier version, and `unordered_map` should do just fine. A little side note: we had a discussion (or more like started one) about using an actually efficient hash map, since `unordered_map` is garbage, to put it simply. But we never arrived at a preferred replacement, so that's still an open question.

Question 1: How long did the concatenation take? Since these are bigger chunks, I'm expecting it not to be anything close to remapping.

Now, regarding the disk space: deleting things right after merging is easy, but I'm not sure it helps you a lot if we merge directly into one big file, because you'll peak at double the size of the index anyway; you'd just get rid of some of it sooner. That's also why I'm asking about the concatenation: if it's not too expensive, we might want to leave it be.

Question 2: How big is the collection you're trying to index?
There are definitely several things we can do here, some easier than others. Say, replacing the `map` should be easy, same with eager deletion of the batches (maybe we should leave an option to disable that with an additional flag?). Parallel remapping can be done easily enough but should be checked for performance when using large numbers of threads.
The more complex stuff that requires some thought would be buffering in memory when remapping and flushing every now and then. I'm expecting this to speed things up, but I'm not sure exactly how the OS flushes changes to the memory-mapped file, so I can't be sure.
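To make the buffering idea concrete, here is a rough sketch (hypothetical names, not the actual `parse_collection` code) that remaps IDs through an in-memory buffer and writes it out whenever it fills, sidestepping the memory-mapped file entirely:

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <vector>

// Hypothetical sketch of the buffered-remap idea: translate each term ID
// through a precomputed old-ID -> new-ID mapping and write the results
// through an explicit in-memory buffer instead of relying on the OS to
// flush a memory-mapped file.
void remap_ids(std::istream& in, std::ostream& out,
               const std::vector<std::uint32_t>& mapping,
               std::size_t buffer_capacity = 1 << 20) {
    std::vector<std::uint32_t> buffer;
    buffer.reserve(buffer_capacity);
    std::uint32_t id = 0;
    while (in.read(reinterpret_cast<char*>(&id), sizeof(id))) {
        buffer.push_back(mapping[id]);
        if (buffer.size() == buffer_capacity) {  // flush every now and then
            out.write(reinterpret_cast<const char*>(buffer.data()),
                      static_cast<std::streamsize>(buffer.size() * sizeof(std::uint32_t)));
            buffer.clear();
        }
    }
    // Flush whatever remains in the buffer.
    out.write(reinterpret_cast<const char*>(buffer.data()),
              static_cast<std::streamsize>(buffer.size() * sizeof(std::uint32_t)));
}
```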
Lastly, we're stretched a little bit thin at the moment, so not sure when someone can pick these up. However, we always welcome pull requests. We will also be grateful for any insights about what's slow, what's fast and what other potentially pertinent things you encounter during the build.
The collection was ~400M docs in plaintext format, processed using 60 threads. The rough times for the different steps are:
- 4.3 hours - Processed documents/records
- 2 mins - Merging titles
- 2 mins - Creating document lexicon
- 0 secs - Merging URLs
- 1.5 hours - Collecting terms
- 2 mins - Writing terms
- 10 mins - Mapping terms
- 4.2 hours - Remapping IDs
- 4.6 hours - Concatenating batches
Okay, that's helpful, thanks. We definitely have to improve on the later stages.
I would like to work on this, but parallelizing these things in C++ is quite a bit more awkward than in languages like Rust.
The main issue was that the whole process was neither CPU- nor I/O-bound. There was plenty of I/O capacity, and for most parts the CPU was also barely used. That's why I opened this issue.
As a side question: would it be faster/better to do Lucene -> CIFF -> PISA?
> I would like to work on this, but parallelizing these things in C++ is quite a bit more awkward than in languages like Rust.
It's actually not as bad with TBB: you can use parallel loops, thread pools, and so on. There are some examples throughout the codebase.
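For instance, a minimal TBB parallel loop over the batches might look like this (a sketch only; `Batch` and `process_batch` are hypothetical stand-ins, not PISA functions):

```cpp
#include <cstddef>
#include <vector>

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

struct Batch {};               // stand-in for a parsed batch of documents
void process_batch(Batch&);    // hypothetical per-batch remapping work

// Remap all batches in parallel with a TBB parallel loop.
void remap_all_batches(std::vector<Batch>& batches) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, batches.size()),
        [&](const tbb::blocked_range<std::size_t>& range) {
            for (std::size_t i = range.begin(); i != range.end(); ++i) {
                process_batch(batches[i]);
            }
        });
}
```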
> The main issue was that the whole process was neither CPU- nor I/O-bound. There was plenty of I/O capacity, and for most parts the CPU was also barely used. That's why I opened this issue.
Running the merging part in one thread will cause that for sure: only one thread is doing the work, and it spends a lot of time waiting for I/O. We could think of doing it asynchronously, or simply pass the computed data to a writer thread. But I think there's more we can do.
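A minimal sketch of that writer-thread idea, using TBB's bounded queue (everything here is illustrative, assuming chunks of merged bytes; none of it is existing PISA code):

```cpp
#include <optional>
#include <ostream>
#include <thread>
#include <vector>

#include <tbb/concurrent_queue.h>

// Workers push finished chunks into a bounded queue and a single writer
// drains it, so the computation no longer stalls on disk I/O.
void merge_with_writer_thread(std::ostream& out) {
    // An empty optional is the "no more data" signal.
    tbb::concurrent_bounded_queue<std::optional<std::vector<char>>> queue;
    queue.set_capacity(16);  // bounds memory use; tune as needed

    std::thread writer([&] {
        std::optional<std::vector<char>> chunk;
        for (queue.pop(chunk); chunk.has_value(); queue.pop(chunk)) {
            out.write(chunk->data(), static_cast<std::streamsize>(chunk->size()));
        }
    });

    // Producer side: compute each merged piece and hand it off to the writer.
    for (int piece = 0; piece < 4; ++piece) {
        std::vector<char> chunk(1024, 'x');  // placeholder for real merged data
        queue.push(std::make_optional(std::move(chunk)));
    }

    queue.push(std::nullopt);  // tell the writer we are done
    writer.join();
}
```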
> As a side question: would it be faster/better to do Lucene -> CIFF -> PISA?
I don't think anybody has ever checked that; I certainly haven't. For one, I have no idea how fast Lucene is. I also don't have a good guess at how fast CIFF to PISA will be; maybe @JMMackenzie will have some guesses. The conversion is not multi-threaded, though it is written in Rust, so you might want to take a look. But I'm not sure if protobuf allows random access or implements a parallel iterator or something. If not, then you'd have to read and serve data to workers to write it back to another format, so I'm not sure how much there is to gain.
> As a side question: would it be faster/better to do Lucene -> CIFF -> PISA?
Oops, scrubbing what I had here, as I had measured "PISA to CIFF" and not the other way around. I am sure I had timings somewhere, but I can't find them right now.
Looking at Anserini/Lucene (this paper from Jimmy Lin: https://arxiv.org/pdf/1910.11028.pdf), it seems like Lucene can handle ClueWeb12-B13 in around 2 hours (with full positions and collection storage); it's probably under an hour for a regular index.
I estimate the Lucene -> CIFF part takes around an hour, but unfortunately I can't remember exactly.
Jimmy Lin has kindly managed to dig up some information from his personal Mac Pro (SSD):
```
$ time target/appassembler/bin/ExportAnseriniLuceneIndex -output cw12b-complete-20200309.ciff.gz \
    -index lucene-index-ciff.cw12b.20200309 -description "Anserini v0.7.2, ClueWeb12-B13 regression"

real    167m21.523s
user    283m30.397s
sys     4m52.060s
```
So that's the time taken to export Lucene to CIFF (it runs single-threaded).
Unless there are scientifically interesting questions you want to explore, I would advocate just giving up on indexing and letting Anserini/Lucene do it for you via CIFF. Why waste engineering cycles on this?
My entire argument in https://arxiv.org/pdf/1910.11028.pdf is that we've pushed indexing to basically its limits with current in-memory inversion/flush to disk/merge approach. Unless we dramatically rethink the pipeline, we've reached a plateau. Just buy more SSDs to go faster.
Yep, I think this might be the best approach.
Also, we're one issue away from the entire indexing pipeline in Anserini being pip-installable: https://github.com/castorini/pyserini/issues/77
Something like:
```
$ pip install pyserini
...
$ python -m pyserini.index --collection cw12b --path /path/to/collection ...
```
Bam!
If you want, we can move the CIFF exporter into pyserini too, and everything will be pip-able. You could then script it easily from your end.
It would definitely be nice to be able to index and export from Python that easily. We could then include it as an option in our documentation.
I'm mostly looking for ease of use and speed. I will try pyserini -> ciff -> pisa and report some statistics when I'm done.
Maybe the docs should reflect that there are other options to create an index.
Yes, we should mention CIFF in the docs.
Just to follow up briefly, I ran a few quick-n-dirty Gov2 experiments.
Indexing Gov2 via Anserini takes ~20 mins with 30 threads. Taking this Lucene index to CIFF takes about ~45-60 mins. Taking CIFF to PISA's canonical format takes ~15 mins.
End-to-end, going from the raw corpus to files that PISA can deal with (but not yet query over) takes ~2 hours at most. This was also on a loaded machine, so it would probably be faster depending on your resources. The bottleneck is the single-threaded "Lucene to CIFF" part, which populates the protobuf from the Lucene index. There's not much you can do to speed that up, but as a once-off operation it's probably fine.
@JMMackenzie thanks for taking the time to check that out. It sounds like it's altogether faster than with PISA alone. One side note is that you won't end up with a full forward index, just the inverted index and the term and document lexicons, but that might not be important in the majority of cases.
I just wanted to confirm that lucene -> ciff -> pisa is much faster than the `parse_collection` tool provided by PISA.
I think this issue can be closed, but maybe Lucene + CIFF should be the default way of creating a PISA index?
Thanks Matthias, I guess we should at least document the tooling so users have a choice, including how to get queries formatted via Anserini (etc).
Thanks for the confirmation. Let's keep it open for now. We definitely want to improve some things around parsing, especially since some of them shouldn't be too hard. The CIFF route is good to have, but native parsing capabilities have their own advantages, so we want to see what can be improved.
That said, I don't think CIFF is mentioned anywhere in the docs, so we should definitely mention it.
**Describe the solution you'd like**

The `parse_collection` command seems to have several potential areas of improvement:

- "Remapping IDs" could be done in parallel.
- "Concatenating batches" and "Remapping IDs" could be combined into one step. Currently, remapping IDs rewrites the content on disk only to write it again in the "Concatenating batches" step (see the sketch after this list).
- The `run()` method could use an `unordered_map` instead of a `map`, or is there a reason for this? We are remapping later anyway.
- After remapping/concatenating a batch, the original file could be deleted right away instead of only at the end.

Some of these steps would prohibit resuming the merging process, but they would reduce parsing time and halve the space usage.
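As a rough illustration of the second point, a combined remap-and-concatenate pass could look something like this (hypothetical names and a simplified uint32 ID stream, not the actual on-disk format):

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch of the combined step: remap each batch's IDs and
// append the result straight to the final file, then delete the batch,
// instead of rewriting batches on disk and concatenating them afterwards.
void remap_and_concatenate(const std::vector<std::string>& batch_files,
                           const std::vector<std::uint32_t>& mapping,
                           const std::string& output) {
    std::ofstream out(output, std::ios::binary);
    for (const auto& file : batch_files) {
        std::ifstream in(file, std::ios::binary);
        std::uint32_t id = 0;
        while (in.read(reinterpret_cast<char*>(&id), sizeof(id))) {
            std::uint32_t remapped = mapping[id];
            out.write(reinterpret_cast<const char*>(&remapped), sizeof(remapped));
        }
        in.close();
        std::filesystem::remove(file);  // eager delete: halves peak disk usage
    }
}
```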