mikemccand / luceneutil

Various utility scripts for running Lucene performance tests
Apache License 2.0
205 stars 115 forks source link

add stored fields benchmark. #145

Closed rmuir closed 3 years ago

rmuir commented 3 years ago

Currently nothing really prevents regressions in the stored fields.

See discussion: https://issues.apache.org/jira/browse/LUCENE-10170

Here's my geonames-based benchmark, that i wrote up the last time I had to debug the stored fields. Maybe we can get it into a PR and hooked in somehow to a chart in the nightly bench.

import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.codecs.lucene90.Lucene90Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.FSDirectory;

public class Indexer {
  public static void main(String args[]) throws Exception {    
    FSDirectory dir = FSDirectory.open(Paths.get("/tmp/index"));
    for (String file : dir.listAll()) {
      dir.deleteFile(file);
    }
    IndexWriterConfig iwc = new IndexWriterConfig();
    iwc.setCodec(new Lucene90Codec(Lucene90Codec.Mode.BEST_COMPRESSION));
    iwc.setMergeScheduler(new SerialMergeScheduler());
    IndexWriter iw = new IndexWriter(dir, iwc);
    LineNumberReader reader = new LineNumberReader(new InputStreamReader(Files.newInputStream(Paths.get("/home/rmuir/workspace/IndexGeonames/src/allCountries.txt"))));
    long t0 = System.currentTimeMillis();
    indexDocs(iw, reader);
    System.out.println("time: " + (System.currentTimeMillis() - t0));
    reader.close();
    iw.close();
    dir.close();
  }

  static void indexDocs(IndexWriter iw, LineNumberReader reader) throws Exception {
    Document doc = new Document();
    Field fields[] = new Field[19];
    for (int i = 0; i < fields.length; i++) {
      fields[i] = new StoredField("field " + i, "");
      doc.add(fields[i]);
    }

    String line = null;
    while ((line = reader.readLine()) != null) {
      if (reader.getLineNumber() % 1000 == 0) {
        System.out.println("doc: " + reader.getLineNumber());
      }
      if (reader.getLineNumber() == 100000) {
        break;
      }
      String values[] = line.split("\t");
      if (values.length != fields.length) {
        throw new RuntimeException("bogus: " + values);
      }
      for (int i = 0; i < values.length; i++) {
        fields[i].setStringValue(values[i]);
      }
      iw.addDocument(doc);
      iw.flush();
    }
  }
}
msokolov commented 3 years ago

Just noting that benchUtil.py does have these options: {{-store}} ("store body fields") and {{-loadStoredFields}}. Maybe they could be used along with other existing tasks? Still, a dedicated test for stored fields would probably be better so you could separate out the impacts.

rmuir commented 3 years ago

@msokolov I think the issue with enabling stored fields for existing tasks, is that they easily get "drowned out" by other features: this applies to index time, merge time, and retrieval time.

If things like postings lists, points, docvalues, vectors, etc are enabled, we may see no change to such benchmarks when there is a performance regression to the stored fields.

Additionally, stored fields have tricky bulk merge optimizations that can also hide bad performance.

That's why I wrote a standalone benchmark to target the relevant code when debugging https://issues.apache.org/jira/browse/LUCENE-9827. It ran over 10x faster for BEST_SPEED after we fixed the issue! (24x faster with BEST_COMPRESSION).

The benchmark does a few things to zero in on stored fields perf:

mikemccand commented 3 years ago

I think as a start we should simple run @rmuir's cool pure stored fields benchmarks in nightly benchmarks? We could run the benchmark (about how long does it take?) and record the resulting index size. It's a good start. But it is indexing only?

Maybe instead of flushing after each doc we could let IW do its thing, and then we could (fairly) report indexing throughput too, so we could see impact of changes that improve indexing performance for stored fields.

Also I assume you are granting ASL2 for it :) Thank you!

jpountz commented 3 years ago

we could let IW do its thing

Unfortunately this would most likely create a single segment since stored fields don't consume per-document heap in IndexingChain, so this would miss to catch any merge-time regressions like the one that Robert created this benchmark for.

rmuir commented 3 years ago

It is true that the benchmark is indexing only, but that's also where the code complexity is: especially merging! And yeah, in such benchmarks if we want merging to happen, we kinda have to force it to happen explicitly.

There are also other variants of the benchmark we could use in the future:

But +1 to start with something.

mikemccand commented 3 years ago

Thank you for all the progress here @jpountz and @rmuir! I'll take a stab at Python wrapper to get these results into each nightly run.

jpountz commented 3 years ago

It's been added to nightly benchmarks so it looks like we are all set?