osirrc / jig

Jig for the Open-Source IR Replicability Challenge (OSIRRC)

Efficiency measurements (how to gather?) #23

lintool opened this issue 5 years ago

lintool commented 5 years ago

Hi @andrewtrotman, can you think about how you'd like the jig to report efficiency metrics? I see a few options:

  1. The jig could record it, but the measurements would be coarse-grained.
  2. The image itself could record it and relay the metrics back to the jig in some standard format (see the sketch below).

Both have their advantages and disadvantages... thoughts?
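For option 2, one concrete possibility is for the image to drop a small JSON file into a mounted output directory that the jig then reads back. This is purely a sketch: the file name, location, and field names below are assumptions, not anything the jig currently defines.

```python
# Hypothetical sketch of option 2: the image writes its own fine-grained
# measurements to metrics.json in a mounted output directory; the jig reads it.
import json
from pathlib import Path

def read_image_metrics(output_dir: str) -> dict:
    """Return the metrics the image reported, or an empty dict if it reported none."""
    metrics_file = Path(output_dir) / "metrics.json"   # assumed file name and location
    if not metrics_file.exists():
        return {}
    with metrics_file.open() as f:
        return json.load(f)

# Example of what an image might write (field names are placeholders):
# {"indexing_seconds": 812.4, "index_bytes": 1973415936,
#  "search_seconds": 41.7, "queries_per_second": 239.8}
```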

amallia commented 5 years ago

In terms of efficiency, in my opinion, it is fundamental to measure the size of the index. Being able to store additional data can definitely improve query processing speed and quality, which, in turn, corresponds to higher main-memory usage.

Should we impose specific hard limits (memory, disk space, CPUs, ...) on the running Docker instance? For example, by forcing the container to run on a single CPU we ensure that ad-hoc retrieval runs on a single core too.
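For reference, a minimal sketch of what such limits could look like if the jig launched containers through the docker-py SDK. The image name, command, and limit values are placeholders, not the jig's current behaviour.

```python
# Hypothetical sketch: constrain the container's resources when the jig starts it.
import docker

client = docker.from_env()
container = client.containers.run(
    "osirrc/example-image:latest",   # placeholder image name
    command="search",                # placeholder hook/command
    cpuset_cpus="0",                 # restrict the container to a single core
    mem_limit="16g",                 # cap main-memory usage
    detach=True,
)
container.wait()
print(container.logs().decode())
```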

Moreover, is efficiency only related to query processing? Is efficiency of indexing relevant at all?

andrewtrotman commented 5 years ago

Efficiency essentially breaks down into efficiency of space and efficiency of time.

In the case of the indexer, I think we can just use the output of the UNIX time command to tell us how long it took to build the index. If the indexer also reports its time, it would be interesting to see how the two compare. We can use the UNIX ls command to see how large the index is, but the indexer will need to tell us where to look.
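As a rough sketch of what that looks like from the jig side (essentially what the UNIX time command would report, i.e., wall-clock time including container start-up), with the docker command as a placeholder:

```python
# Time the indexing step from the outside, as UNIX time would.
import subprocess
import time

start = time.perf_counter()
subprocess.run(["docker", "run", "--rm", "osirrc/example-image:latest", "index"],
               check=True)
elapsed = time.perf_counter() - start
print(f"indexing wall-clock time: {elapsed:.1f}s")  # compare against what the indexer itself reports
```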

For the search I think the 250 topics we have is way too small for measuring search time. The brief test I ran suggested that some of those topics will take near-enough to 0 time. So I think we should use the 10,000 topics from the TREC Million Query Track (or 20,000 if we use both years). I'd like to compare what the search engine claims against what the UNIX time command claims. Sure, UNIX time will include start-up, shut-down, and index-load time, but that is why we also need to look at what the search engine claims.

So we need, I think, a "spec":

Nothing really for indexing (is there?), just agreement on a single line of output that states where the index can be found, so that we can start the container and "ls" to get the index size. We can easily change the jig to call the UNIX time command.
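A sketch of how the jig could act on that single agreed line; the `INDEX_PATH=` marker is an assumption, only there to make the example concrete:

```python
# Parse the index location reported by the image and sum file sizes under it
# (the programmatic equivalent of ls).
import os

def parse_index_path(container_output: str) -> str:
    for line in container_output.splitlines():
        if line.startswith("INDEX_PATH="):            # assumed marker, not an agreed format
            return line.split("=", 1)[1].strip()
    raise ValueError("container did not report an index path")

def index_size_bytes(index_path: str) -> int:
    total = 0
    for root, _, files in os.walk(index_path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```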

For search, we need to agree on when we start the timer, when it ends, and what we are measuring (throughput or latency). We can turn throughput into latency by setting the thread count to 1. So let's measure throughput. I think we start the timer at the last possible moment before the first query and stop it at the first possible moment after we complete the last query. As we all have the same I/O demands when it comes to producing the TREC run file, we could agree to include or exclude that time - thoughts, please.
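To make that timing convention concrete, here is a sketch under those assumptions: the clock starts at the last possible moment before the first query and stops at the first possible moment after the last one, with the TREC run-file I/O left outside the timed region (it could just as easily go inside if we agree on that). `run_query` is a stand-in for whatever the engine exposes, not an existing interface.

```python
import time

def measure_throughput(queries, run_query):
    results = []
    start = time.perf_counter()            # last possible moment before the first query
    for qid, text in queries:
        results.append((qid, run_query(text)))
    elapsed = time.perf_counter() - start  # first possible moment after the last query
    throughput = len(queries) / elapsed    # queries per second
    return results, elapsed, throughput
```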

frrncl commented 5 years ago

Hi,

What about indexing time in the case of ML stuff? Should we break it down into training, validation, ...? Also, do we need some breakdown of index size in this case?

Nicola

andrewtrotman commented 5 years ago

Agreed - we need to measure the efficiency of the ML stuff. I'm hoping there's a chance to do the ML stuff before indexing, because I want to learn the best solution and then bake it into my index.

albpurpura commented 5 years ago

NVSM performs indexing before training and validation. I think indexing could be a separate step from training and test, also for NeuIR models. Training, validation, and test are performed on different subsets of topics specified by the user (without cross-validation). To summarize, the steps we consider are: 1) indexing, 2) training and validation (with early stopping), 3) test. What do you think of this sequence of steps? Can we adopt this also for other NeuIR models?
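One way the jig could drive that sequence (just a sketch; the hook names and image name are placeholders, not an agreed interface) is three separate, separately timed container invocations, so indexing, training/validation, and test efficiency can be reported independently:

```python
import subprocess
import time

def run_phase(image: str, hook: str) -> float:
    """Run one phase in its own container and return its wall-clock time."""
    start = time.perf_counter()
    subprocess.run(["docker", "run", "--rm", image, hook], check=True)
    return time.perf_counter() - start

image = "osirrc/example-neuir-image:latest"   # placeholder image name
timings = {hook: run_phase(image, hook) for hook in ("index", "train", "test")}
print(timings)   # per-phase wall-clock times
```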

cmacdonald commented 5 years ago

The (nuclear) alternative would be for efficiency to be measured using the jig by sending queries on stdin (one by one).

In any case, I agree that we should record the number of cores & threads involved in both retrieval and indexing, so that we get like-for-like comparisons.
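A sketch of that stdin-driven alternative, plus recording the host's core count. The one-query-per-line, one-result-line-per-query protocol and the image name are assumed purely for illustration, not an existing interface.

```python
import os
import subprocess
import time

# Start the container in interactive mode so queries can be piped in one by one.
proc = subprocess.Popen(
    ["docker", "run", "--rm", "-i", "osirrc/example-image:latest", "interact"],  # placeholder
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

latencies = []
for query in ["example query one", "example query two"]:
    start = time.perf_counter()
    proc.stdin.write(query + "\n")
    proc.stdin.flush()
    proc.stdout.readline()                        # assumes one result line per query
    latencies.append(time.perf_counter() - start)

proc.stdin.close()
proc.wait()
print(f"host cores: {os.cpu_count()}, mean latency: {sum(latencies) / len(latencies):.4f}s")
```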