os-climate / aicoe-osc-demo

This repository is the central location for the demos the ET data science team is developing within the OS-Climate project. This demo shows how to use the tools provided by Open Data Hub (ODH) running on the Operate First cluster to perform ETL and to create training and inference pipelines.
Apache License 2.0

Add benchmarking for inference notebooks #184

Open Shreyanand opened 2 years ago

Shreyanand commented 2 years ago

To evaluate sparsification results, we need to benchmark the performance of each inference step: relevance and kpi-extraction.

Shreyanand commented 2 years ago

@rishirich please add any updates for the approach you are taking to solve this issue here.

rishirich commented 2 years ago

@Shreyanand I started by measuring the time taken for each PDF directly, but found that the time per PDF is largely dictated by its page count and text density. After a deep dive into the code to see how the data was gathered and chunked, I think a more accurate way of benchmarking would be to create a chunk out of each individual page (chunk size = number of questions × number of paragraphs), run inference on that chunk (i.e., one page), and then proceed to the next. Once all pages in the PDF are processed, we record the average time per page for that PDF. A benefit of this method is that the text density of each page of the PDF is taken into account.

We can then run this for all the PDFs to get the average inference time per page per PDF per question, collect these averages across PDFs, and report their mean, min, max, and standard deviation. This way we account for the average text density per page per PDF, and varying PDF sizes won't skew the average inference time per PDF.

Then, for any particular PDF, we multiply this average by its page count to get the expected inference time, and can also record the actual inference time for comparison.
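The per-page benchmarking and aggregation described above could be sketched as follows. This is a minimal illustration, not the repository's code: `infer_fn`, `benchmark_pdf`, and `summarize` are hypothetical names, and the inference callable is assumed to accept one page's chunk at a time.

```python
import statistics
import time


def benchmark_pdf(pages, infer_fn):
    """Run inference page by page and return the mean time per page.

    `pages` is a list of per-page chunks; `infer_fn` is a hypothetical
    per-page inference callable (both names are illustrative).
    """
    times = []
    for page in pages:
        start = time.perf_counter()
        infer_fn(page)  # inference on one page's chunk
        times.append(time.perf_counter() - start)
    return statistics.mean(times)


def summarize(per_pdf_means):
    """Aggregate per-PDF average page times into mean/min/max/std."""
    return {
        "mean": statistics.mean(per_pdf_means),
        "min": min(per_pdf_means),
        "max": max(per_pdf_means),
        "std": statistics.stdev(per_pdf_means) if len(per_pdf_means) > 1 else 0.0,
    }


def expected_time(mean_page_time, n_pages):
    """Expected inference time for a PDF: average page time x page count."""
    return mean_page_time * n_pages
```

Keeping the unit "time per page" rather than "time per PDF" is what prevents large documents from dominating the averages.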

MichaelTiemannOSC commented 2 years ago

We discussed in the Data Extraction weekly meeting that the extractor's pattern for recognizing paragraphs (a newline, or perhaps a pair of newlines) was producing pessimal results for CDP documents, where a paragraph may be a short sentence such as "State the global scope 1 CO2 emissions (in megatons)" and the answer even shorter ("1000"). Many small paragraphs are not conducive to the extraction method, and they also create lots of fruitless paragraphs to search. The team will try a new approach: use a regexp that matches question numbers (C4.1a, C4.2, etc.) and treat all the text between consecutive question numbers as sentences. This should both create a lot more context and greatly reduce the number of paragraphs that have to be searched.
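The question-number splitting idea could look something like the sketch below. The pattern and the `split_by_question` helper are assumptions for illustration; the actual CDP identifier format may need a broader regexp.

```python
import re

# Hypothetical pattern for CDP question identifiers such as "C4.1a" or "C4.2".
QUESTION_RE = re.compile(r"\bC\d+\.\d+[a-z]?\b")


def split_by_question(text):
    """Split document text into (question_id, body) pairs, treating all
    text between consecutive question numbers as one section."""
    matches = list(QUESTION_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(), text[m.end():end].strip()))
    return sections
```

Each section then carries both the short question sentence and its answer in one block, instead of splitting them into separate tiny paragraphs.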

Bottom line: number of "paragraphs" as well as pages should be measured.

MichaelTiemannOSC commented 2 years ago

@DaBeIDS @MichaelTiemannOSC for visibility