os-climate / aicoe-osc-demo

This repository is the central location for the demos the ET data science team is developing within the OS-Climate project. This demo shows how to use the tools provided by Open Data Hub (ODH) running on the Operate First cluster to perform ETL and to create training and inference pipelines.
Apache License 2.0

Add benchmarking for inference notebooks #184

Open Shreyanand opened 2 years ago

Shreyanand commented 2 years ago

To evaluate sparsification results, we need to benchmark the performance of each inference step: relevance and kpi-extraction.

Shreyanand commented 2 years ago

@rishirich please add any updates for the approach you are taking to solve this issue here.

rishirich commented 2 years ago

@Shreyanand I started by measuring the time taken for each PDF directly, but found that the time per PDF is largely dictated by its page count and text density. After a deep dive into the code to see how the data was gathered and chunked, I think a more accurate way of benchmarking would be to create a chunk out of each individual page (chunk size = number of questions × number of paragraphs), run inference on that chunk (i.e., one page), and then proceed to the next. Once all pages in the PDF are processed, we record the average time per page for that PDF. A benefit of this method is that the text density of each page of the PDF is taken into account.

We can then run this for all the PDFs to get the average inference time per page per PDF per question, collect these averages across PDFs, and report their mean, min, max, and standard deviation. This way we account for the average text density per page per PDF, and varying PDF sizes won't skew the average inference time per PDF.

Then, for any particular PDF, we multiply this average by its page count to get the expected inference time, and can also record the actual inference time for comparison.
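The per-page benchmarking and aggregation described above could be sketched as follows. This is a minimal illustration, not the repository's code: `infer_fn`, `benchmark_pdf`, and `summarize` are hypothetical names, and the inference callable is assumed to accept one page's chunk at a time.

```python
import statistics
import time


def benchmark_pdf(pages, infer_fn):
    """Run inference page by page and return the mean time per page.

    `pages` is a list of per-page chunks; `infer_fn` is a hypothetical
    per-page inference callable (both names are illustrative).
    """
    times = []
    for page in pages:
        start = time.perf_counter()
        infer_fn(page)  # inference on one page's chunk
        times.append(time.perf_counter() - start)
    return statistics.mean(times)


def summarize(per_pdf_means):
    """Aggregate per-PDF average page times into mean/min/max/std."""
    return {
        "mean": statistics.mean(per_pdf_means),
        "min": min(per_pdf_means),
        "max": max(per_pdf_means),
        "std": statistics.stdev(per_pdf_means) if len(per_pdf_means) > 1 else 0.0,
    }


def expected_time(mean_page_time, n_pages):
    """Expected inference time for a PDF: average page time x page count."""
    return mean_page_time * n_pages
```

Keeping the unit "time per page" rather than "time per PDF" is what prevents large documents from dominating the averages.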

MichaelTiemannOSC commented 2 years ago

We discussed in the Data Extraction weekly meeting that the extractor's pattern for recognizing paragraphs (a newline, or perhaps a pair of newlines) was producing pessimal results for CDP documents, where a paragraph may be a short sentence such as "State the global scope 1 CO2 emissions (in megatons)" and the answer even shorter ("1000"). Many small paragraphs are not conducive to the extraction method, and they also create lots of fruitless paragraphs to search. The team will try a new approach: use a regexp that matches question numbers (C4.1a, C4.2, etc.) and treat all the text between consecutive question numbers as sentences. This should both create a lot more context and greatly reduce the number of paragraphs that have to be searched.
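The question-number splitting idea could look something like the sketch below. The pattern and the `split_by_question` helper are assumptions for illustration; the actual CDP identifier format may need a broader regexp.

```python
import re

# Hypothetical pattern for CDP question identifiers such as "C4.1a" or "C4.2".
QUESTION_RE = re.compile(r"\bC\d+\.\d+[a-z]?\b")


def split_by_question(text):
    """Split document text into (question_id, body) pairs, treating all
    text between consecutive question numbers as one section."""
    matches = list(QUESTION_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(), text[m.end():end].strip()))
    return sections
```

Each section then carries both the short question sentence and its answer in one block, instead of splitting them into separate tiny paragraphs.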

Bottom line: number of "paragraphs" as well as pages should be measured.

MichaelTiemannOSC commented 2 years ago

@DaBeIDS @MichaelTiemannOSC for visibility