haskell / criterion

A powerful but simple library for measuring the performance of Haskell code.
http://www.serpentine.com/criterion
BSD 2-Clause "Simplified" License

Newlines in bgroup or benchmarks break HTML output #224

Closed mgree closed 4 years ago

mgree commented 4 years ago

I've reduced some breaking benchmarks to the following minimal example, running 1.5.6.2:

```haskell
import Criterion.Main

main :: IO ()
main = defaultMain
  [ bgroup (unlines ["this group", "has", "newlines", "in the group name"])
    [ bench (unlines ["this test", "has", "newlines", "in the test name"]) $
      whnf not True
    ]
  ]
```

The resulting HTML has a syntax error: the string with newlines is spliced into the report as an ordinary JS string literal, without escaping the newlines. (In my generated bad.html, below, the first error is at line 520; grep for PROBLEM.)

```
criterion report

criterion performance measurements

overview

want to understand this report?

this group has newlines in the group name /this test has newlines in the test name

                      lower bound             estimate                upper bound
OLS regression        xxx                     xxx                     xxx
R² goodness-of-fit    xxx                     xxx                     xxx
Mean execution time   3.843392356200714e-9    3.918487349772111e-9    4.038613783941385e-9
Standard deviation    2.1364723382882635e-10  3.1496713038128253e-10  4.845850781519472e-10

Outlying measurements have severe (0.8894189024471889%) effect on estimated standard deviation.

understanding this report

In this report, each function benchmarked by criterion is assigned a section of its own. The charts in each section are active; if you hover your mouse over data points and annotations, you will see more details.

  • The chart on the left is a kernel density estimate (also known as a KDE) of time measurements. This graphs the probability of any given time measurement occurring. A spike indicates that a measurement of a particular time occurred; its height indicates how often that measurement was repeated.
  • The chart on the right is the raw data from which the kernel density estimate is built. The x axis indicates the number of loop iterations, while the y axis shows measured execution time for the given number of loop iterations. The line behind the values is the linear regression prediction of execution time for a given number of iterations. Ideally, all measurements will be on (or very near) this line.

Under the charts is a small table. The first two rows are the results of a linear regression run on the measurements displayed in the right-hand chart.

  • OLS regression indicates the time estimated for a single loop iteration using an ordinary least-squares regression model. This number is more accurate than the mean estimate below it, as it more effectively eliminates measurement overhead and other constant factors.
  • R² goodness-of-fit is a measure of how accurately the linear regression model fits the observed measurements. If the measurements are not too noisy, R² should lie between 0.99 and 1, indicating an excellent fit. If the number is below 0.99, something is confounding the accuracy of the linear model.
  • Mean execution time and standard deviation are statistics calculated from execution time divided by number of iterations.

We use a statistical technique called the bootstrap to provide confidence intervals on our estimates. The bootstrap-derived upper and lower bounds on estimates let you see how accurate we believe those estimates to be. (Hover the mouse over the table headers to see the confidence levels.)

A noisy benchmarking environment can cause some or many measurements to fall far from the mean. These outlying measurements can have a significant inflationary effect on the estimate of the standard deviation. We calculate and display an estimate of the extent to which the standard deviation has been inflated by outliers.

```
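The underlying bug class is an unescaped string splice into JavaScript. One way a fix could look, sketched in Haskell (`escapeJS` is a hypothetical illustration, not criterion's actual report code):

```haskell
-- Hypothetical sketch: escape the characters that would otherwise break
-- a JavaScript string literal before splicing a benchmark name into the
-- report template. Not criterion's real implementation.
escapeJS :: String -> String
escapeJS = concatMap esc
  where
    esc '\\' = "\\\\"
    esc '"'  = "\\\""
    esc '\n' = "\\n"
    esc '\r' = "\\r"
    esc c    = [c]

main :: IO ()
main = putStrLn ("var benches = [\"" ++ escapeJS "this group\nhas\nnewlines" ++ "\"];")
-- prints: var benches = ["this group\nhas\nnewlines"];
```

With this kind of escaping the generated `var benches = [...]` line stays a single valid JS statement no matter what the benchmark names contain.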
RyanGlScott commented 4 years ago

Just to make sure: how would you expect this to be rendered? I could certainly envision doing some sort of manual JavaScript string sanitizing so that this:

```javascript
  var benches = ["this group
has
newlines
in the group name
/this test
has
newlines
in the test name
",];
```

is displayed like this instead:

```javascript
  var benches = ["this group" +
"has" +
"newlines" +
"in the group name" +
"/this test" +
"has" +
"newlines" +
"in the test name" +
"",];
```

However, it is worth noting that after doing so, the criterion report will ultimately show up like this:

[screenshot: criterion report with the name rendered on a single line]

This, ultimately, is no different than if you had written the group/test names without newlines. Is that what you would expect? (I'm not sure if you had a particular reason for using unlines in your code.)

mgree commented 4 years ago

I don't think I had particular expectations here! I noticed the issue when I was naming tests after Show instances, including one that debug-printed automata as sets of states and transition relations (with newlines).

I've worked around it in my code (showing the regex that generated the automaton instead). Maybe turn newlines into spaces, and emit a warning when --output is set that such test names will have their newlines replaced?
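The workaround described above could be sketched like this (`sanitizeName` is a hypothetical helper, not part of criterion's API):

```haskell
-- Hypothetical helper: flatten newlines and carriage returns to spaces
-- before a report name reaches the HTML template. Not criterion's API.
sanitizeName :: String -> String
sanitizeName = map (\c -> if c == '\n' || c == '\r' then ' ' else c)

main :: IO ()
main = putStrLn (sanitizeName "this test\nhas\nnewlines")
-- prints: this test has newlines
```

This loses the line structure of the name, which is why pairing it with a warning (as suggested) seems reasonable.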

RyanGlScott commented 4 years ago

#225 changes the report generator so that it will emit a warning if a report name contains newlines, like so:

```
benchmarking this group
has
newlines
in the group name
/this test
has
newlines
in the test name

time                 4.752 ns   (4.751 ns .. 4.753 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 4.752 ns   (4.751 ns .. 4.754 ns)
std dev              4.173 ps   (2.539 ps .. 7.271 ps)

criterion: warning:
  Report name "this group\nhas\nnewlines\nin the group name\n/this test\nhas\nnewlines\nin the test name\n" contains newlines, which will be replaced with spaces in the HTML report.
```

Would you consider this a satisfactory fix to this issue?

mgree commented 4 years ago

Beautiful---thanks for the quick fix! :)