snoyberg / xml

Various XML utility packages for Haskell
71 stars 64 forks source link

Normalize line endings #183

Closed poorlyknitsweater closed 1 year ago

poorlyknitsweater commented 1 year ago

XML demands that \r and \r\n are replaced by \n by the XML processor Section 2.11. This PR adds this feature.

This PR might break compatitbility for many users.

poorlyknitsweater commented 1 year ago

I benchmarked this PR against master to check how big the impact is. The links below lead to the same commit applied to master and this PR respectively:

stack bench --benchmark-arguments '--regress allocated:iters --time-limit 60' on https://github.com/poorlyknitsweater/xml/commit/70e6fc95e936315b6c5815ef5390ca05b29fe32b:

benchmarking without newline normalization
time                 706.1 ms   (667.7 ms .. 744.8 ms)
                     0.995 R²   (0.990 R² .. 0.999 R²)
mean                 753.7 ms   (738.0 ms .. 771.9 ms)
std dev              31.50 ms   (22.36 ms .. 45.97 ms)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              7.485e9    (7.485e9 .. 7.485e9)
  y                  3281.455   (2458.453 .. 3944.828)

stack bench --benchmark-arguments '--regress allocated:iters --time-limit 60' on https://github.com/poorlyknitsweater/xml/commit/8551275f8679fb86150405ee1726e500de8e9314:

benchmarking with newline normalization
time                 823.4 ms   (807.6 ms .. 832.0 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 809.0 ms   (791.0 ms .. 819.8 ms)
std dev              24.49 ms   (15.30 ms .. 36.55 ms)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              7.612e9    (7.612e9 .. 7.612e9)
  y                  3253.236   (2333.607 .. 3973.035)

So the PR does have a big performance impact. I'll try to compare this against replacing takeWhile1 by something else that normalizes the line endings on-the-fly.

EDIT: The test file has random content with a high density of \r and \n characters so it's not representative at all. However I think there's a lot of room for improvement.

poorlyknitsweater commented 1 year ago

Second round of benchmarks on top of the new implementation (which replaces the takeWhile1 call by some black magic). My machine was a bit busy before so I reran the master benchmark:

stack bench --benchmark-arguments '--regress allocated:iters --time-limit 60' on https://github.com/poorlyknitsweater/xml/commit/70e6fc95e936315b6c5815ef5390ca05b29fe32b:

benchmarking without newline normalization
time                 601.0 ms   (592.4 ms .. 607.2 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 593.6 ms   (586.1 ms .. 598.5 ms)
std dev              11.12 ms   (6.234 ms .. 17.75 ms)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              7.485e9    (7.485e9 .. 7.485e9)
  y                  3231.385   (2455.370 .. 3880.165)

stack bench --benchmark-arguments '--regress allocated:iters --time-limit 60' on https://github.com/poorlyknitsweater/xml/commit/b05ba779b6f9b60e788429d1feef2d665e128421:

benchmarking with newline normalization                                                       
time                 660.7 ms   (656.1 ms .. 666.4 ms)                                        
                     1.000 R²   (1.000 R² .. 1.000 R²)                                        
mean                 649.0 ms   (637.2 ms .. 654.3 ms)                                        
std dev              14.35 ms   (5.778 ms .. 25.00 ms)                                        
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)                                        
  iters              7.932e9    (7.932e9 .. 7.932e9)                                          
  y                  3223.385   (2138.726 .. 4186.241)                                        

benchmarking with newline normalization, no carriage returns                                  
time                 621.2 ms   (618.0 ms .. 624.8 ms)                                        
                     1.000 R²   (1.000 R² .. 1.000 R²)                                        
mean                 618.0 ms   (608.4 ms .. 621.0 ms)                                        
std dev              8.746 ms   (2.888 ms .. 15.82 ms)                                        
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)                                        
  iters              7.645e9    (7.645e9 .. 7.645e9)     
  y                  2757.538   (390.949 .. 5775.262)

Which is better but still a lot worse. The second benchmark here runs on the same file, without any \r characters. In this case there is still noticeable overhead, but it's not quite as bad.

poorlyknitsweater commented 1 year ago

The last change seems to reduce the overhead to basically nothing when the file contains no carriage returns:

stack bench --benchmark-arguments '--regress allocated:iters --time-limit 60' on https://github.com/poorlyknitsweater/xml/commit/79103ca80d6f9280138db01a98cc6164115b5b78:

benchmarking with newline normalization                                                       
time                 647.0 ms   (636.1 ms .. 654.1 ms)                                        
                     1.000 R²   (0.999 R² .. 1.000 R²)                                        
mean                 635.0 ms   (625.5 ms .. 639.5 ms)                                        
std dev              11.90 ms   (5.691 ms .. 19.16 ms)                                        
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)                                        
  iters              7.995e9    (7.995e9 .. 7.995e9)                                          
  y                  3114.462   (1932.848 .. 4241.436)                                        

benchmarking with newline normalization, no carriage returns                                  
time                 600.1 ms   (597.7 ms .. 603.4 ms)                                        
                     1.000 R²   (1.000 R² .. 1.000 R²)                                        
mean                 594.8 ms   (587.1 ms .. 597.8 ms)                                        
std dev              8.306 ms   (3.766 ms .. 14.28 ms)                                        
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)                                        
  iters              7.702e9    (7.702e9 .. 7.702e9)                                          
  y                  2757.538   (363.194 .. 5618.688) 
poorlyknitsweater commented 1 year ago

Would you be willing to merge at this point? If so I'd be happy to clean up the code a little and add some documentation, especially around valid '\r' == False :)

k0ral commented 1 year ago

Took me some time to wrap my head around the logic, but this seems correct to me. I'd be happy to merge this after the clean-up and documentation you mentioned, now that this complies with the "don't pay for what you don't need" principle :) .

poorlyknitsweater commented 1 year ago

I added two more unit tests and some comments. The structure isn't much different but I guess it's a little clearer. I also made this little diagram:

    ┌────────────────────────────────┐
    │consume until invalid, \r or EOF│
    └────────────────┬───────────────┘
                     │
                ┌────▼────┐             ┌────┐
┌────┬──────────►peek next├─────────────►done│
│    │          └────┬────┘ invalid/EOF └────┘
│    │               │
│    │\n           \r│
│    │               │
│   ┌┴───────────────▼───────────────┐
│   │discard \r                      │
│   │consume until invalid, \r or EOF│
│   │inspect first consumed char     │
│   └────────────────┬───────────────┘
│                    │otherwise
│               ┌────▼──┐
└───────────────┤emit \n│
                └───────┘

It's not in the code yet though.

k0ral commented 1 year ago

Thanks for the additional comments and tests. I'm merging this as is, and will apply the suggested simplifications afterwards.

k0ral commented 1 year ago

@poorlyknitsweater Released as 1.9.1.2.