ocramz / xeno

Fast Haskell XML parser

validate fails if there are spaces around `=` #21

Closed MartinPotier closed 3 years ago

MartinPotier commented 6 years ago

Seems like validation fails if I have something of the sort:

        <tag attr1 = "Some string" />

AFAIK, spaces around `=` are allowed by the XML specification: https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-white-space

Can I do anything to make these accepted?

This is with xeno-0.3.2, as included in LTS-10.4.
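
For reference, the XML grammar defines an attribute as `Name Eq AttValue`, with `Eq ::= S? '=' S?`, so whitespace on either side of `=` is legal. The following is a minimal, self-contained sketch of that rule using only `base`. It is not xeno's implementation: `parseAttr`, `isXmlSpace` and `isNameChar` are hypothetical names, and the `Name` and `AttValue` handling is deliberately simplified.

```haskell
-- Sketch of the XML attribute grammar (NOT xeno's code):
--   Attribute ::= Name Eq AttValue
--   Eq        ::= S? '=' S?
--   S         ::= (#x20 | #x9 | #xD | #xA)+
module Main where

import Control.Monad (guard)
import Data.Char (isAlphaNum)

-- XML whitespace (production S): space, tab, carriage return, line feed.
isXmlSpace :: Char -> Bool
isXmlSpace c = c `elem` " \t\r\n"

-- Simplified Name characters; the real production is stricter about the
-- first character and allows further Unicode ranges.
isNameChar :: Char -> Bool
isNameChar c = isAlphaNum c || c `elem` "_-.:"

-- Parse one attribute, returning (name, value, remaining input).
parseAttr :: String -> Maybe (String, String, String)
parseAttr input = do
  let (name, afterName) = span isNameChar input
  guard (not (null name))
  -- Eq: optional whitespace, '=', optional whitespace.
  afterEq <- case dropWhile isXmlSpace afterName of
    '=' : rest -> Just (dropWhile isXmlSpace rest)
    _          -> Nothing
  -- AttValue: a single- or double-quoted string (entity references are
  -- not interpreted in this sketch).
  case afterEq of
    q : rest | q == '"' || q == '\'' ->
      case break (== q) rest of
        (value, _ : remaining) -> Just (name, value, remaining)
        _                      -> Nothing
    _ -> Nothing

main :: IO ()
main = do
  -- Both spellings should yield the same name and value.
  print (parseAttr "attr1 = \"Some string\" />")
  print (parseAttr "attr1=\"Some string\" />")
```

A parser that only accepts `=` immediately after the name would reject the first input, which matches the failure reported here.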

qrilka commented 6 years ago

@MartinPotier isn't #19 exactly about that?

ocramz commented 6 years ago

@MartinPotier @qrilka could you try out this branch, where I've merged @unhammer's patch, and report before/after figures if you can? https://github.com/ocramz/xeno/tree/whitespace-around-equals-%2319

MartinPotier commented 6 years ago

> @MartinPotier isn't #19 exactly about that?

It is! Sorry I didn't catch this

> @MartinPotier @qrilka could you try out this branch where I've merged @unhammer's patch, and report before/after figures if you can?

I'll try to find a way to use it; I'm not too familiar with the procedure (and still quite new to Haskell tooling).

ocramz commented 6 years ago

@MartinPotier you can just run `stack bench` while on master, and then again after switching to the feature branch :)

MartinPotier commented 6 years ago

Master:

Running 2 benchmarks...
Benchmark xeno-memory-bench: RUNNING...

Case             Allocated  GCs
4kb/hexml/dom        3,808    0
4kb/xeno/sax         -,496    0
4kb/xeno/dom         7,968    0
31kb/hexml/dom      30,608    0
31kb/xeno/sax        -,928    0
31kb/xeno/dom        7,536    0
211kb/hexml/dom    211,496    0
211kb/xeno/sax      26,576    0
211kb/xeno/dom   1,043,328    0
Benchmark xeno-memory-bench: FINISH
Benchmark xeno-speed-bench: RUNNING...
benchmarking 4KB/hexml-dom
time                 12.68 μs   (11.95 μs .. 13.86 μs)
                     0.926 R²   (0.883 R² .. 0.965 R²)
mean                 14.69 μs   (13.34 μs .. 17.05 μs)
std dev              6.183 μs   (4.201 μs .. 9.620 μs)
variance introduced by outliers: 99% (severely inflated)

benchmarking 4KB/xeno-sax
time                 4.830 μs   (4.649 μs .. 5.099 μs)
                     0.969 R²   (0.950 R² .. 0.987 R²)
mean                 5.120 μs   (4.869 μs .. 5.420 μs)
std dev              939.5 ns   (732.0 ns .. 1.170 μs)
variance introduced by outliers: 96% (severely inflated)

benchmarking 4KB/xeno-dom
time                 12.92 μs   (12.06 μs .. 14.36 μs)
                     0.860 R²   (0.720 R² .. 0.969 R²)
mean                 16.37 μs   (13.88 μs .. 23.57 μs)
std dev              13.17 μs   (5.986 μs .. 23.14 μs)
variance introduced by outliers: 99% (severely inflated)

benchmarking 4KB/hexpat-sax
time                 110.8 μs   (98.79 μs .. 128.9 μs)
                     0.880 R²   (0.838 R² .. 0.927 R²)
mean                 146.0 μs   (109.8 μs .. 228.5 μs)
std dev              159.1 μs   (31.92 μs .. 272.5 μs)
variance introduced by outliers: 99% (severely inflated)

benchmarking 4KB/hexpat-dom
time                 382.4 μs   (298.4 μs .. 477.4 μs)
                     0.763 R²   (0.724 R² .. 0.932 R²)
mean                 353.0 μs   (318.9 μs .. 408.0 μs)
std dev              130.9 μs   (95.56 μs .. 183.2 μs)
variance introduced by outliers: 99% (severely inflated)

benchmarking 4KB/xml-dom
time                 2.559 ms   (1.783 ms .. 3.597 ms)
                     0.544 R²   (0.364 R² .. 0.728 R²)
mean                 3.529 ms   (3.060 ms .. 4.191 ms)
std dev              1.702 ms   (1.365 ms .. 2.075 ms)
variance introduced by outliers: 98% (severely inflated)

benchmarking 31KB/hexml-dom
time                 14.24 μs   (11.43 μs .. 17.59 μs)
                     0.668 R²   (0.527 R² .. 0.822 R²)
mean                 16.88 μs   (13.83 μs .. 22.88 μs)
std dev              12.56 μs   (7.129 μs .. 21.33 μs)
variance introduced by outliers: 99% (severely inflated)

benchmarking 31KB/xeno-sax
time                 2.018 μs   (1.908 μs .. 2.269 μs)
                     0.960 R²   (0.913 R² .. 0.999 R²)
mean                 1.940 μs   (1.895 μs .. 2.095 μs)
std dev              255.8 ns   (55.95 ns .. 531.4 ns)
variance introduced by outliers: 93% (severely inflated)

benchmarking 31KB/xeno-dom
time                 5.479 μs   (5.120 μs .. 5.902 μs)
                     0.971 R²   (0.956 R² .. 0.992 R²)
mean                 5.160 μs   (4.993 μs .. 5.429 μs)
std dev              661.4 ns   (399.2 ns .. 985.0 ns)
variance introduced by outliers: 92% (severely inflated)

benchmarking 31KB/hexpat-sax
time                 248.8 μs   (243.4 μs .. 254.9 μs)
                     0.994 R²   (0.989 R² .. 0.997 R²)
mean                 252.6 μs   (243.2 μs .. 279.8 μs)
std dev              56.43 μs   (12.19 μs .. 107.6 μs)
variance introduced by outliers: 96% (severely inflated)

benchmarking 31KB/hexpat-dom
time                 293.7 μs   (268.3 μs .. 336.0 μs)
                     0.912 R²   (0.840 R² .. 0.997 R²)
mean                 273.4 μs   (264.2 μs .. 295.9 μs)
std dev              48.93 μs   (13.13 μs .. 92.63 μs)
variance introduced by outliers: 92% (severely inflated)

benchmarking 31KB/xml-dom
time                 12.43 ms   (11.61 ms .. 13.55 ms)
                     0.969 R²   (0.947 R² .. 0.992 R²)
mean                 11.99 ms   (11.68 ms .. 12.52 ms)
std dev              1.050 ms   (796.8 μs .. 1.478 ms)
variance introduced by outliers: 46% (moderately inflated)

benchmarking 211KB/hexml-dom
time                 255.4 μs   (240.2 μs .. 280.3 μs)
                     0.980 R²   (0.964 R² .. 0.998 R²)
mean                 248.1 μs   (242.9 μs .. 259.4 μs)
std dev              23.53 μs   (12.04 μs .. 38.22 μs)
variance introduced by outliers: 77% (severely inflated)

benchmarking 211KB/xeno-sax
time                 255.1 μs   (233.4 μs .. 276.1 μs)
                     0.971 R²   (0.959 R² .. 0.989 R²)
mean                 230.3 μs   (222.5 μs .. 240.0 μs)
std dev              29.79 μs   (22.70 μs .. 40.46 μs)
variance introduced by outliers: 86% (severely inflated)

benchmarking 211KB/xeno-dom
time                 783.7 μs   (656.5 μs .. 949.1 μs)
                     0.787 R²   (0.728 R² .. 0.895 R²)
mean                 888.4 μs   (809.2 μs .. 1.019 ms)
std dev              315.6 μs   (249.4 μs .. 397.7 μs)
variance introduced by outliers: 98% (severely inflated)

benchmarking 211KB/hexpat-sax
time                 18.44 ms   (14.61 ms .. 22.18 ms)
                     0.875 R²   (0.820 R² .. 0.969 R²)
mean                 23.69 ms   (21.02 ms .. 27.38 ms)
std dev              7.008 ms   (4.904 ms .. 9.546 ms)
variance introduced by outliers: 89% (severely inflated)

benchmarking 211KB/hexpat-dom
time                 20.10 ms   (17.66 ms .. 22.25 ms)
                     0.923 R²   (0.789 R² .. 0.983 R²)
mean                 24.17 ms   (22.73 ms .. 26.57 ms)
std dev              3.973 ms   (2.581 ms .. 5.958 ms)
variance introduced by outliers: 69% (severely inflated)

benchmarking 211KB/xml-dom
time                 84.14 ms   (53.48 ms .. 95.22 ms)
                     0.903 R²   (0.682 R² .. 0.999 R²)
mean                 105.2 ms   (94.69 ms .. 136.5 ms)
std dev              27.07 ms   (1.725 ms .. 40.24 ms)
variance introduced by outliers: 76% (severely inflated)

Benchmark xeno-speed-bench: FINISH
Completed 73 action(s).

MartinPotier commented 6 years ago

Whitespace branch:

Running 2 benchmarks...
Benchmark xeno-memory-bench: RUNNING...

Case             Allocated  GCs
4kb/hexml/dom        3,808    0
4kb/xeno/sax         -,496    0
4kb/xeno/dom         7,968    0
31kb/hexml/dom      30,608    0
31kb/xeno/sax        -,928    0
31kb/xeno/dom        6,056    0
211kb/hexml/dom    211,496    0
211kb/xeno/sax      26,576    0
211kb/xeno/dom   1,043,328    0
Benchmark xeno-memory-bench: FINISH
Benchmark xeno-speed-bench: RUNNING...
benchmarking 4KB/hexml-dom
time                 9.143 μs   (8.837 μs .. 9.459 μs)
                     0.992 R²   (0.986 R² .. 0.997 R²)
mean                 8.893 μs   (8.705 μs .. 9.255 μs)
std dev              884.4 ns   (588.1 ns .. 1.378 μs)
variance introduced by outliers: 86% (severely inflated)

benchmarking 4KB/xeno-sax
time                 4.120 μs   (4.075 μs .. 4.159 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 4.058 μs   (4.006 μs .. 4.161 μs)
std dev              262.7 ns   (137.0 ns .. 486.1 ns)
variance introduced by outliers: 74% (severely inflated)

benchmarking 4KB/xeno-dom
time                 9.651 μs   (9.376 μs .. 9.983 μs)
                     0.992 R²   (0.986 R² .. 0.998 R²)
mean                 9.909 μs   (9.686 μs .. 10.19 μs)
std dev              872.1 ns   (708.7 ns .. 1.251 μs)
variance introduced by outliers: 83% (severely inflated)

benchmarking 4KB/hexpat-sax
time                 71.90 μs   (68.60 μs .. 76.49 μs)
                     0.975 R²   (0.954 R² .. 0.991 R²)
mean                 73.57 μs   (70.20 μs .. 80.64 μs)
std dev              14.43 μs   (6.847 μs .. 25.25 μs)
variance introduced by outliers: 95% (severely inflated)

benchmarking 4KB/hexpat-dom
time                 207.4 μs   (201.9 μs .. 213.8 μs)
                     0.994 R²   (0.991 R² .. 0.997 R²)
mean                 205.7 μs   (202.7 μs .. 209.4 μs)
std dev              11.86 μs   (9.518 μs .. 15.88 μs)
variance introduced by outliers: 56% (severely inflated)

benchmarking 4KB/xml-dom
time                 2.425 ms   (2.360 ms .. 2.526 ms)
                     0.970 R²   (0.948 R² .. 0.986 R²)
mean                 2.236 ms   (2.116 ms .. 2.379 ms)
std dev              439.1 μs   (343.7 μs .. 560.7 μs)
variance introduced by outliers: 90% (severely inflated)

benchmarking 31KB/hexml-dom
time                 13.20 μs   (12.65 μs .. 13.89 μs)
                     0.974 R²   (0.960 R² .. 0.988 R²)
mean                 14.45 μs   (13.48 μs .. 16.64 μs)
std dev              4.307 μs   (2.414 μs .. 8.058 μs)
variance introduced by outliers: 99% (severely inflated)

benchmarking 31KB/xeno-sax
time                 2.178 μs   (2.154 μs .. 2.209 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 2.183 μs   (2.156 μs .. 2.245 μs)
std dev              128.6 ns   (66.64 ns .. 236.9 ns)
variance introduced by outliers: 71% (severely inflated)

benchmarking 31KB/xeno-dom
time                 6.806 μs   (6.556 μs .. 7.070 μs)
                     0.987 R²   (0.976 R² .. 0.995 R²)
mean                 6.711 μs   (6.502 μs .. 7.050 μs)
std dev              811.9 ns   (542.0 ns .. 1.216 μs)
variance introduced by outliers: 91% (severely inflated)

benchmarking 31KB/hexpat-sax
time                 322.0 μs   (296.6 μs .. 356.0 μs)
                     0.942 R²   (0.907 R² .. 0.974 R²)
mean                 333.2 μs   (317.0 μs .. 355.2 μs)
std dev              58.86 μs   (43.57 μs .. 86.36 μs)
variance introduced by outliers: 92% (severely inflated)

benchmarking 31KB/hexpat-dom
time                 374.8 μs   (355.3 μs .. 395.7 μs)
                     0.979 R²   (0.968 R² .. 0.991 R²)
mean                 394.8 μs   (375.7 μs .. 436.9 μs)
std dev              96.14 μs   (45.08 μs .. 177.7 μs)
variance introduced by outliers: 95% (severely inflated)

benchmarking 31KB/xml-dom
time                 15.52 ms   (14.78 ms .. 16.26 ms)
                     0.987 R²   (0.975 R² .. 0.994 R²)
mean                 16.48 ms   (15.92 ms .. 17.36 ms)
std dev              1.690 ms   (1.128 ms .. 2.594 ms)
variance introduced by outliers: 48% (moderately inflated)

benchmarking 211KB/hexml-dom
time                 379.2 μs   (359.4 μs .. 401.2 μs)
                     0.974 R²   (0.960 R² .. 0.986 R²)
mean                 378.3 μs   (358.4 μs .. 417.5 μs)
std dev              90.25 μs   (57.77 μs .. 180.8 μs)
variance introduced by outliers: 95% (severely inflated)

benchmarking 211KB/xeno-sax
time                 226.8 μs   (222.5 μs .. 232.5 μs)
                     0.995 R²   (0.993 R² .. 0.998 R²)
mean                 234.6 μs   (230.7 μs .. 240.0 μs)
std dev              16.45 μs   (13.84 μs .. 19.01 μs)
variance introduced by outliers: 64% (severely inflated)

benchmarking 211KB/xeno-dom
time                 703.2 μs   (679.9 μs .. 735.4 μs)
                     0.958 R²   (0.929 R² .. 0.980 R²)
mean                 798.2 μs   (748.4 μs .. 897.1 μs)
std dev              227.5 μs   (144.9 μs .. 436.6 μs)
variance introduced by outliers: 97% (severely inflated)

benchmarking 211KB/hexpat-sax
time                 28.17 ms   (25.24 ms .. 32.59 ms)
                     0.937 R²   (0.873 R² .. 0.989 R²)
mean                 27.98 ms   (26.05 ms .. 30.60 ms)
std dev              4.626 ms   (3.002 ms .. 6.750 ms)
variance introduced by outliers: 65% (severely inflated)

benchmarking 211KB/hexpat-dom
time                 41.27 ms   (31.05 ms .. 51.55 ms)
                     0.873 R²   (0.816 R² .. 0.980 R²)
mean                 32.56 ms   (30.40 ms .. 36.64 ms)
std dev              6.351 ms   (3.747 ms .. 9.864 ms)
variance introduced by outliers: 74% (severely inflated)

benchmarking 211KB/xml-dom
time                 91.71 ms   (4.966 ms .. 161.3 ms)
                     0.749 R²   (0.171 R² .. 0.986 R²)
mean                 134.8 ms   (113.6 ms .. 158.8 ms)
std dev              33.58 ms   (23.60 ms .. 46.92 ms)
variance introduced by outliers: 73% (severely inflated)

Benchmark xeno-speed-bench: FINISH
Completed 2 action(s).

MartinPotier commented 6 years ago

Please note that I'm running this on a laptop with an Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz. Not really fast, and almost always doing other things on the side.

ocramz commented 6 years ago

@MartinPotier yes, there are lots of outlying measurements, it would be good to run these without many "noisy neighbours" (i.e. not many other processes competing for CPU)
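
To put the two runs side by side, here is a small sketch that computes the relative change for the xeno benchmarks only. The figures are the `time` estimates copied from the master and whitespace-branch outputs above; `xenoTimes` and `percentChange` are names made up for this illustration. Given the outlier warnings, the resulting percentages should be read as rough indications at best: the sign of the change is not even consistent across input sizes (the branch looks faster at 4KB and 211KB, slower at 31KB).

```haskell
module Main where

import Text.Printf (printf)

-- (benchmark, master time, branch time), both in microseconds,
-- copied from the two benchmark runs posted in this thread.
xenoTimes :: [(String, Double, Double)]
xenoTimes =
  [ ("4KB/xeno-sax",   4.830, 4.120)
  , ("4KB/xeno-dom",   12.92, 9.651)
  , ("31KB/xeno-sax",  2.018, 2.178)
  , ("31KB/xeno-dom",  5.479, 6.806)
  , ("211KB/xeno-sax", 255.1, 226.8)
  , ("211KB/xeno-dom", 783.7, 703.2)
  ]

-- Relative change in percent; a positive value means the branch is slower.
percentChange :: Double -> Double -> Double
percentChange before after = (after - before) / before * 100

main :: IO ()
main = mapM_ report xenoTimes
  where
    report :: (String, Double, Double) -> IO ()
    report (name, before, after) =
      printf "%-16s %+6.1f%%\n" name (percentChange before after)
```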

MartinPotier commented 6 years ago

This is my work laptop and it's difficult to make it quiet :smile_cat: I'll run this at home on a beefier machine, maybe that'll be better.

MartinPotier commented 6 years ago

Hmmm, unfortunately, at home I can't run the bench:

Running 2 benchmarks...
Benchmark xeno-memory-bench: RUNNING...
Case             Allocated  GCs
4kb/hexml/dom        4,120    0
4kb/xeno/sax         -,496    0
4kb/xeno/dom         7,968    0
31kb/hexml/dom      26,272    0
31kb/xeno/sax       -1,240    0
31kb/xeno/dom        7,464    0
211kb/hexml/dom    211,496    0
211kb/xeno/sax      26,504    0
211kb/xeno/dom   1,043,016    0
Benchmark xeno-memory-bench: FINISH
Benchmark xeno-speed-bench: RUNNING...
benchmarking 4KB/hexml-dom
xeno-speed-bench: <stdout>: commitBuffer: invalid argument (invalid character)
time                 7.325 Benchmark xeno-speed-bench: ERROR

Looks like a locale problem, but my locale seems fine:

$ locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Pretty much the same system as at work, except that stack is more recent here:

Version 1.7.1 x86_64
Compiled with:
- Cabal-2.2.0.1
- Glob-0.9.2
- HUnit-1.6.0.0
- QuickCheck-2.10.1
- StateVar-1.1.1.0
- aeson-1.2.4.0
- aeson-compat-0.3.8
- annotated-wl-pprint-0.7.0
- ansi-terminal-0.8.0.4
- ansi-wl-pprint-0.6.8.2
- array-0.5.2.0
- asn1-encoding-0.9.5
- asn1-parse-0.9.4
- asn1-types-0.3.2
- async-2.1.1.1
- attoparsec-0.13.2.2
- attoparsec-iso8601-1.0.0.0
- auto-update-0.1.4
- base-4.10.1.0
- base-compat-0.9.3
- base-orphans-0.7
- base-prelude-1.2.1
- base16-bytestring-0.1.1.6
- base64-bytestring-1.0.0.1
- basement-0.0.7
- bifunctors-5.5.2
- binary-0.8.5.1
- bindings-uname-0.1
- bitarray-0.0.1.1
- blaze-builder-0.4.1.0
- blaze-html-0.9.1.1
- blaze-markup-0.8.2.1
- byteable-0.1.1
- bytestring-0.10.8.2
- call-stack-0.1.0
- case-insensitive-1.2.0.11
- cereal-0.5.5.0
- clock-0.7.2
- colour-2.3.4
- comonad-5.0.3
- conduit-1.3.0.3
- conduit-extra-1.3.0
- connection-0.2.8
- containers-0.5.10.2
- contravariant-1.4.1
- cookie-0.4.4
- cpphs-1.20.8
- cryptohash-0.11.9
- cryptohash-sha256-0.11.101.0
- cryptonite-0.25
- cryptonite-conduit-0.2.2
- data-default-class-0.1.2.0
- deepseq-1.4.3.0
- digest-0.0.1.2
- directory-1.3.0.2
- distributive-0.5.3
- dlist-0.8.0.4
- easy-file-0.2.2
- echo-0.1.3
- ed25519-0.0.5.0
- either-5
- exceptions-0.8.3
- extra-1.6.8
- fast-logger-2.4.11
- file-embed-0.0.10.1
- filelock-0.1.1.2
- filepath-1.4.1.2
- foundation-0.0.20
- free-5.0.2
- fsnotify-0.2.1.1
- generic-deriving-1.12.1
- ghc-boot-th-8.2.2
- ghc-prim-0.5.1.1
- gitrev-1.3.1
- hackage-security-0.5.3.0
- hashable-1.2.7.0
- haskell-src-exts-1.20.2
- haskell-src-meta-0.8.0.3
- hinotify-0.3.9
- hourglass-0.2.11
- hpack-0.28.2
- hpc-0.6.0.3
- hspec-2.4.8
- hspec-core-2.4.8
- hspec-discover-2.4.8
- hspec-expectations-0.8.2
- hspec-smallcheck-0.5.0
- http-api-data-0.3.7.2
- http-client-0.5.13
- http-client-tls-0.3.5.3
- http-conduit-2.3.1
- http-types-0.12.1
- integer-gmp-1.0.1.0
- integer-logarithms-1.0.2.1
- lifted-base-0.2.3.12
- logict-0.6.0.2
- memory-0.14.16
- microlens-0.4.8.3
- microlens-th-0.4.1.3
- mime-types-0.1.0.7
- mintty-0.1.2
- monad-control-1.0.2.3
- monad-logger-0.3.28.5
- monad-loops-0.4.3
- mono-traversable-1.0.8.1
- mtl-2.2.2
- mustache-2.3.0
- neat-interpolation-0.3.2.1
- network-2.6.3.5
- network-uri-2.6.1.0
- old-locale-1.0.0.7
- old-time-1.1.0.3
- open-browser-0.2.1.0
- optparse-applicative-0.14.2.0
- optparse-simple-0.1.0
- parsec-3.1.13.0
- path-0.6.1
- path-io-1.3.3
- path-pieces-0.2.1
- pem-0.2.4
- persistent-2.8.2
- persistent-sqlite-2.8.1.2
- persistent-template-2.5.4
- polyparse-1.12
- pretty-1.1.3.3
- primitive-0.6.4.0
- process-1.6.1.0
- profunctors-5.2.2
- project-template-0.2.0.1
- quickcheck-io-0.2.0
- random-1.1
- regex-applicative-0.3.3
- regex-applicative-text-0.1.0.1
- resource-pool-0.2.3.2
- resourcet-1.2.1
- retry-0.7.6.2
- rio-0.1.3.0
- rts-1.0
- safe-0.3.17
- scientific-0.3.6.2
- semigroupoids-5.2.2
- semigroups-0.18.4
- setenv-0.1.1.3
- silently-1.2.5
- smallcheck-1.1.4
- socks-0.5.6
- split-0.2.3.3
- stm-2.4.5.0
- stm-chans-3.0.0.4
- store-0.4.3.2
- store-core-0.4.4
- streaming-commons-0.1.19
- syb-0.7
- tagged-0.8.5
- tar-0.5.1.0
- template-haskell-2.12.0.0
- temporary-1.2.1.1
- text-1.2.3.0
- text-metrics-0.3.0
- tf-random-0.5
- th-abstraction-0.2.7.0
- th-expand-syns-0.4.4.0
- th-lift-0.7.10
- th-lift-instances-0.1.11
- th-orphans-0.13.5
- th-reify-many-0.1.8
- th-utilities-0.2.0.1
- time-1.8.0.2
- time-locale-compat-0.1.1.4
- tls-1.4.1
- transformers-0.5.2.0
- transformers-base-0.4.4
- transformers-compat-0.5.1.4
- typed-process-0.2.2.0
- unicode-transforms-0.3.4
- unix-2.7.2.2
- unix-compat-0.5.0.1
- unix-time-0.3.8
- unliftio-0.2.7.0
- unliftio-core-0.1.1.0
- unordered-containers-0.2.9.0
- uri-bytestring-0.3.2.0
- uuid-types-1.0.3
- vector-0.12.0.1
- vector-algorithms-0.7.0.1
- void-0.7.2
- x509-1.7.3
- x509-store-1.6.6
- x509-system-1.6.6
- x509-validation-1.6.10
- yaml-0.8.30
- zip-archive-0.3.2.5
- zlib-0.6.2

Warning: this is an unsupported build that may use different versions of
dependencies and GHC than the officially released binaries, and therefore may
not behave identically.  If you encounter problems, please try the latest
official build by running 'stack upgrade --force-download'.
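
A note on the `commitBuffer: invalid argument (invalid character)` error above: GHC programs typically raise it when the output handle's text encoding cannot represent a character being written. The output stops right after `7.325`, where the other runs print `μs`, so the `μ` is a plausible culprit. This is an assumption: it would mean the benchmark process ended up with a non-UTF-8 encoding on stdout even though the shell's locale is UTF-8. The sketch below shows how a program can inspect and override the encoding with `System.IO`; it is not taken from xeno's benchmark code.

```haskell
module Main where

import System.IO (hGetEncoding, hSetEncoding, stdout, utf8)

main :: IO ()
main = do
  -- Record the encoding the runtime picked up from the environment.
  before <- hGetEncoding stdout
  -- Force UTF-8 on stdout regardless of the inherited locale.
  hSetEncoding stdout utf8
  putStrLn ("stdout encoding was: " ++ show before)
  -- With UTF-8 in place, writing 'μ' no longer depends on the locale.
  putStrLn "time                 7.325 μs"
```

If the first line reports something other than UTF-8 when run the same way as the benchmark, the environment the process was started in is the likely cause rather than the benchmark itself.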

qrilka commented 6 years ago

@ocramz on my machine I see some increase in DOM parsing time, almost 5%: https://gist.github.com/qrilka/d36464cb52499bf1041b1bd7c0dd341d/revisions (the new results are the ones from the branch)

pkamenarsky commented 3 years ago

This should be solved by #49 and #51, please reopen if that's not the case.