ndmitchell / tagsoup

Haskell library for parsing and extracting information from (possibly malformed) HTML/XML documents

ByteString version is too slow #11

Open ndmitchell opened 9 years ago

ndmitchell commented 9 years ago

From https://code.google.com/p/ndmitchell/issues/detail?id=290

vasyl said: I used the attached code as a benchmark; "page.html" can be any page, for example the Hackage package list.

On my PC, the String version of tagsoup runs in 132 ms, and the ByteString version in 453 ms.

{-# LANGUAGE NoMonomorphismRestriction #-}
import Text.HTML.TagSoup
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL
import Criterion.Main

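-- Parse the input and count the resulting tags. Left without a type signature
-- so that, with NoMonomorphismRestriction, it stays polymorphic over the
-- input string type and can be benchmarked on String and ByteString alike.
tagsCount = length . parseTags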

main = do
  -- Read the same page as a strict ByteString, a String and a lazy ByteString.
  fb <- B.readFile "page.html"
  fs <- readFile "page.html"
  fl <- BL.readFile "page.html"
  defaultMain [
    bench "String" $ nf tagsCount fs,
    bench "ByteString" $ nf tagsCount fb,
    bench "Lazy ByteString" $ nf tagsCount fl,
    bench "ByteString to String" $ nf tagsCount (B.unpack fb)
    ]

IMO this behavior is bad, because everyone expects ByteString to be faster. I think the best option is to disable ByteString support for now, because converting a ByteString to a String and parsing that is faster anyway (see the last benchmark).

@ndmitchell replied:

Hmm, there is a benchmark in tagsoup, and I found the String and ByteString versions to be the same speed. The reason I included ByteString is that it takes less memory, which does matter for some applications.

I'll see how your benchmarks differ, and combine them into mine. Tagsoup-0.8 was intended to be an interface release, with Tagsoup-0.9 providing speed. With any luck I'll have ByteString going substantially faster in the next release.

ChristopherKing42 commented 8 years ago

Probably because uncons on a ByteString is less efficient than uncons on a String.
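To make the uncons point concrete, here is a minimal sketch of a character-at-a-time loop written once for String and once for strict ByteString. It is not tagsoup's actual parser; the names countLtString and countLtBS are hypothetical and exist only for this illustration. On a String, uncons is a pattern match on a list cons cell. On a strict ByteString, each uncons checks the length, reads a byte and builds a new slice value for the remainder, which can mean extra work per character unless GHC optimises it away.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.List (uncons)

-- Hypothetical helper: count '<' characters in a String,
-- consuming one character per uncons call.
countLtString :: String -> Int
countLtString = go 0
  where
    go n s = case uncons s of
      Nothing -> n
      Just (c, rest) ->
        let n' = if c == '<' then n + 1 else n
        in n' `seq` go n' rest

-- Hypothetical helper: the same loop over a strict ByteString.
-- Each B.uncons returns the head Char and a new ByteString slice.
countLtBS :: B.ByteString -> Int
countLtBS = go 0
  where
    go n s = case B.uncons s of
      Nothing -> n
      Just (c, rest) ->
        let n' = if c == '<' then n + 1 else n
        in n' `seq` go n' rest

main :: IO ()
main = do
  let html = "<html><body><p>hi</p></body></html>"
  print (countLtString html)      -- prints 6
  print (countLtBS (B.pack html)) -- prints 6
```

Both loops give the same answer; whether the ByteString one is actually slower depends on the GHC version and optimisation flags, so it is worth measuring with the criterion benchmark above rather than assuming.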

ndmitchell commented 8 years ago

Yes, but really it needs a significant rewrite. I should probably open a ticket with the plans for tagsoup and the parser going forward...