HTMLParser is not threadsafe

GoogleCodeExporter commented 9 years ago

Hi. I realize this is by design, but it's not intuitive, since similar standard 
classes like YamlDecoder and JSONDecoder are threadsafe.

It would be clearer if the input stream were supplied to the constructor, 
as with ElementTree.

But at least, please document this in the class.

Original issue reported on code.google.com by devin.ba...@gmail.com on 25 Jul 2011 at 9:26

GoogleCodeExporter commented 9 years ago

Is there any reason to document it? This is the case with all Python code in 
CPython (other implementations may differ), so the cases where things are 
threadsafe are the notable exceptions.

Original comment by geoffers on 16 Sep 2011 at 12:23

GoogleCodeExporter commented 9 years ago

(Most?) Everything in the Python standard library is threadsafe, and most 
extensions are too. I think you are referring to the GIL, which is different. 
That prevents parallel execution, but if one thread is blocking, the others can 
run safely.

The problem with the design of HTMLParser is that two threads can interfere 
with each other, even if they are not running at the same time.
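To illustrate the kind of interference meant here, a minimal sketch follows. ToyParser is a hypothetical stand-in, not html5lib's actual HTMLParser; the only assumption is that per-parse state lives on the instance. The two threads are forced to take strict turns, so they never run at the same time, yet sharing one instance still mixes their input.

```python
import threading


class ToyParser:
    """Hypothetical parser that keeps per-parse state on the instance."""

    def __init__(self):
        self.tokens = []

    def feed(self, chunk):
        # State accumulates on the instance, not per call.
        self.tokens.append(chunk)

    def close(self):
        # Hand back what was collected and reset for the next parse.
        result, self.tokens = self.tokens, []
        return result


def feed_in_turns(parser, chunks, my_turn, other_turn):
    # Wait for our turn, feed one chunk, then hand over to the other thread.
    for chunk in chunks:
        my_turn.wait()
        my_turn.clear()
        parser.feed(chunk)
        other_turn.set()


def run(parser_a, parser_b):
    a_turn, b_turn = threading.Event(), threading.Event()
    a = threading.Thread(target=feed_in_turns,
                         args=(parser_a, ["<p>", "one"], a_turn, b_turn))
    b = threading.Thread(target=feed_in_turns,
                         args=(parser_b, ["<b>", "two"], b_turn, a_turn))
    a.start()
    b.start()
    a_turn.set()  # thread A goes first; after that they strictly alternate
    a.join()
    b.join()


# One instance shared by both threads: the two inputs end up interleaved.
shared = ToyParser()
run(shared, shared)
mixed = shared.close()

# One instance per thread: each parse stays independent.
parser_a, parser_b = ToyParser(), ToyParser()
run(parser_a, parser_b)
separate = (parser_a.close(), parser_b.close())
```

With a parser per thread, no locking is needed for this state at all; the interference comes only from sharing the instance.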

Original comment by devin.ba...@gmail.com on 16 Sep 2011 at 1:05

GoogleCodeExporter commented 9 years ago

This is clearly a defect.  This is an object-oriented library in an 
object-oriented language. Two parsers should be completely independent of each 
other, with no shared global variables, and thus thread-safe. If that's not the 
case, this is a defect.

Do I have to scrap my plans to convert a parallel web crawler from 
BeautifulSoup to html5lib? 

This looks fixable. The trouble spots include at least these global variables:

dom.py: moduleCache

That could be easily fixed with a lock in getDomModule. That's a once-per-parse 
event, so there's no performance issue.  All that's needed is

import threading
...
Lok = threading.Lock()  # create once, at module level

with Lok:  # use the lock object itself; calling Lok() raises TypeError
  ... critical section...
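A slightly fuller sketch of the same idea, for illustration only. The names _module_cache, _module_cache_lock, get_cached_module and build are hypothetical and do not come from html5lib's source; the point is just that the check-and-fill of a module-level cache happens entirely under one lock.

```python
import threading

_module_cache = {}                      # hypothetical stand-in for moduleCache
_module_cache_lock = threading.Lock()   # created once, at import time


def get_cached_module(key, build):
    """Return the cached value for key, building it at most once."""
    with _module_cache_lock:
        # The lookup and the insert sit in the same critical section, so two
        # threads can never both decide that the entry is missing.
        if key not in _module_cache:
            _module_cache[key] = build(key)
        return _module_cache[key]


build_calls = []


def build(key):
    # Stand-in for the expensive step; records how often it runs.
    build_calls.append(key)
    return object()


results = []


def worker():
    results.append(get_cached_module("minidom", build))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All eight threads get the same cached object and the build step runs once. Since the lookup happens once per parse, holding the lock for it should cost very little.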

etree.py: moduleCache

Same issue.

etree.lxml: fullTree

This seems to be set only once, at load time. Is it changed elsewhere? 

What have I missed?  Some lower level library?  Is Python's SAX parser unsafe? 

This can and should be fixed.

Original comment by na...@animats.com on 11 Mar 2012 at 8:23

GoogleCodeExporter commented 9 years ago

https://github.com/html5lib/html5lib-python/issues/8

Original comment by geoffers on 9 Apr 2013 at 9:28

Changed state: GitHub

HTMLParser is not threadsafe #189