Closed GoogleCodeExporter closed 9 years ago
Is there any reason to document it? This is the case with all Python code in
CPython (other implementations may differ), so the cases where things are
threadsafe are the notable exceptions.
Original comment by geoffers
on 16 Sep 2011 at 12:23
Everything (or nearly everything) in the Python standard library is threadsafe,
and most extensions are too. I think you are referring to the GIL, which is
different. The GIL prevents parallel execution, but if one thread is blocking,
the others can run safely.
The problem with the design of HTMLParser is that two threads can interfere
with each other, even if they are not running at the same time.
Original comment by devin.ba...@gmail.com
on 16 Sep 2011 at 1:05
This is clearly a defect. This is an object-oriented library in an
object-oriented language. Two parsers should be completely independent of each other,
with no shared global variables, and thus thread-safe. If that's not the case,
this is a defect.
Do I have to scrap my plans to convert a parallel web crawler from
BeautifulSoup to html5lib?
This looks fixable. The trouble spots include at least these global variables:
dom.py: moduleCache
That could be easily fixed with a lock in getDomModule. That's a once-per-parse
event, so there's no performance issue. All that's needed is
import threading
...
Lok = threading.Lock()  # module-level lock, created once at import time
with Lok:  # use the lock object itself; a Lock is not callable
    ...  # critical section
etree.py: moduleCache
Same issue.
etree.lxml: fullTree
This seems to be set only once, at load time. Is it changed elsewhere?
What have I missed? Some lower level library? Is Python's SAX parser unsafe?
This can and should be fixed.
Original comment by na...@animats.com
on 11 Mar 2012 at 8:23
https://github.com/html5lib/html5lib-python/issues/8
Original comment by geoffers
on 9 Apr 2013 at 9:28
Original issue reported on code.google.com by
devin.ba...@gmail.com
on 25 Jul 2011 at 9:26