scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

SQLAlchemy: AttributeError: _Connection__connection #377

Open yevgenpapernyk opened 5 years ago

yevgenpapernyk commented 5 years ago

I'm using PostgreSQL as backend for the workers.

The worker config is:

from __future__ import absolute_import
from .common import *

BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
SQLALCHEMYBACKEND_ENGINE = 'postgresql://postgres:example@localhost'
MAX_NEXT_REQUESTS = 2048
NEW_BATCH_DELAY = 3.0

The requirements.txt is:

Scrapy>=0.24.4
psycopg2
SQLAlchemy>=0.9.8
msgpack
frontera[sql,hbase,logging,tldextract,kafka,distributed,strategies]

After a while I'm getting this error periodically:

ERROR:sqlalchemy.queue:_Connection__connection
Traceback (most recent call last):
  File "/home/yp/FronteraFromScratch/venv/lib/python3.7/site-packages/frontera/contrib/backends/sqlalchemy/components.py", line 188, in get_next_requests
    self.session.commit()
  File "/home/yp/FronteraFromScratch/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1027, in commit
    self.transaction.commit()
  File "/home/yp/FronteraFromScratch/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 506, in commit
    self.close()
  File "/home/yp/FronteraFromScratch/venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 579, in close
    connection.close()
  File "/home/yp/FronteraFromScratch/venv/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 910, in close
    del self.__connection
AttributeError: _Connection__connection

dorellang commented 4 years ago

This stems from the fact that SQLAlchemy sessions are not thread-safe, and the DB worker that ships with Frontera generates batches off the main thread (the one on which the frontier and all the backend models were initialized).

A possible fix would be to only read from the message bus in that other thread and then schedule the actual batch generation to run on the main thread. This is what I do in a hacked version of the worker that I use myself, but I am hesitant to release it now because the way I do it is quite inelegant and it also adds some custom logic for other things. Anyway, that's the general idea if you or anybody else is willing to hack on it.
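
To illustrate the general idea only, here is a minimal sketch using nothing but the standard library. It is not Frontera's worker code: `consume_bus`, `run_main_loop`, `generate_batch` and the list of partition ids standing in for the message bus are all hypothetical names made up for this example. The secondary thread only forwards messages into a `queue.Queue`; the main thread, which would own the SQLAlchemy session, is the only one that does the batch generation.

```python
import queue
import threading


def consume_bus(messages, tasks):
    """Runs on the secondary thread: forwards messages, never touches the DB session."""
    for msg in messages:
        tasks.put(msg)
    tasks.put(None)  # sentinel telling the main loop to stop


def run_main_loop(tasks, generate_batch):
    """Runs on the main thread, the one that created the session."""
    results = []
    while True:
        msg = tasks.get()  # blocks until the reader thread hands over work
        if msg is None:
            break
        results.append(generate_batch(msg))
    return results


def demo():
    caller_thread = threading.get_ident()
    seen_threads = []

    def generate_batch(partition_id):
        # Hypothetical stand-in for the code that would query and commit
        # through the session; here it only records which thread ran it.
        seen_threads.append(threading.get_ident())
        return "batch-for-%s" % partition_id

    tasks = queue.Queue()
    # A plain list of partition ids stands in for the message bus.
    reader = threading.Thread(target=consume_bus, args=([0, 1, 2], tasks))
    reader.start()
    batches = run_main_loop(tasks, generate_batch)
    reader.join()
    return batches, all(t == caller_thread for t in seen_threads)


if __name__ == "__main__":
    print(demo())
```

In a real worker the main loop would also have to keep servicing whatever else runs on the main thread, so a blocking `tasks.get()` like this one is probably too simple; it is only meant to show the hand-off.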