scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License
1.29k stars 215 forks source link

[BUG] Clear content and drop tables options are not working properly #398

Open Prometheus3375 opened 4 years ago

Prometheus3375 commented 4 years ago

Now I am using PostgreSQL and got some problems with options SQLALCHEMYBACKEND_CLEAR_CONTENT and SQLALCHEMYBACKEND_DROP_ALL_TABLES when they are set to True. These options are used in this function:

# Source: frontera.contrib.backends.sqlalchemy.__init__.py

def check_and_create_tables(self, is_drop, is_clear, models):
    inspector = Inspector.from_engine(self.engine)
    for model in models:
        if is_drop:
            if model.__table__.name in inspector.get_table_names():
                model.__table__.drop(bind=self.engine)
        if model.__table__.name not in inspector.get_table_names():
            model.__table__.create(bind=self.engine)
        if is_clear:
            session = self.session_cls()
            session.execute(model.__table__.delete())
            session.close()

Let first talk about CLEAR_CONTENT option. It just not working, because commit is missing. Fixed version:

def check_and_create_tables(self, is_drop, is_clear, models):
    inspector = Inspector.from_engine(self.engine)
    for model in models:
        if is_drop:
            if model.__table__.name in inspector.get_table_names():
                model.__table__.drop(bind=self.engine)
        if model.__table__.name not in inspector.get_table_names():
            model.__table__.create(bind=self.engine)
        if is_clear:
            session = self.session_cls()
            session.execute(model.__table__.delete())
            session.commit()
            session.close()

I also think that the session can be created only once, before loop (and committed and closed after).

Now about DROP_ALL_TABLES. It drops tables, but it seems inspector.get_table_names() does not read names directly from database, just uses cached version of database. I. e. model.__table__.name not in inspector.get_table_names() is False after a table is dropped. So, tables are not recreated and this causes errors. It can be fixed in 2 ways:

  1. Recreate the inspector after tables dropping.
  2. Recreate a table immediately after drop.

I choose 2nd way. Fixed version:

def check_and_create_tables(self, is_drop, is_clear, models):
    inspector = Inspector.from_engine(self.engine)
    for model in models:
        table_present = model.__table__.name in inspector.get_table_names()
        if table_present and is_drop:
            model.__table__.drop(bind = self.engine)
        if not table_present or is_drop:
            model.__table__.create(bind = self.engine)
        if is_clear:
            session = self.session_cls()
            session.execute(model.__table__.delete())
            session.commit()
            session.close()