passiomatic / coldsweat

Web RSS aggregator and reader compatible with the Fever API
MIT License

Database (SQLite?) issue #104

Closed: SkyCrawl closed this issue 7 years ago

SkyCrawl commented 7 years ago

Hey again. I'm not sure if my problem is related to issue #14, or whether it's a Coldsweat issue at all, but my running instance seems to have "burned out" earlier today after running for many days straight. The log only says:

[2016-11-08 12:05:41,060] 23583 ERROR Traceback (most recent call last):
  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/app.py", line 118, in __call__
    app_iter = self.app(environ, start_response)

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/cascade.py", line 99, in __call__
    return last_app(environ, start_response)

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/session.py", line 90, in __call__
    return self.app(environ, start_response)

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/app.py", line 71, in __call__
    response = handler(*args)

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/frontend.py", line 243, in feed_fetch_all_now
    fc.fetch_all_feeds()

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/controllers.py", line 301, in fetch_all_feeds
    self.fetch_feeds(feeds)

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/controllers.py", line 276, in fetch_feeds
    p.map(feed_worker, feeds)

  File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()

  File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value

DatabaseError: database disk image is malformed

[2016-11-08 12:06:09,615] 23583 ERROR Traceback (most recent call last):
  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/app.py", line 118, in __call__
    app_iter = self.app(environ, start_response)

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/cascade.py", line 99, in __call__
    return last_app(environ, start_response)

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/session.py", line 90, in __call__
    return self.app(environ, start_response)

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/app.py", line 71, in __call__
    response = handler(*args)

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/frontend.py", line 243, in feed_fetch_all_now
    fc.fetch_all_feeds()

  File "/share/CACHEDEV1_DATA/.qpkg/_MyAutorun/coldsweat/coldsweat/controllers.py", line 296, in fetch_all_feeds
    feeds = list(q)

  File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/site-packages/peewee.py", line 2883, in __len__
    return len(self.execute())

  File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/site-packages/peewee.py", line 2000, in __len__
    return self.count

  File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/site-packages/peewee.py", line 1996, in count
    self.fill_cache()

  File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/site-packages/peewee.py", line 2043, in fill_cache
    self.next()

  File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/site-packages/peewee.py", line 2029, in next
    obj = self.iterate()

  File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/site-packages/peewee.py", line 2006, in iterate
    row = self.cursor.fetchone()

ProgrammingError: Cannot operate on a closed database.

The second error seems to happen only as a consequence of the first and probably holds no real significance on its own.

EDIT:

As the issue keeps repeating itself, I'll try to experiment and dig deeper :).

passiomatic commented 7 years ago

Thank you for your report. If you haven't already, I suggest switching the log level to DEBUG; it prints out a lot of information, including connection opening and closing.

Generally speaking, dealing with SQLite in a multiprocessing environment is not easy, since it is not really designed for concurrency. A while ago I read this: http://beets.io/blog/sqlite-nightmare.html.

Also, try switching to processes: 0 in the config file and see if the problem pops up again.
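
Something along these lines in the config file (the section names shown here are approximate, so double-check them against the sample config):

[log]
level: DEBUG

[fetcher]
processes: 0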

SkyCrawl commented 7 years ago

Thank you @passiomatic. Before reading your post, I tried one or two ways to resolve the issue, with no luck. After reading it, I never got a chance to reproduce the issue again because I restarted the NAS frequently over the last two weeks. In any case, your post plus some googling and afterthought made me realize I had chosen a bad database backend for my needs, so I just spent about three hours migrating the SQLite database into the MariaDB included with my NAS's firmware. That also crosses the backup issue off my TODO list, since I intend to use Coldsweat in the long run :). So, a couple of questions:

  1. Would you like me to document the process of converting the SQLite disk image into a .sql script that can be imported into a MySQL database? (A rough sketch of what I did follows these questions.)
  2. Would you like me to continue running Coldsweat with SQLite (on a spare NAS of the same brand), so that I can try your suggestions and perhaps resolve the issue for other users?
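
For reference, my conversion followed this general shape (the paths and names below are just from my setup, and the dump needed a fair amount of manual editing for MySQL's dialect):

# dump the SQLite database to a plain SQL script
sqlite3 coldsweat.db .dump > coldsweat-dump.sql

# after editing the dump for MySQL syntax (quoting, AUTOINCREMENT,
# PRAGMA and transaction statements, etc.), import it into MariaDB
mysql -u coldsweat -p coldsweat < coldsweat-dump.sql
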
passiomatic commented 7 years ago

I believe point two is more important. It would be fantastic if you could keep running Coldsweat on SQLite with level: DEBUG and processes: 0 for a while. If issues still arise, we can take a look at the debug file together and see if we can track down the cause of the bug.

Regarding the first point, one can always export and reimport feeds via the export command. In the long run it will be possible to export and reimport saved entries across different databases (see #84), so it's a limitation that I can tolerate right now.

SkyCrawl commented 7 years ago

Okay. The spare NAS is running Coldsweat with the original SQLite database now. I've set the log level to DEBUG but kept the number of processes at 2 (the original value that might have caused this issue). First, let me reproduce the issue again; then, once we have the detailed log, I'll try zeroing out the number of processes :).

SkyCrawl commented 7 years ago

A little off topic:
When I tried to get Coldsweat talking to MariaDB, localhost in the config URL did not work (see here). Perhaps 127.0.0.1 would be a better default? I can confirm it resolved the issue for me :).
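
For anyone else hitting this: MySQL clients often treat localhost as a request for a Unix socket connection rather than TCP, which is probably why 127.0.0.1 (which forces TCP) behaves differently. In the connection URL that looks roughly like this (the exact key name may differ in your config):

# did not work for me:
connection_url: mysql://coldsweat:secret@localhost/coldsweat

# works:
connection_url: mysql://coldsweat:secret@127.0.0.1/coldsweat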

SkyCrawl commented 7 years ago

Okay, I managed to reproduce the issue again, this time on the other NAS with the following DEBUG log:

[2016-12-10 19:45:17,013] 30748 DEBUG connecting
[2016-12-10 19:45:17,340] 30748 DEBUG connecting
[2016-12-10 19:45:17,491] 30748 DEBUG closing connection
[2016-12-10 19:45:18,141] 30748 ERROR Traceback (most recent call last):
  File "/share/HDA_DATA/.qpkg/QColdsweat/coldsweat/coldsweat/app.py", line 120, in __call__
    app_iter = self.app(environ, start_response)

  File "/share/HDA_DATA/.qpkg/QColdsweat/coldsweat/coldsweat/cascade.py", line 99, in __call__
    return last_app(environ, start_response)

  File "/share/HDA_DATA/.qpkg/QColdsweat/coldsweat/coldsweat/session.py", line 90, in __call__
    return self.app(environ, start_response)

  File "/share/HDA_DATA/.qpkg/QColdsweat/coldsweat/coldsweat/app.py", line 73, in __call__
    response = handler(*args)

  File "/share/HDA_DATA/.qpkg/QColdsweat/coldsweat/coldsweat/frontend.py", line 243, in feed_fetch_all_now
    fc.fetch_all_feeds()

  File "/share/HDA_DATA/.qpkg/QColdsweat/coldsweat/coldsweat/controllers.py", line 296, in fetch_all_feeds
    feeds = list(q)

  File "/opt/lib/python2.7/site-packages/peewee.py", line 2883, in __len__
    return len(self.execute())

  File "/opt/lib/python2.7/site-packages/peewee.py", line 2000, in __len__
    return self.count

  File "/opt/lib/python2.7/site-packages/peewee.py", line 1996, in count
    self.fill_cache()

  File "/opt/lib/python2.7/site-packages/peewee.py", line 2043, in fill_cache
    self.next()

  File "/opt/lib/python2.7/site-packages/peewee.py", line 2029, in next
    obj = self.iterate()

  File "/opt/lib/python2.7/site-packages/peewee.py", line 2006, in iterate
    row = self.cursor.fetchone()

ProgrammingError: Cannot operate on a closed database.

It looks like there's nothing new compared to the original report :(. However, I feel like the issue is related to HDD sleep mode. When I hit my re-fetch bookmarklet moments ago, I could hear the NAS's HDD waking up, and the error was returned to my browser sooner than I could hear the HDD head's "click" that marks its ready state. Is there some kind of timeout deeper in the API? A race against time would be a great explanation of why the bookmarklet doesn't fail all the time :).
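
(For reference, Python's built-in sqlite3 module does have one timeout of its own: by default a connection waits up to 5 seconds for a locked database before raising an error. Whether that interacts with a sleeping HDD at all is an open question, but for completeness:)

import sqlite3

# timeout (default 5.0 seconds) is how long sqlite3 waits for a locked
# database before giving up; the path here is illustrative.
conn = sqlite3.connect('coldsweat.db', timeout=5.0)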

passiomatic commented 7 years ago

So here I'm assuming you are using processes: 2 like you said earlier.

It could well be that the HDD sleep mode closes the connection without notice. Reading the log, Coldsweat halts in "fetch_all_feeds" on the main process (id: 30748) when it tries to figure out which feeds to refresh. However, if the HDD was not available, why did the previous access to the db succeed? (I'm looking at the connection opening/closing immediately before the exception in the log.)

My feeling is that I'm closing the DB connection too early somewhere else. There's no harm in opening a SQLite connection more than once -- perhaps Peewee or even the sqlite module just reuses the current connection? -- but closing the connection, a shared object, is not a good idea.
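
To illustrate that concern with a minimal, self-contained sketch (this is not Coldsweat's code and the names are made up): the safe pattern with a multiprocessing pool is to give every worker its own connection, opened after the fork, instead of inheriting and later closing a shared one.

import multiprocessing
import sqlite3

_conn = None  # per-process connection, created after the fork

def _init_worker(db_path):
    # Runs once in every worker process: open a private connection
    # instead of sharing (and possibly closing) the parent's one.
    global _conn
    _conn = sqlite3.connect(db_path)

def feed_worker(feed_id):
    # Illustrative worker: each process queries through its own _conn.
    row = _conn.execute("SELECT ?", (feed_id,)).fetchone()
    return row[0]

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2,
                                initializer=_init_worker,
                                initargs=('coldsweat.db',))
    print(pool.map(feed_worker, [1, 2, 3]))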

SkyCrawl commented 7 years ago

So here I'm assuming you are using processes: 2 like you said earlier.

Indeed, that is the case :). As promised, I'm going to try 0 processes this time around, unless you'd like me to keep the number of processes and perhaps suggest some changes that could bring in more information?

However, if the HDD was not available why the previous access to the db succeed?

I'm assuming you meant the first 30748 DEBUG connecting message that is missing its 30748 DEBUG closing connection counterpart? Forgive me but from my point of view, I don't see how the previous access could have succeeded (the HDD was not ready yet after all), unless at least a portion of the database was cached within the connection. Could you please elaborate a bit more? :)

My feeling is that I'm closing the DB connection too early somewhere else. There's no harm in opening a SQLite connection more than once -- perhaps Peewee or even the sqlite module just reuses the current connection? -- but closing the connection, a shared object, is not a good idea.

Indeed, that could be a very good explanation, but if this is truly what happens, wouldn't the bookmarklet fail every time the HDD has to be woken up first (I'm thinking about deterministic behaviour here)? That doesn't seem to be the case, however :). It may be a premature observation from too little data, but so far I don't think I've ever reproduced the error on the first or second use of my bookmarklet after a NAS restart (with only the second one possibly having to wake up the HDD). Up until now, it's always been the third to fifth use (of which at least one or two had to wake up the HDD), which is quite strange, if true :). I should probably collect more data, and do so rather carefully...

I definitely defer to your opinion regarding your own software, but by the looks of it the behaviour is not deterministic, and in my experience that most likely means a timing issue related to threads/processes or, more generally, to sharing resources (e.g. the connection, or something within it, like you suggested). It's even possible that several of our hypothetical culprits combine to produce the issue. I've seen and heard about such things before and they're really nasty :(.

passiomatic commented 7 years ago

I'm assuming you meant the first 30748 DEBUG connecting message that is missing its 30748 DEBUG closing connection counterpart?

Yes.


Looking at your PR again, I've noticed you are doing something redundant: you create a FeedController object, but that isn't really necessary since FrontendApp is a subclass of FeedController. Instead, you can write self.fetch_all_feeds() in the feed_fetch_all_now() method body.

Using another instance of FeedController nests another database connection opening and a subsequent closing. Database closing is done in the base controller class's __del__ method, so it should trigger only when the object's lifecycle is finished. Still, it is worth trying to update your code as described above.
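
In code the change would look roughly like this (a sketch of the idea, not the exact frontend.py source):

# Before: a second FeedController instance brings its own database
# connection, which is later closed by that instance's __del__.
def feed_fetch_all_now(self):
    fc = FeedController()
    fc.fetch_all_feeds()

# After: FrontendApp already subclasses FeedController, so the method
# can be called on self and no extra connection is created or closed.
def feed_fetch_all_now(self):
    self.fetch_all_feeds()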

If this change doesn't help, try using processes: 0. EDIT: this uses only the main process for the fetch.

SkyCrawl commented 7 years ago

Wow, thank you! I'm going to try that immediately. If that resolves the issue, I'm going to feel really awful for wasting your time like this :(.

SkyCrawl commented 7 years ago

Okay, the issue is reproduced again and the redundancy you mentioned doesn't seem to be relevant. Going to set processes: 0 now... let's wait and see :).

SkyCrawl commented 7 years ago

Yup, I think it's safe to say now that processes: 0 did the trick. Thank you very much for your help, @passiomatic :).

passiomatic commented 7 years ago

Glad to hear that. At least your issue is fixed. I'm wondering why it is working now. Having zero processes simplifies things a lot, so it may be that database connections are not closed in the correct order when using multiprocessing.

In version 0.9.8 (not pushed yet) I'm using a different approach to managing connections - I'm opening and closing them explicitly, without relying on the base controller's __del__ handler. Gosh, that was really a bad idea.
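
The idea, very roughly (a sketch of the pattern, not the actual 0.9.8 code; names and paths are made up):

from peewee import SqliteDatabase

db = SqliteDatabase('data/coldsweat.db')  # illustrative path

def run_fetch(controller, feeds):
    # Open the connection right before the work and close it right after,
    # deterministically, instead of leaving the close to a __del__ finalizer.
    if db.is_closed():
        db.connect()
    try:
        controller.fetch_feeds(feeds)
    finally:
        db.close()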

SkyCrawl commented 7 years ago

In version 0.9.8 (not pushed yet) I'm using a different approach to managing connections - I'm opening and closing them explicitly, without relying on the base controller's __del__ handler.

Have you had a look at the handler's implementation to get deeper insight into the reasons for this behaviour? If you want me to test the new approach, give me a shout :).

One more question: what does the ProgrammingError: Cannot operate on a closed database. error actually mean? Do the SQLite connections get created from a master ('database') object?

EDIT: sorry, never mind that stupid question. This discussion is about something closing database connections ahead of time :).
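
(For the record, that exact message comes straight from Python's sqlite3 module whenever a connection, or a cursor created from it, is used after close(); a minimal reproduction:)

import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
conn.close()
cur.execute('SELECT 1')  # sqlite3.ProgrammingError: Cannot operate on a closed database.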

SkyCrawl commented 7 years ago

If the handler misbehaved like that, it would be a very logical explanation for this issue. But there's still something I couldn't shake off (a feeling, really) when I was thinking about this moments ago. The thing that puzzles me the most is that the bookmarklet works for a while and then fails - if the handler closed connections ahead of time, why didn't it do so the first time I called the bookmarklet? There must be some kind of control condition around the issue (or something non-deterministic, as I mentioned before). Also notice that once the error appeared, subsequent bookmarklet calls failed in the same way: assuming a call to the bookmarklet naturally creates new database connections, it couldn't do so anymore.

Logically speaking, there is likely a resource within the framework that gets consumed when multiprocessing is used and then prevents new connections from being created. Out of curiosity, I just loaded processes: 2 (version 0.9.7 this time) and called the bookmarklet. After filtering out all the feed-related log entries, this is what I was left with:

[2017-01-12 12:54:29,026] 21432 DEBUG opening connection
[2017-01-12 12:54:29,050] 21432 DEBUG opening connection
[2017-01-12 12:54:29,056] 21432 DEBUG opening connection
[2017-01-12 12:54:36,302] 21432 DEBUG opening connection
[2017-01-12 12:54:36,331] 21432 DEBUG session db54ff8a3f7073a5350a378c5ff45022949c537a created

[2017-01-12 12:54:36,727] 21432 DEBUG starting fetcher
[2017-01-12 12:54:37,359] 21686 DEBUG opening connection
[2017-01-12 12:54:37,400] 21687 DEBUG opening connection

[2017-01-12 12:57:26,326] 21432 INFO 41 feeds checked in 169.56s
[2017-01-12 12:57:26,482] 21432 DEBUG opening connection
[2017-01-12 12:57:26,535] 21432 DEBUG session 2c1e61438e01302490a3fa9e3ce6b92f7bfe540c created

Isn't it weird? No DEBUG closing connection... Moments ago I had a hunch this might be connected to thread starvation. Is it at all possible for the handler not to close connections and, consequently, for the application to run out of available threads? Notice where the original error is raised:

File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()

  File "/share/CACHEDEV1_DATA/.qpkg/Entware-ng/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value

Threadpool? :)

EDIT: Indeed, when I look at the log of my other NAS (which uses the internal MariaDB and sets processes: 2):

[2017-01-10 11:19:55,915] 28360 INFO 41 feeds checked in 36.47s
[2017-01-10 11:19:55,939] 28360 DEBUG closing connection
[2017-01-10 11:19:56,157] 28360 DEBUG connecting
[2017-01-10 11:24:58,428] 28360 DEBUG connecting

The DEBUG closing connection message is always printed after a completed feed fetch/check.

passiomatic commented 7 years ago

Threadpool? :)

It's the multiprocessing Pool object. I think it's unrelated.

Isn't it weird? No DEBUG closing connection ...

You may want to try out the 0.9.8-wip branch: https://github.com/passiomatic/coldsweat/commits/0.9.8-wip. I now flush the log file after fetch operations, so hopefully we can get a complete picture of what's going on. Also, you should see fewer DEBUG opening connection messages.
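
If it helps, switching an existing clone over is just:

git fetch origin
git checkout 0.9.8-wip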

SkyCrawl commented 7 years ago

I would test the new version, but now I can't even reproduce the issue on version 0.9.6. Strange... as far as I can tell, the only thing that changed in the meantime is the firmware version of my NAS (I upgraded it recently). It's also possible that the Python Entware package has been upgraded (currently 2.7.12-2), but I don't think so...

In any case, I'll continue to observe the situation :).