barbeau opened this issue 11 years ago
Hmm, that definitely should not be happening, regardless of the database in use. (GTFSrDB should be robust enough to detect failed updates).
I guess that first failure is some kind of network connectivity error; that session should be either committed or rolled back, rather than breaking all future sessions. I'm guessing SQLAlchemy is keeping the changes in memory since they have been neither committed nor rolled back; a try...finally around the update code would probably be a good idea (if the session has not been successfully committed by the end of that block, we should roll it back, although that could still cause issues if we have connection errors).
It looks like everything is ending up in one session; commits are failing, and I don't think the code makes any provision to roll back changes.
So what we need is a mechanism to make sure that a new session is initiated on each iteration, regardless of what's happened before.
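Roughly what I have in mind - a sketch, not actual gtfsrdb code; `process_feed`, the SQLite URL, and the polling interval are placeholders:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
import time
import traceback

engine = create_engine('sqlite:///example.db')  # placeholder connection string
Session = sessionmaker(bind=engine)

def process_feed(session):
    """Placeholder for fetching the GTFS-rt feed and adding rows to the session."""
    pass

while True:
    session = Session()  # fresh session on every iteration
    try:
        process_feed(session)
        session.commit()
    except Exception:
        # A failed fetch or flush rolls back this session only, instead of
        # leaving it half-flushed and breaking every later iteration.
        session.rollback()
        traceback.print_exc()
    finally:
        session.close()
    time.sleep(30)  # placeholder polling interval
```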
I doubt the feed version makes any difference; IIRC that's just a warning. As far as I know there is no public GTFSr spec version 0.1.
@mattwigway thanks, that makes sense. I'm tied up this week with other projects, but I'll hopefully have a chance next week to look at this.
A little more info on this issue - I've had the script running intermittently for a while archiving data, and today I happened to have the logging window open while the source GTFS-rt feed was having issues. As the error message indicates, it seems that an HTTP 404 error triggers the problem. The script is fine with empty GTFS-rt datasets (I saw those today too), but a 404 starts the avalanche of `Exception occurred in iteration (<class 'sqlalchemy.exc.InvalidRequestError'>, InvalidRequestError("This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback()."` errors. After this point, it also fails to add any further data to the database until you restart it. It's still on my TODO list to fix this issue, although now I have an email watchdog set up that alerts me when data stops flowing, so it's easier to recover.
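For what it's worth, guarding the fetch itself would also keep a bad HTTP response away from the session entirely - a rough sketch (the URL and function name are placeholders, not the actual gtfsrdb fetch code):

```python
import urllib.error
import urllib.request

FEED_URL = 'http://example.com/gtfs-realtime/trip-updates.pb'  # placeholder

def fetch_feed(url):
    """Return the raw protobuf bytes, or None if the feed is unavailable."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        print('Feed returned HTTP %d, skipping this iteration' % exc.code)
        return None

raw = fetch_feed(FEED_URL)
if raw is not None:
    pass  # parse the FeedMessage and write it to the database here
```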
Any updates on this issue?
An alternative approach (if you don't need updates more frequently than once per minute): launch a new process via cron for each update. I have a cron job that launches a one-time update every 3 minutes to pull in new service alerts. I like the crontab approach, since it's one less continually running process to worry about. Two downsides: cron jobs only run once per minute at their highest frequency, and the solution isn't as clean, since the logic is spread across two systems.
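Schematically, the crontab entry is something like the following (the paths, arguments, and log location are placeholders for whatever your install uses):

```
# run a one-shot service-alerts update every 3 minutes
*/3 * * * * /usr/bin/python /path/to/gtfsrdb.py <your usual feed/database args> >> /var/log/gtfsrdb_alerts.log 2>&1
```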
I like this solution. Is there an arg for it to do only a one-time update, or did you remove the infinite loop?
> is there an arg for it
Doh! That would help. I just pushed a commit with a `-1` / `--once` flag to iterate the loop a single time.
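In spirit, the flag amounts to something like this (a sketch, not the literal commit; gtfsrdb's actual option parsing and polling interval may differ):

```python
import argparse
import time

def run_one_update():
    """Placeholder for one fetch-and-store pass against the feed."""

parser = argparse.ArgumentParser(description='archive a GTFS-realtime feed')
parser.add_argument('-1', '--once', action='store_true',
                    help='run a single update and exit instead of looping forever')
args = parser.parse_args()

while True:
    run_one_update()
    if args.once:
        break
    time.sleep(30)  # placeholder polling interval
```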
I like @fpurcell's solution; running things once is simpler than an ongoing process. This issue fell off my priority list since the GTFS-rt feed I've been monitoring has been stable, only interrupted once over the last few months, and the watchdog I have running makes it easy to quickly reset without losing data.
cc @jadorno
After running several instances of gtfsrdb for a few days on the same server, one instance grew to over 3GB in memory and slowed the server down to a crawl. The other instances seemed more reasonable, between 58MB and 195MB. Growth of memory usage may be related to how many objects are processed in the feed, which would explain the seemingly different rates of growth. These instances were all connecting to a SQL Server 2008 database on a remote machine.
Here's the view of the Windows Task Manager:
This error message appeared in the console window: