Here's a call graph of the command build: callgraph.pdf. Sure looks like most of the time is in the actual SQL calls. We can of course go for higher-level optimization.
Some thrashing with google: https://docs.sqlalchemy.org/en/13/faq/performance.html Nothing stands out. Might be an issue with datetime conversions?
I wonder about the fact that we're making a TON of file objects for the ORM to keep track of. Maybe that's why each successive one gets slower: it's maintaining state between those objects? All we're using is the file_id, but there may be other places that call getFiles that need the full ORM objects. (Try using .values() instead of .all()?) At the same time, they're going out of scope as soon as the runMe is built, so they shouldn't be piling up. Maybe it's worth trying an explicit session.expire....
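For reference, a minimal sketch of that idea: ask the query for just the `file_id` column so SQLAlchemy returns plain values instead of tracked `File` instances. The `session`, `File`, and argument names here are placeholders, not the actual `getFiles` signature.

```python
# Hypothetical sketch: select only file_id instead of full File ORM objects,
# so the session has no mapped instances to build and track.
def get_file_ids(session, File, product_id, t_start, t_stop):
    rows = (session.query(File.file_id)
            .filter(File.product_id == product_id,
                    File.utc_stop_time >= t_start,
                    File.utc_start_time <= t_stop)
            .order_by(File.interface_version,
                      File.quality_version,
                      File.revision_version)
            .all())
    return [file_id for (file_id,) in rows]  # plain values, no ORM state
```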
We're also passing around a lot of references to the self.dbu object, one in every runMe.
https://dev.to/zchtodd/sqlalchemy-performance-anti-patterns-and-their-fixes-4bmm also suggests that making the full ORM objects is often an issue.
But since it looks like it's the actual sqlite execution that's taking time, that suggests it's not a problem with making the full ORM mapping objects.
Mem-mapping the sqlite might also help, but I'm still suspicious of how it slows over time: https://stackoverflow.com/questions/53396186/improve-sqlite-speed-when-data-is-not-kept-on-cache
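If we want to try the mmap suggestion from that post, it would look something like this; a sketch only, and the engine URL and mmap size are illustrative.

```python
# Hypothetical sketch: turn on SQLite memory-mapped I/O for every new
# connection the engine makes, via PRAGMA mmap_size.
from sqlalchemy import create_engine, event

engine = create_engine("sqlite:///dbprocessing.sqlite")  # illustrative path

@event.listens_for(engine, "connect")
def set_sqlite_mmap(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA mmap_size=268435456")  # 256 MiB; tune to taste
    cursor.close()
```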
Example with current master on standalone machine: 53s. Example using a pure date-based lookup: 4s.
So the big difference is the switch from lookup by `utc_file_date` to lookup by `utc_start_time` and `utc_stop_time`, in order to deal with the level 0 files that span days. We do have an index on `utc_start_time` and `utc_stop_time`, but we also had an index on `utc_file_date`, and it got faster when using `between` for the lookup instead of a greater-than and less-than. However:

`utc_file_date` lookup with >= and <= instead of `between`: 4s

So that's not the problem.
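For reference, the two lookup styles being timed here look roughly like this in SQLAlchemy terms (illustrative names, not the actual getFiles code):

```python
# Hypothetical comparison of the two utc_file_date lookup styles timed above.
def by_between(session, File, d_start, d_stop):
    return session.query(File).filter(
        File.utc_file_date.between(d_start, d_stop)).all()

def by_range(session, File, d_start, d_stop):
    return session.query(File).filter(
        File.utc_file_date >= d_start,
        File.utc_file_date <= d_stop).all()
```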
If I keep the comparison by utc_start_time/utc_stop_time but use dates instead of datetimes for the comparisons, it's also no real difference...53s.
My guess is that the "slowdown over time" is just from processes later in the queue having more inputs, and thus needing to do more UTC comparisons.
So it looks like the datetime comparison is much more expensive than the date comparison. Looks like this is being stored as text:
```
sqlite> select typeof(utc_start_time) from file limit 1;
typeof(utc_start_time)
text
sqlite> select typeof(utc_file_date) from file limit 1;
typeof(utc_file_date)
text
```
I wonder if it isn't thrashing on converting every single string in the database into some sort of datetime.
Doing comparisons to datetime strings instead of datetimes takes...53 seconds. So either there's something inherently slow here or that's not a way to sidestep the datetime function conversions.
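The string-comparison experiment is roughly this (a sketch; the real query lives in getFiles, and the storage format string is an assumption worth double-checking against the database):

```python
# Hypothetical sketch of the "compare against strings" experiment: format the
# datetimes as text (the assumed storage format) and filter on the strings.
def get_file_ids_by_string(session, File, product_id, t_start, t_stop):
    fmt = "%Y-%m-%d %H:%M:%S.%f"  # assumed to match what SQLite has stored
    return [fid for (fid,) in
            session.query(File.file_id)
                   .filter(File.product_id == product_id,
                           File.utc_stop_time >= t_start.strftime(fmt),
                           File.utc_start_time <= t_stop.strftime(fmt))
                   .all()]
```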
I am seriously tempted to add a TAI column to the database to deal with this :)
Actual statement:
```sql
SELECT file.file_id, file.filename, file.utc_file_date, file.utc_start_time, file.utc_stop_time, file.data_level, file.interface_version, file.quality_version, file.revision_version, file.verbose_provenance, file.check_date, file.quality_comment, file.caveats, file.file_create_date, file.met_start_time, file.met_stop_time, file.exists_on_disk, file.quality_checked, file.product_id, file.shasum, file.process_keywords
FROM file
WHERE file.product_id = ? AND file.utc_stop_time >= ? AND file.utc_start_time <= ? ORDER BY file.interface_version, file.quality_version, file.revision_version
```
Bound parameters are indeed strings if I'm putting strings in as arguments to `filter`, and datetimes if they're datetimes. So it's possible that, based on the datetime hint in the database, the Python sqlite3 code is consistently converting the string to datetime and then having to convert it back, or something ridiculous. But if I pass in a non-datetime string ('foo'), it doesn't error out...no idea what sort of weirdness it involves, though!
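One way to see exactly what gets bound, without guessing, is to compile the query against the SQLite dialect and look at the parameter values; a sketch, assuming a Query object like the one behind the statement above:

```python
# Hypothetical sketch: compile a query for the SQLite dialect and inspect the
# bound parameter values to see whether they are str or datetime.
from sqlalchemy.dialects import sqlite

def show_bound_params(query):
    compiled = query.statement.compile(dialect=sqlite.dialect())
    for name, value in compiled.params.items():
        print(name, type(value), value)
```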
I added another column that was explicitly of TEXT type, copied `utc_start_time` and `utc_stop_time` into it, and did the lookup explicitly with a string instead of a datetime. That took 18s. So it looks like we lose about a factor of 4 in speed by doing a time lookup instead of just a date lookup, and a factor of 3-ish on top of that for the datetime processing back-and-forth.
I'm not sure there's an obvious way forward.
> I am seriously tempted to add a TAI column to the database to deal with this :)
I have heard worse ideas!
It probably wouldn't be too hideous to test how fast/slow that runs. We'd pick up a spacepy.time dependency; I don't want to do all that by hand.
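If we did go the TAI route, the conversion itself is small with spacepy; a sketch only, assuming spacepy is importable, and the helper name is illustrative:

```python
# Hypothetical sketch: derive TAI seconds from an existing UTC datetime
# using spacepy.time.Ticktock.
import spacepy.time as spt

def to_tai(utc_datetime):
    """Return TAI seconds for a UTC datetime."""
    return spt.Ticktock(utc_datetime, 'UTC').TAI[0]
```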
Or we could do something stupid like UNIX time, but I don't really like the idea of half-measures there.
> Or we could do something stupid like UNIX time, but I don't really like the idea of half-measures there.
Yeah, Unix time is fast and a pretty good choice; it's easy to query on and you can use the current time as well. Might be a good replacement for some of the logging.
I had a thought that this might be addressable with a temporary table and a join rather than committing to updating the file table. Will give that a whirl.
Doing the date comparison in Python, leaving everything about the database the same, takes 23 seconds (vs. 53s for having the database do the work.)
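The 23-second variant is basically: pull the candidates by product only, then do the overlap comparison client-side. A sketch (not the actual ProcessQueue code):

```python
# Hypothetical sketch of doing the time-overlap comparison in Python instead
# of in the WHERE clause.
def get_file_ids_python_compare(session, File, product_id, t_start, t_stop):
    candidates = session.query(File).filter(
        File.product_id == product_id).all()
    return [f.file_id for f in candidates
            if f.utc_stop_time >= t_start and f.utc_start_time <= t_stop]
```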
Adding a table with just file_id and Unix start/stop times: 35 seconds. Doing the query with a join on this table: 9.4 s.
So it takes about twice as long as the date-based lookup (which isn't a surprise, it's doing twice as many comparisons!), and I really haven't tried to optimize the query e.g. trying some eager loading.
The benefit of doing this with a separate table and a join, instead of adding a column, is that it's a little more reversible if we decide on, say, TAI instead of Unix time.
Worth reading this when deciding what number-based time we use: https://www.sqlite.org/lang_datefunc.html. It points to julianday maybe being a good choice.
Using a new column in the File table rather than a new table + a join, it takes 30s to add a column and then 21s to do the query on the new column. So the new table + join is not only more flexible, it's faster. (There might be something here in the sqlalchemy/database interaction that's the real speedup, but whatever, it works.)
Even better, by using a separate table and indexing the Unix time columns, it's down to 4.5 s, basically the same as the date-based comparisons.
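Roughly what that side table and join look like; a sketch only, and the table/column names here are illustrative rather than what the eventual commit will use:

```python
# Hypothetical sketch of the indexed side table plus the join-based query.
from sqlalchemy import Column, Integer, Float, ForeignKey
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Unixtime(Base):
    __tablename__ = 'unixtime'
    file_id = Column(Integer, ForeignKey('file.file_id'), primary_key=True)
    unix_start = Column(Float, index=True)  # seconds since the Unix epoch
    unix_stop = Column(Float, index=True)

def get_file_ids_unix(session, File, product_id, unix_start, unix_stop):
    return [fid for (fid,) in
            session.query(File.file_id)
                   .join(Unixtime, File.file_id == Unixtime.file_id)
                   .filter(File.product_id == product_id,
                           Unixtime.unix_stop >= unix_start,
                           Unixtime.unix_start <= unix_stop)
                   .all()]
```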
I'm going to run with this solution. Unix time works for now and we can discuss unix vs. Julian vs. TAI later.
Almost done. I just need to write a migration script (which I think will ultimately grow into a migrate-everything script, so it'll take a little bit to structure properly).
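The migration is conceptually just a backfill of the new table from the existing UTC columns; something like this sketch, reusing the illustrative `Unixtime` class above:

```python
# Hypothetical sketch of the backfill: compute Unix times from the stored
# UTC datetimes for every existing file row (ignores fractional seconds).
import calendar

def backfill_unixtime(session, File):
    for f in session.query(File).all():
        session.add(Unixtime(
            file_id=f.file_id,
            unix_start=calendar.timegm(f.utc_start_time.utctimetuple()),
            unix_stop=calendar.timegm(f.utc_stop_time.utctimetuple())))
    session.commit()
```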
This is looking a lot better.
Am getting a lot of `Was not newest version in buildChildren: file_id=(678, None)`
That's what you get when something is on the process queue but is not the latest version of the file. If there's a reason a newer version has shown up, that makes sense (e.g. ingesting v1.0.0 and v1.1.0 at the same time, or reprocessing all L1 and all L2, so the L1 makes a new L2 before the L2 gets hit). But otherwise it probably shouldn't happen...reprocessBy shouldn't put old files on the process queue, for instance.
> That's what you get when something is on the process queue but is not the latest version of the file. If there's a reason a newer version has shown up, that makes sense (e.g. ingesting v1.0.0 and v1.1.0 at the same time, or reprocessing all L1 and all L2, so the L1 makes a new L2 before the L2 gets hit). But otherwise it probably shouldn't happen...reprocessBy shouldn't put old files on the process queue, for instance.
Ok, this is exactly the scenario. I am building a fresh HOPE database and doing test runs, just slamming a month of files in there, so there are multiple versions of L0 and mag files.
If it only runs them once we are good. (I'll make sure)
As mentioned in #20, the command building (`buildChildren`) in ProcessQueue is pretty slow, and it's maybe slower than it had been in the past, but it's hard to nail down exactly where this got slower. It seems to get slower as it goes further along the process queue, i.e. the first few build quickly but then it slows down; this might just be a coincidence.

**Minimal example to reproduce issue:**
I'm grabbing a copy of the HOPE database to play with and:
Because of #22, I've edited my copy of `ProcessQueue.py` to just skip the call to `runMe.runner`, so I can time the command build only.

**OS, Python version, and dependency version information:**
Production environment where this was observed:
Further testing will be on:
We'll see if that makes a difference...
**Version of dbprocessing**

07f3e3a1ace8f4f72e482e2dd47b951fe00d140d checked out from github (master)
**Closure condition**
This issue should be closed when the speed has been evaluated and either a speedup commit merged, or the issue deemed intractable.