groue / GRDB.swift

A toolkit for SQLite databases, with a focus on application development
MIT License
6.74k stars 697 forks source link

Memory inefficiency in DATETIME parsing #333

Closed sobri909 closed 6 years ago

sobri909 commented 6 years ago

Hey 😄 I've finally found some time to get back to database code. Yay!

So anyway, I've found an interesting one while dealing with very large result sets that include DATETIME fields.

I've got some result sets that can get up into the tens or hundreds of thousands (the one I'm looking at right now is ~40k rows), and I need to walk the entire sequence in a single run, as efficiently as possible. But while walking through the results I've been seeing memory use balloon to over 600MB (at which point iOS kills the app).

So I figured the problem was all those model instances in memory. But turns out it wasn't. The model instances themselves are only taking up a handful of megabytes. The real memory eater is a "leak" in Scanner (as used here in DatabaseDateComponents).

Turns out those scanners are pooping out mallocs like crazy, and expecting someone else to clean up after them. Once the sequence has finished iterating, the mallocs do eventually get cleaned up. But if the result set is large enough it won't get that far, and will instead get killed off by iOS. For my 40k rows result set, I'm seeing around 600MB of wasteful mallocs, then iOS terminating the process before iteration completes.

Basically Scanner is a slob.

I'm currently successfully working around it by using a cached DateFormatter:

internal lazy var dateFormatter: DateFormatter = {
    let formatter = DateFormatter()
    formatter.dateFormat = "yyyy-MM-dd HH:mm:ss.SSS"
    formatter.timeZone = TimeZone(abbreviation: "UTC")
    return formatter
}()
if let dateString = value.storage.value as? String {
    return (column, store.dateFormatter.date(from: dateString))
}

But that obviously isn't a portable solution, because of the hardcoded dateFormat.

While hunting for better solutions I stumbled onto this blog post, which makes use of vsscanf(), which looks super fun! But I haven't had a chance to experiment with it yet.

sobri909 commented 6 years ago

I'm guessing the most efficient solution would be to hand the job over to SQLite, and fetch the datetimes as strftime('%s') then run them through Date(timeIntervalSince1970:). But I couldn't see a way to tidily fit that in. And given that SQLite is seemingly returning datetimes as strings by default (or am I wrong on that?), there might be a good reason why using unix timestamps isn't desired.

groue commented 6 years ago

Hello @sobri909!

Thanks for this feedback.

given that SQLite is seemingly returning datetimes as strings by default

Not exactly.

SQLite has no concept of dates. It can store int64, doubles, strings, blobs, and nulls. It will interpret certain string and numeric values as dates when told to do so (with the strftime function, for example).

GRDB stores Swift Date as strings, not timestamps, because it was the best default format. But you can still store dates as timestamps. GRDB natively decodes Dates from timestamps with Date(timeIntervalSince1970:):

let date = Date()           // 13 Apr 2018 at 13:58
let row = try DatabaseQueue().inDatabase { db in
    try Row.fetchOne(db, """
        SELECT ? AS str, ? AS timestamp
        """, arguments: [date, date.timeIntervalSince1970])!
}

row["str"]                  // "2018-04-13 11:58:32.117"
row["str"] as Date          // 13 Apr 2018 at 13:58
row["timestamp"]            // 1523620729.02897
row["timestamp"] as Date    // 13 Apr 2018 at 13:58

Back to Scanner :-)

We use a Scanner in order to convert date and date components from strings of various formats:

You could surely let SQLite turn those strings into timestamps, and have GRDB decode doubles into date with Date(timeIntervalSince1970:), as seen above.

But what about fixing this Scanner slob instead?

sobri909 commented 6 years ago

SQLite has no concept of dates. It can store int64, doubles, strings, blobs, and nulls.

Yeah I had a sift through the SQLite docs, and it looks like they store DATETIME as REAL? Which is all fine with me. But it seems to be coming back from the database as TEXT by default, so there's that extra conversion cost.

? AS timestamp

But then I can't be lazy and return everything with SELECT *. Too much work 😂

But what about fixing this Scanner slob instead?

Definitely! I really want to have a go at doing something with vsscanf(), but have run out of time today. It's the first day of Songkran (Thai new year), so I'm being forced to stop working early. Sigh.

If I get a chance I'll have a play around with it tomorrow though. From the sounds of that blog post, it's even more efficient than using a cached DateFormatter. But the trick will be making it work safely with all the different date string formats.

groue commented 6 years ago

Yeah I had a sift through the SQLite docs, and it looks like they store DATETIME as REAL?

Precisely speaking, DATETIME creates an column of numeric affinity.

But it seems to be coming back from the database as TEXT by default.

Yes: numeric affinity does not convert strings into numeric values. Since GRBD stores dates as strings by default (as explained above), SQLite stores strings and fetches strings like "2018-04-13 11:58:32.117".

so there's that extra conversion cost.

When one uses default string storage of dates, yes.

But the trick will be making it work safely with all the different date string formats.

Yeah. Everything happens in DatabaseDateComponents.fromDatabaseValue.

It's the first day of Songkran (Thai new year), so I'm being forced to stop working early

🕺 Chok dee 🎉 !!

sobri909 commented 6 years ago

Aah now I understand. I didn't realise SQLite lets you decide what storage type to use for dates. But reading the SQLite docs here, I see what you mean now.

SQLite does not have a storage class set aside for storing dates and/or times. Instead, the built-in Date And Time Functions of SQLite are capable of storing dates and times as TEXT, REAL, or INTEGER values: ... Applications can chose to store dates and times in any of these formats and freely convert between formats using the built-in date and time functions.

I'd assumed that that choice was SQLite's alone. SQLite's internals are different from what I'm used to, so there's some surprises!

Anyway, your reasoning behind choosing TEXT seems perfectly sane to me, so I won't change that default in my project. I'll instead have a go at using the vsscanf() parser from here and wedging it into DatabaseDateComponents. If I have any success, I'll do a PR 😄

groue commented 6 years ago

Lovely!

sobri909 commented 6 years ago

I've finished up the initial vsscanf() based parser: SQLiteDateParser. (It's temporarily hooked up to DatabaseDateComponents like thus). But haven't had a chance to properly quantify its performance yet.

It's definitely solving the memory blowout problem. And appears to be very fast. But I'm curious to see some numbers on exactly how fast it is, in comparison. The memory issue was the one I needed to solve, but knowing whether it's faster/slower, and by how much, is important too! So I'll get some proper numbers together tomorrow.

groue commented 6 years ago

That's really great, @sobri909. http://jordansmith.io/performant-date-parsing/ makes it sound like vsscanf will bring much improvement 👍

You are already ready for a PR on the development branch. We'll remove useScanfStrategy, of course. And I'll certainly challenge you on the mutex, and on the allocations that look like we should at least have a look at withUnsafeMutablePointer(to:_:).

sobri909 commented 6 years ago

And I'll certainly challenge you on the mutex

Heh. The mutex is necessary for thread safety, because it's reusing the same static pointers for each call, for memory / performance reasons.

Without the mutex, competing threads will clobber each other and get each other's partial dates. (In testing, before I added the mutex, I was seeing dates coming back with mismatched dates/times from competing threads).

I initially wanted to use os_unfair_lock(), but that didn't become available until iOS 10, and in retrospect would've been a bad choice anyway, because access to the parser really should favour fairness over performance. And the performance gain of unfair locks would be negligible for this in most cases anyway, I think.

, and on the allocations that look like we should at least have a look at withUnsafeMutablePointer(to:_:).

Interesting! I wasn't familiar with that method, so did some reading of the docs just now. It sounds like it can't be used in this case because of "Do not store or return the pointer for later use." Although I'm not fully understanding the purpose of the method yet, so I'm probably misinterpreting that sentence.

After cleaning up some other unrelated performance bottlenecks this morning I was able to do some semi reliable performance testing. I did two runs with the vsscanf() strategy and one with the Scanner strategy. The results look good! So I'll clean out the temporary code and make a PR.

vsscanf(), run 1

[totalRowsTimed: 224459, rowTimings.mean: 0.001223]

Memory consumed: 143 MB

vsscanf(), run 2

[totalRowsTimed: 137277, rowTimings.mean: 0.002699]

Memory consumed: 101 MB

Scanner

[totalRowsTimed: 29357, rowTimings.mean: 0.005303]

Memory consumed: 627 MB (terminated)

Notes

sobri909 commented 6 years ago

I should add that I did those timing runs in an unreleased version of Arc App, not in a controlled example project. So the timings and memory consumption for both strategies were padded by other things the app was doing.

(The timed queries were all against the LocomotionSample table, which has two DATETIME fields.)

I also suspect the third strategy of using a bunch of cached DateFormatters would perform reasonably well too. When I tried a cached DateFormatter last week it completely solved the memory consumption problem. Although I didn't do any timings to see whether it was more time efficient or not.

But using that strategy would be a pain in the arse, because it would require keeping a separate cached formatter for every supported date/time string format (ie 10 separate DateFormatters). Ugly!

groue commented 6 years ago

Follow up in #334