mergesort / Boutique

✨ A magical persistence library (and so much more) for state-driven iOS and Mac apps ✨
https://build.ms/boutique/docs
MIT License
899 stars 43 forks source link

Some potential performance issues #14

Closed pofat closed 2 years ago

pofat commented 2 years ago

This library is truly a nice and simple solution for a lightweight store. However, I noticed some potential performance problems. (It might be a little early for this topic but I saw you mentioned this is used on production)

  1. When there is persisted data, Boutique will write the same files again at launch. Even without any change.
  2. Files are written again and again. Every change will rewrite all items in the array because you iterate the whole stored array (items).
  3. Unnecessary file rewrite. There's no update interface for @Store now, but you can reproduce with self.$items.add([]). You will find all files are written again even if there's no change.

I can come up with a quick PR for 3. For 1, I guess it's the replay when you first subscribe to items. Need more investigation. For 2, I propose another property for old items and diff algorithm to only persist diff items.

What do you think?

mergesort commented 2 years ago

Hey @pofat, I saw you put up the PR for issue 3, thanks for that, I really appreciate it!

I'm definitely looking for performance suggestions, it's actually what I hope to spend time over the next week on and admittedly is not my area of expertise. I'm first going to do some more defined profiling rather than usage in apps and seeing that it works, and then exploring some potential changes that I'm curious for your feedback.

  1. I'm sorry but I'm not quite sure what you mean by:

    replay when you first subscribe to items

Any chance you could expand on that?

  1. The way I read your suggestion it means we would have two properties, oldItems and items, so when the user adds an item to items, we would find the diff between oldItems and items to decide what to write to disk? If that's right my only concern is that this would double the memory footprint, but maybe that's not something to worry about? And if that's not right please let me know.

The other ideas I've been considering (but didn't feel necessary for launch since I can iterate on them [at least mostly] without breaking changes) and would love your perspective on are:

A) Using an OrderedDictionary rather than an array for the underlying Store, since they tend to have better performance characteristics. It seems like your change to use removeDuplicates() may have obviated that need though, since one of the reasons I was looking to do that was to remove duplicates, but perhaps there's still better performance to be had?

B) Changing the Store from a class to an actor. I believe I'm already going to have to make the initializer async, so in that case I was thinking of exploring turning it into an actor, though I haven't thought through all of the side effects yet. (No pun intended.)

C) Move the $items Combine binding via .sink to an AsyncStream. This was a suggestion from @stephencelis, because "If a user is serializing to disk a lot, if a Task is serializing to disk back-to-back, the first may win, two Tasks that are executed at the same time can happen in any order." Even though I'm not very familiar with AsyncStream yet, this resonates with me a lot and is something I want to explore but am also curious to hear what you think, and figure may be good to document in this issue about performance considerations.

Would love to hear what you think, and very much appreciate all of the suggestions! Please keep 'em coming. 🙂

pofat commented 2 years ago

Hey, appreciate the fast response! There are a lot topics in it. Let me group them into different sections.

Extra write at Store initialization

  1. When there is persisted data, Boutique will write the same files again at launch. Even without any change. After more tracing, I found out that it is caused by :
    
    // Store.swift

self.$items .removeDuplicates() .sink(receiveValue: { items in Task { try await self.persistItems(items) } }) .store(in: &cancellables)

Task { @MainActor in self.items = await self.allPersistedItems() }

This assignment happens after the subscription (.sink) to the items. However, switching the order of these two steps doesn't help. When subscribing (.sink), the closure will receive the current value. That seems to be how `@Published` works. Not sure if it's a way to avoid that. 

In short, for every store init, Boutique reads all persisted items (via `func allPersistedItems() async -> [Item]`) and then writes all of them back immediately. 

## Extra write for every operation

For 2, I think the most obvious issue is that every **operation to the items will cause every element in it to be written back to file again**, even those unchanged. This can not be solved by `removeDuplicate()` and is mainly caused by that the persistent side effect is tied to `items`. 

Did a very rough observation with MVCS-Demo app, a deletion/add could lead to 100+Mb/s IO. (I have favorited 30 images).

![2022-06-24 11 57 44](https://user-images.githubusercontent.com/4652318/175525257-a4ba033e-d857-47e6-bb38-e7695b03fdfa.gif)

If you agree this is an issue, writing/removing diffed items only might be a feasible approach. 
Add an `oldItem` is the easiest but brutal way to implement.
If you worry about memory footprint, you can just keep the "diff", instead of another array.

And you already have the logic for diffing. While adding new item(s), you update a copy of items, `currentItems`. 

```swift
// Store.swfit

public func add(_ items: [Item], invalidationStrategy: CacheInvalidationStrategy<Item> = .removeNone) async throws {
        var currentItems: [Item] = await self.items

        try await self.removePersistedItems(strategy: invalidationStrategy)
        self.invalidateCache(strategy: invalidationStrategy, items: &currentItems)

        // update currentItems... (diffing here!)

        // We can't capture a mutable array (currentItems) in the closure below so we make an immutable copy.
        // An implicitly captured closure variable is captured by reference while
        // a variable captured in the capture group is captured by value.
        await MainActor.run { [currentItems] in
            self.items = currentItems
        }
    }

These remove/insert and updates can be recorded by some data structure. Then you only have to do file operations based on this structure.


A) Choice of data structure

I guess you mentioned OrderedDictionary is to find a better way for finding unique elements (func uniqueElements(matching keyPath: KeyPath<Element, String>) -> [Element]).

Identified array (by @pointfreeco) came into my mind because it contains both pros of dictionary and array. You also use Identifiable as the cache key; I haven't tried it yet but I have a feeling they can make a good fit.

For B) and C), sorry I don't know much about Swift Concurrency yet. Leave these questions for other folks. Do you have the link to the suggestion of AsyncStream? It would be better to have more context.

subdigital commented 2 years ago

This assignment happens after the subscription (.sink) to the items. However, switching the order of these two steps doesn't help. When subscribing (.sink), the closure will receive the current value. That seems to be how @Published works. Not sure if it's a way to avoid that.

I ran into this in a past project and ended up relying on this behavior and using dropFirst() in the pipeline to ignore that first assignment. I'm not certain this is appropriate here, but it's worth trying.

mergesort commented 2 years ago

Good news, we don't need to do any tricks like depending on dropFirst() or removeDuplicates(), the answer is much simpler and I don't even have to use Combine anymore. I just created a pull request (#22) addressing the performance issues, and the improvements are dramatic to say the least. Loading a Store with 8,000 items is 8-10x faster, down from 15 seconds to between 1-2 seconds on cold starts. Adding or removing items no longer has any performance penalty because I'm no longer rewriting every item every time an operation occurs (duh), so it only takes as long as it would to write those items to disk and to memory — both of which are very, very quick.

I'll leave this issue up for a little bit in case anyone has any comments, but also feel free to check out #22 if you're interested or want to provide some feedback.

Last but not least, thank you @pofat for the great suggestions and documentation of the problems, it was immensely helpful for me in figuring out how to improve Boutique's performance.

mergesort commented 2 years ago

Merged in #22, thanks again for all the helpful suggestions!