pofat closed this issue 2 years ago.
Hey @pofat, I saw you put up the PR for issue 3, thanks for that, I really appreciate it!
I'm definitely looking for performance suggestions; it's actually what I hope to spend time on over the next week, and it's admittedly not my area of expertise. I'm first going to do some more well-defined profiling (rather than just using it in apps and seeing that it works), and then explore some potential changes that I'd be curious to get your feedback on.
> replay when you first subscribe to `items`

Any chance you could expand on that?
If I'm understanding correctly, we'd maintain both `oldItems` and `items`, so when the user adds an item to `items`, we would find the diff between `oldItems` and `items` to decide what to write to disk? If that's right, my only concern is that this would double the memory footprint, but maybe that's not something to worry about? And if that's not right, please let me know.

The other ideas I've been considering (which didn't feel necessary for launch, since I can iterate on them [at least mostly] without breaking changes) and would love your perspective on are:
A) Using an `OrderedDictionary` rather than an array for the underlying Store, since they tend to have better performance characteristics. It seems like your change to use `removeDuplicates()` may have obviated that need though, since one of the reasons I was looking to do that was to remove duplicates, but perhaps there's still better performance to be had?
B) Changing the Store from a class to an actor. I believe I'm already going to have to make the initializer `async`, so in that case I was thinking of exploring turning it into an actor, though I haven't thought through all of the side effects yet. (No pun intended.)
C) Moving the `$items` Combine binding via `.sink` to an `AsyncStream`. This was a suggestion from @stephencelis, because "If a user is serializing to disk a lot, if a Task is serializing to disk back-to-back, the first may win; two Tasks that are executed at the same time can happen in any order." Even though I'm not very familiar with `AsyncStream` yet, this resonates with me a lot and is something I want to explore, but I'm also curious to hear what you think, and figured it may be good to document in this issue about performance considerations.
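For what it's worth, here's a rough sketch of what C) might look like using `Publisher.values` (iOS 15+/macOS 12+), which bridges a Combine publisher into an `AsyncSequence`. This is a standalone illustration, not Boutique's actual code; `persist(_:)` is a placeholder for the real persistence call:

```swift
import Combine

@MainActor
final class DiskWriter<Item: Codable & Equatable> {
    @Published var items: [Item] = []
    private var persistTask: Task<Void, Never>?

    func startPersisting() {
        // `values` exposes the publisher as an AsyncSequence (iOS 15+).
        let updates = $items.removeDuplicates().values
        persistTask = Task {
            for await snapshot in updates {
                // The next write doesn't start until the previous one
                // finishes, so back-to-back writes are serialized in order.
                try? await persist(snapshot)
            }
        }
    }

    private func persist(_ items: [Item]) async throws {
        // write `items` to disk (placeholder)
    }
}
```

Because a single `for await` loop consumes the sequence, two writes can never race each other, which is exactly the concern @stephencelis raised about detached Tasks.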
Would love to hear what you think, and very much appreciate all of the suggestions! Please keep 'em coming. 🙂
Hey, appreciate the fast response! There are a lot of topics in it, so let me group them into different sections.
## Redundant write at launch

When there is persisted data, Boutique will write the same files again at launch, even without any change. After more tracing, I found out that it is caused by:
```swift
// Store.swift
self.$items
    .removeDuplicates()
    .sink(receiveValue: { items in
        Task {
            try await self.persistItems(items)
        }
    })
    .store(in: &cancellables)

Task { @MainActor in
    self.items = await self.allPersistedItems()
}
```
This assignment happens after the subscription (`.sink`) to the items. However, switching the order of these two steps doesn't help: when subscribing (`.sink`), the closure receives the current value. That seems to be how `@Published` works; I'm not sure if there's a way to avoid it.

In short, for every store init, Boutique reads all persisted items (via `func allPersistedItems() async -> [Item]`) and then immediately writes all of them back.
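To illustrate the replay behavior with a standalone example (not Boutique code): subscribing to a `@Published` property delivers its current value immediately, before any new assignment happens.

```swift
import Combine

final class Demo {
    @Published var items: [Int] = [1, 2, 3]
}

let demo = Demo()
var cancellables = Set<AnyCancellable>()

demo.$items
    .sink { items in
        // Fires immediately with [1, 2, 3], before anything is assigned.
        // This is why persistence runs once at launch, with no change.
        print("received:", items)
    }
    .store(in: &cancellables)
```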
## Extra write for every operation
For 2, I think the most obvious issue is that every **operation on the items causes every element to be written back to file again**, even the unchanged ones. This cannot be solved by `removeDuplicates()`, and is mainly caused by the persistence side effect being tied to `items` as a whole.
I did a very rough observation with the MVCS-Demo app: a single deletion/addition could lead to 100+ MB/s of IO. (I had favorited 30 images.)
![2022-06-24 11 57 44](https://user-images.githubusercontent.com/4652318/175525257-a4ba033e-d857-47e6-bb38-e7695b03fdfa.gif)
If you agree this is an issue, writing/removing only the diffed items might be a feasible approach.

Adding an `oldItems` property is the easiest (if brute-force) way to implement it. If you're worried about the memory footprint, you can keep just the "diff" instead of a second full array. And you already have the logic for diffing: while adding new item(s), you update a copy of the items, `currentItems`.
```swift
// Store.swift
public func add(_ items: [Item], invalidationStrategy: CacheInvalidationStrategy<Item> = .removeNone) async throws {
    var currentItems: [Item] = await self.items
    try await self.removePersistedItems(strategy: invalidationStrategy)
    self.invalidateCache(strategy: invalidationStrategy, items: &currentItems)

    // update currentItems... (diffing here!)

    // We can't capture a mutable array (currentItems) in the closure below, so we make an immutable copy.
    // An implicitly captured closure variable is captured by reference, while
    // a variable listed in the capture group is captured by value.
    await MainActor.run { [currentItems] in
        self.items = currentItems
    }
}
```
These removals, insertions, and updates can be recorded in some data structure; then you only have to do file operations based on that structure.
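As one possible shape for that structure, the standard library's `CollectionDifference` (via `difference(from:)`, available since Swift 5.1) records exactly the inserts and removes. A toy sketch, with strings standing in for items:

```swift
let oldItems = ["a", "b", "c"]
let newItems = ["a", "c", "d"]

// difference(from:) computes the minimal set of changes between the two
// snapshots; only these changes need file operations.
let diff = newItems.difference(from: oldItems)
for change in diff {
    switch change {
    case .insert(_, let element, _):
        // write only this element to disk
        print("persist:", element)
    case .remove(_, let element, _):
        // delete only this element's file
        print("delete:", element)
    }
}
// For this input, the diff is: remove "b", insert "d".
```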
I guess you mentioned `OrderedDictionary` to find a better way of locating unique elements (`func uniqueElements(matching keyPath: KeyPath<Element, String>) -> [Element]`).
IdentifiedArray (by @pointfreeco) came to mind because it combines the pros of both a dictionary and an array. You also use `Identifiable` as the cache key; I haven't tried it yet, but I have a feeling they could make a good fit.
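For context, the main win of an id-keyed structure like `IdentifiedArray` is O(1) lookup and dedup by `id`, versus scanning the whole array. A plain-dictionary toy example of that property (not `IdentifiedArray` itself, and `CachedItem` is a made-up type for illustration):

```swift
struct CachedItem: Identifiable, Equatable {
    let id: String
    var value: Int
}

// Duplicates by id collapse in O(1) per insert, instead of an O(n) scan
// like uniqueElements(matching:) has to do over a plain array.
var byID: [CachedItem.ID: CachedItem] = [:]
for item in [CachedItem(id: "a", value: 1),
             CachedItem(id: "a", value: 2),
             CachedItem(id: "b", value: 3)] {
    byID[item.id] = item   // a later duplicate overwrites the earlier one
}
// byID["a"]?.value == 2, byID["b"]?.value == 3
```

`IdentifiedArray` additionally preserves insertion order, which a plain dictionary does not.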
For B) and C), sorry, I don't know much about Swift Concurrency yet, so I'll leave those questions for other folks. Do you have a link to the `AsyncStream` suggestion? It would be better to have more context.
> This assignment happens after the subscription (`.sink`) to the items. However, switching the order of these two steps doesn't help. When subscribing (`.sink`), the closure will receive the current value. That seems to be how `@Published` works. Not sure if there's a way to avoid that.
I ran into this in a past project and ended up relying on this behavior, using `dropFirst()` in the pipeline to ignore that first assignment. I'm not certain it's appropriate here, but it's worth trying.
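Concretely, that would just mean adding one operator to the `Store.swift` pipeline quoted earlier (a sketch in that same context, not a standalone snippet):

```swift
// Store.swift (sketch)
self.$items
    .dropFirst()          // skip the value replayed on subscription,
                          // so launch doesn't trigger a full rewrite
    .removeDuplicates()
    .sink(receiveValue: { items in
        Task { try await self.persistItems(items) }
    })
    .store(in: &cancellables)
```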
Good news: we don't need any tricks like depending on `dropFirst()` or `removeDuplicates()`. The answer is much simpler, and I don't even have to use Combine anymore. I just created a pull request (#22) addressing the performance issues, and the improvements are dramatic, to say the least. Loading a Store with 8,000 items is 8-10x faster, down from 15 seconds to between 1-2 seconds on cold starts. Adding or removing items no longer carries a performance penalty because I'm no longer rewriting every item every time an operation occurs (duh), so an operation only takes as long as writing those items to disk and to memory, both of which are very, very quick.
I'll leave this issue up for a little bit in case anyone has any comments, but also feel free to check out #22 if you're interested or want to provide some feedback.
Last but not least, thank you @pofat for the great suggestions and documentation of the problems, it was immensely helpful for me in figuring out how to improve Boutique's performance.
Merged in #22, thanks again for all the helpful suggestions!
This library is truly a nice and simple solution for a lightweight store. However, I noticed some potential performance problems. (It might be a little early for this topic, but I saw you mention that this is used in production.)
`self.$items.add([])`. You will find all files are written again even if there's no change.

I can come up with a quick PR for 3. For 1, I guess it's the replay when you first subscribe to `items`; it needs more investigation. For 2, I propose another property for the old items and a diff algorithm so that only the diffed items are persisted.

What do you think?