isomorphic-git / lightning-fs

A lean and fast 'fs' for the browser
MIT License

Potential for significant perf improvements in large repos #53

Open JacksonKearl opened 3 years ago

JacksonKearl commented 3 years ago

First, thanks for your work on this project 🙂

The current implementation is fairly slow with large repos, for instance vscode, which has around 5,000 files, or typescript, which has around 50k. It takes about a minute to clone vscode with --singleBranch and --depth 1, and it doesn't manage to clone typescript in the ~15 minutes I waited.
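For reference, this is roughly the kind of clone being timed, a minimal sketch assuming the usual isomorphic-git + lightning-fs wiring (the repository URL and CORS proxy here are just illustrative choices):

import * as git from "isomorphic-git"
import http from "isomorphic-git/http/web"
import FS from "@isomorphic-git/lightning-fs"

const fs = new FS("bench")

// Shallow, single-branch clone of a large repo into the browser fs.
await git.clone({
  fs,
  http,
  dir: "/vscode",
  url: "https://github.com/microsoft/vscode",
  corsProxy: "https://cors.isomorphic-git.org", // illustrative proxy choice
  singleBranch: true,
  depth: 1,
})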

By adding batching to the IndexedDB writes (putting all writes into a single transaction rather than one transaction per file) and changing the autoinc in CacheFS to increment a counter rather than search for the highest inode (the search means writing N files takes O(N^2) time), I am able to see vscode clone in ~20 seconds and typescript clone in about 2 minutes. This is approximately 3x slower than native for vscode and 6x slower than native for typescript.

Batching:

diff --git a/idb-keyval.ts b/idb-keyval.ts
index 45a0d97..94920ef 100644
--- a/idb-keyval.ts
+++ b/idb-keyval.ts
@@ -2,10 +2,12 @@ export class Store {
   private _dbp: Promise<IDBDatabase> | undefined;
   readonly _dbName: string;
   readonly _storeName: string;
+  readonly id: string

   constructor(dbName = 'keyval-store', readonly storeName = 'keyval') {
     this._dbName = dbName;
     this._storeName = storeName;
+    this.id = `dbName:${dbName};;storeName:${storeName}`
     this._init();
   }

@@ -44,6 +46,31 @@ export class Store {
   }
 }

+class Batcher<T> {
+  private ongoing: Promise<void> | undefined
+  private items: { item: T, onProcessed: () => void }[] = []
+
+  constructor(private executor: (items: T[]) => Promise<void>) { }
+
+  private async process() {
+    const toProcess = this.items;
+    this.items = [];
+    await this.executor(toProcess.map(({ item }) => item))
+    toProcess.map(({ onProcessed }) => onProcessed())
+    if (this.items.length) {
+      this.ongoing = this.process()
+    } else {
+      this.ongoing = undefined
+    }
+  }
+
+  async queue(item: T): Promise<void> {
+    const result = new Promise<void>((resolve) => this.items.push({ item, onProcessed: resolve }))
+    if (!this.ongoing) this.ongoing = this.process()
+    return result
+  }
+}
+
 let store: Store;

 function getDefaultStore() {
@@ -58,10 +85,17 @@ export function get<Type>(key: IDBValidKey, store = getDefaultStore()): Promise<
   }).then(() => req.result);
 }

+const setBatchers: Record<string, Batcher<{ key: IDBValidKey, value: any }>> = {}
 export function set(key: IDBValidKey, value: any, store = getDefaultStore()): Promise<void> {
-  return store._withIDBStore('readwrite', store => {
-    store.put(value, key);
-  });
+  if (!setBatchers[store.id]) {
+    setBatchers[store.id] = new Batcher((items) =>
+      store._withIDBStore('readwrite', store => {
+        for (const item of items) {
+          store.put(item.value, item.key)
+        }
+      }))
+  }
+  return setBatchers[store.id].queue({ key, value })
 }

 export function update(key: IDBValidKey, updater: (val: any) => any, store = getDefaultStore()): Promise<void> {

Counter:

diff --git a/src/CacheFS.js b/src/CacheFS.js
index ed26c57..0dc6950 100755
--- a/src/CacheFS.js
+++ b/src/CacheFS.js
@@ -5,6 +5,7 @@ const STAT = 0;

 module.exports = class CacheFS {
   constructor() {
+    this._maxInode = 0
   }
   _makeRoot(root = new Map()) {
     root.set(STAT, { mode: 0o777, type: "dir", size: 0, ino: 0, mtimeMs: Date.now() });
@@ -38,16 +39,7 @@ module.exports = class CacheFS {
     return count;
   }
   autoinc () {
-    let val = this._maxInode(this._root.get("/")) + 1;
-    return val;
-  }
-  _maxInode(map) {
-    let max = map.get(STAT).ino;
-    for (let [key, val] of map) {
-      if (key === STAT) continue;
-      max = Math.max(max, this._maxInode(val));
-    }
-    return max;
+    return ++this._maxInode;
   }
   print(root = this._root.get("/")) {
     let str = "";

Please let me know if you'd consider incorporating these changes... The batching should be safe. I'm not entirely sure about the autoinc, but I don't see a reason why it would cause issues (the main difference is that deleting a file would free up its inode value in the original implementation but doesn't here, which shouldn't be a problem AFAIK).
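To make that difference concrete, here is a minimal hypothetical model of the two strategies (not the actual CacheFS code), showing that only the original scan-based approach can hand out a deleted file's inode again:

// Old strategy: next inode = highest inode currently in the tree + 1,
// so deleting the highest-numbered file frees its inode for reuse.
function nextInodeByScan(liveInodes: Set<number>): number {
  return Math.max(0, ...liveInodes) + 1
}

// New strategy: a counter that only moves forward, so inodes are never reused.
class InodeCounter {
  private max = 0
  next(): number { return ++this.max }
}

const live = new Set([1, 2, 3])
live.delete(3)                      // "rm" the file holding inode 3
console.log(nextInodeByScan(live))  // 3 again (reused)

const counter = new InodeCounter()
counter.next(); counter.next(); counter.next()
console.log(counter.next())         // 4 (deletions never free inodes here)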

JacksonKearl commented 3 years ago

cc @wmhilton

ncortines commented 3 years ago

I'm in the same boat as @JacksonKearl. isomorphic-git is very fast and has great potential. However, I can see how the fs layer quickly becomes a bottleneck as things scale up. These are the resulting times from executing git.clone against repositories of different sizes using isomorphic-git + lightning-fs, on Chrome and Firefox, as well as from the terminal console:

Number of files | Chrome | Firefox | Terminal
1000            | 19s    | 5s      | 1s
10000           | 185s   | 102s    | 4s

I can see there is a separate effort (#28) to support a native fs in the future, but that feature still seems to be in quite an early phase of adoption.

I can understand that the concept of batching write operations does kind of break the fs contract, as it exposes the fact that the implementation is backed by a database. However, I can see how this issue could easily become a deal breaker for many.

@wmhilton , I'd love to hear your opinion and suggestions.

Thanks, Juan

JacksonKearl commented 3 years ago

I can understand that the concept of batching write operations does kind of break the fs contract, as it exposes the fact that the implementation is backed by a database.

I don't see how this violates the fs contract. The contract only says that once I've called writeFile, I get a promise, and when that promise resolves I can read the file. It doesn't require anything about the backend.
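For example, a minimal sketch of that contract, assuming the lightning-fs promises API (the file path and content are arbitrary):

import FS from "@isomorphic-git/lightning-fs"

const fs = new FS("demo")

// However the backend commits this (alone or folded into a batched
// transaction), once the promise resolves the file is readable...
await fs.promises.writeFile("/hello.txt", "hi", { encoding: "utf8" })

// ...so this read is guaranteed to see it.
const text = await fs.promises.readFile("/hello.txt", { encoding: "utf8" })
console.log(text) // "hi"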

ncortines commented 3 years ago

@JacksonKearl I apologise if I haven't fully understood your proposal, but I understood the idea was to wrap several file write operations in a single IDB transaction, rather than having a separate transaction for each file. In a pure fs implementation you wouldn't have to wait for a batch of file write operations to complete in order to start reading from (let's say) the first file in the batch. I don't know if this is relevant to the isomorphic-git implementation, but perhaps it is a reason why it is the way it is today.

JacksonKearl commented 3 years ago

I don't know what you mean by "pure" here. Any asynchronous filesystem (which is pretty much all of them, apart from purely in-memory ones) doesn't guarantee being able to read until the write promise resolves. That's kind of the whole point, really. That property is preserved here.

ncortines commented 3 years ago

I was referring to reading from a file already saved in the context of the current transaction, rather than the file currently being written. By wrapping all file write operations in a single transaction, the overall time required to write all the files is significantly reduced, but the time before you can read the first file in the batch is significantly increased. It's a trade-off. Operations like clone would surely benefit from this; I'm not sure about others.

billiegoose commented 3 years ago

@JacksonKearl That sounds really promising for speeding up git checkout! Have you got a branch I could look at?

Off the top of my head, I don't see how you could implement writeFile to do write batching universally, without adding a debounce that would slow down writeFile in the naive case (for (const file of files) { await writeFile(file, 'asdf') }).

JacksonKearl commented 3 years ago

@wmhilton No, unfortunately we opted for a different route (downloading via archive endpoints) and are no longer using this project.

The code sample I included in the original post (and below) has an implementation of batching with no extra cost for single operations. Basically, you send operations off immediately if there are none in progress, or batch them together if some are in progress and send them all off when the in-progress batch finishes. It's the kind of thing you'd think the DB would do for you, but I guess not yet.

It's interesting to note that in my testing, when firing a sequence of N single put requests, the first one won't resolve until the last one has been fired, so there is still some sort of batching going on, just much less efficiently than grouping all changes into a single transaction.

The Batcher class:

class Batcher<T> {
  // The batch currently being written, if any.
  private ongoing: Promise<void> | undefined
  // Items queued while a batch is in flight; they become the next batch.
  private items: { item: T, onProcessed: () => void }[] = []

  constructor(private executor: (items: T[]) => Promise<void>) { }

  private async process() {
    // Take everything queued so far and run it as a single batch.
    const toProcess = this.items;
    this.items = [];
    await this.executor(toProcess.map(({ item }) => item))
    // Resolve each caller's promise now that its item has been written.
    toProcess.map(({ onProcessed }) => onProcessed())
    // Anything queued while we were writing becomes the next batch.
    if (this.items.length) {
      this.ongoing = this.process()
    } else {
      this.ongoing = undefined
    }
  }

  async queue(item: T): Promise<void> {
    const result = new Promise<void>((resolve) => this.items.push({ item, onProcessed: resolve }))
    // If nothing is in flight, start processing immediately (no added latency).
    if (!this.ongoing) this.ongoing = this.process()
    return result
  }
}

(this is roughly inspired by the Throttler class used extensively internally in vscode)
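For completeness, a hypothetical usage sketch that wires the Batcher to a raw IndexedDB object store (the database and store names here are made up; the real integration is the idb-keyval patch above). Every put queued while a transaction is in flight gets folded into the next single transaction:

const dbp = new Promise<IDBDatabase>((resolve, reject) => {
  const req = indexedDB.open("demo-db", 1)
  req.onupgradeneeded = () => req.result.createObjectStore("files")
  req.onsuccess = () => resolve(req.result)
  req.onerror = () => reject(req.error)
})

const putBatcher = new Batcher<{ key: IDBValidKey; value: unknown }>(async (items) => {
  const db = await dbp
  await new Promise<void>((resolve, reject) => {
    // One readwrite transaction for however many puts accumulated.
    const tx = db.transaction("files", "readwrite")
    const store = tx.objectStore("files")
    for (const { key, value } of items) store.put(value, key)
    tx.oncomplete = () => resolve()
    tx.onerror = () => reject(tx.error)
  })
})

// 1000 concurrent writes collapse into a couple of transactions instead of 1000.
await Promise.all(
  Array.from({ length: 1000 }, (_, i) => putBatcher.queue({ key: `file-${i}`, value: i }))
)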

ncortines commented 3 years ago

Another thing to watch out for is memory footprint, especially when dealing with fast-emitting sources.

I think adoption of the File System Access API (#28), plus support for writable streams, would bring an important leap in performance.

I would personally put my efforts on that front :)
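As a rough illustration of that direction, here is a hypothetical sketch of a streaming write on top of the origin-private file system via the File System Access API (none of this is lightning-fs API today; the function and file names are made up):

// Pipe a stream straight into a file handle; backpressure keeps the
// memory footprint bounded instead of buffering the whole payload.
async function streamToFile(name: string, source: ReadableStream<Uint8Array>) {
  const root = await navigator.storage.getDirectory()
  const handle = await root.getFileHandle(name, { create: true })
  const writable = await handle.createWritable()
  await source.pipeTo(writable) // closes the writable when the source ends
}

// e.g. streaming a fetched packfile to disk instead of holding it in memory
const res = await fetch("https://example.com/big.pack")
if (res.body) await streamToFile("big.pack", res.body)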