minht11 / local-music-pwa

Lightweight on device music player PWA.
https://snaeplayer.com/
MIT License
154 stars 20 forks source link

Poor performance when checking if track already exists #8

Open lennyanders opened 3 years ago

lennyanders commented 3 years ago

Currently, when adding new tracks to the library, each tracks file is compared to each of the already existing tracks files. isSameEntry is used for this and is the only by the browser provided method I know to do this. Sadly, this is really slow. When I compare ~3500 files to the same ~3500 files, it takes around 13 minutes. On top of that, the added files get first parsed and only after that get compared to the existing tracks. I would say that, at the time of importing (selecting a directory) the files should be checked, so that after that only the new files get parsed.

For improving the main problem, I have an idea, but I don't know if it will work yet. I'm open to other ideas. My plan would be, to save the path of each file, relative to the selected directory, and map that to their corresponding FileSystemHandle (or a custom object containing that somewhere). Then all the path mappings should be saved in the indexed db. Also, the directly imported directories need to be saved in the db and should have some kind of relation to the file path mappings. After that, when importing the same folder again, we could check if that newly imported folder exists in our database and could the compare the newly generated file path mappings to the one belonging to the existing directory in the db.

When selecting a directory that is a parent of a directory in the db, we could check the difference in file path mappings from the point we get to the directory that already is in the db. This would mean that we needed to check every sub-directory when importing against our saved import directories. This should still be way more performant than the current implementation, but is not ideal.

I also don't have a good idea how to handle an import of a sub-directory of an import directory that is already in the db. We could probably store all scanned directories in the db, and compare every directory to every directory, but I think performance would get with a relatively big amount of directories pretty unperformant too.

The first step would be to implement just the path mapping from the imported folder and use my hopefully performant concept when the exact same directory is used again, and fall back to the current method if not.

I would be interested in implementing, or helping with the implementation of, this. I will try it nonetheless and will probably share a CodePen or something here, if my plan works out.

I hope everything I wrote here makes sense.

minht11 commented 3 years ago

So I made quick test, open file, rename it and try to filehandle.getFile(). It fails saying A requested file or directory could not be found at the time an operation was processed. That's a good thing, because that means we can ignore possibility of renaming.

FileSystemHandle has a name property which is the exact same as filename, so if we want to compare two files we only need use isSameEntry on ones which have exact same names and ignore the rest. That should dramatically improve performance, could you test if that's true?

lennyanders commented 3 years ago

That takes it down from 13 to 1.5 minutes. I would still like to try my idea. But this little change can be implemented nonetheless and is a quick win. It would even improve my first implementation, where I would fall back to the old method in many scenarios.

minht11 commented 3 years ago

On top of that, the added files get first parsed and only after that get compared to the existing tracks. I would say that, at the > time of importing (selecting a directory) the files should be checked, so that after that only the new files get parsed.

Originally new files data would overwrite old ones, but that got lost when I changed things, since files rarely change ignoring existing files seems like a good idea they definitely shouldn't be parsed again.

On path mapping, I didn't even know that relative file paths were exposed and I am certainly not opposed on seeing it implemented as a prototype.

Also comparing logic should probably move to the worker thread, so it doesn't block main thread as much. And only new tracks are sent to the main thread. But that's for the future.

lennyanders commented 3 years ago

Originally new files data would overwrite old ones, but that got lost when I changed things, since files rarely change ignoring existing files seems like a good idea they definitely shouldn't be parsed again.

For that, I would store File.lastModified in the track data and compare that on second import.

On path mapping, I didn't even know that relative file paths were exposed and I am certainly not opposed on seeing it implemented as a prototype.

It doesn't, but you can write something that concatenates the names of the folders and files while iterating over them. I wrote a little function that just does that, I did not start saving that data somewhere and comparing it to something, but that's the idea:

interface SnaeFile {
  rootDirectoryHandle: FileSystemDirectoryHandle;
  directoryHandle: FileSystemDirectoryHandle;
  fileHandle: FileSystemFileHandle;
}

const getAllFileHandlesFromDirectory = async (
  directoryHandle: FileSystemDirectoryHandle,
  path = directoryHandle.name,
  rootDirectoryHandle = directoryHandle,
) => {
  const fileSystemHandles: Record<string, SnaeFile> = {};
  for await (const fileSystemHandle of directoryHandle.values()) {
    if (fileSystemHandle.kind === 'directory') {
      Object.assign(
        fileSystemHandles,
        await getAllFileHandlesFromDirectory(
          fileSystemHandle,
          `${path}/${fileSystemHandle.name}`,
          rootDirectoryHandle,
        ),
      );
    } else {
      fileSystemHandles[`${path}/${fileSystemHandle.name}`] = {
        rootDirectoryHandle,
        directoryHandle,
        fileHandle: fileSystemHandle,
      };
    }
  }
  return fileSystemHandles;
};

Also comparing logic should probably move to the worker thread, so it doesn't block main thread as much. And only new tracks are sent to the main thread. But that's for the future.

Yeah, even saving things to the indexed db could be done in a worker. Maybe an extra SharedWorker just for accessing the database would be a good idea.

minht11 commented 3 years ago

Originally new files data would overwrite old ones, but that got lost when I changed things, since files rarely change ignoring existing files seems like a good idea they definitely shouldn't be parsed again.

For that, I would store File.lastModified in the track data and compare that on second import.

To access file from FileHandle you need to get user permission first each time tab is opened, that doesn't seem worth given that files rarely change, I can't imagine user editing metadata of mp3 file that often.

In MDN there is example of directoryHandle.resolve(handle) I am not quite sure what it does but seems to return relative paths of given file handle. That probably is still slow, and your approach is likely faster.

Yeah, even saving things to the indexed db could be done in a worker. Maybe an extra SharedWorker just for accessing the database would be a good idea.

We can't use SharedWorker because Safari unlikely to implement it any time soon, and making it work otherwise doesn't seem worth it.

Currently all data is managed inside SolidJS createStore, IndexedDB is only called when data inside that store changes. Basically a redux. Doing the reverse is tricky, indexedDB doesn't provide a way to subscribe to changes, so you still need notify main thread manually, with new data.

minht11 commented 3 years ago

@lennyanders could you check by how much https://github.com/minht11/local-music-pwa/pull/9 improves things? Moving diffing to web worker and using relative paths for comparison are still thing I would want to do.

lennyanders commented 3 years ago

I left a comment in the PR.

I also have a test now for my method, and it already works really well: chrome-fast-fs-diffing The diffing only takes a few milliseconds, and I also sped up the initial file reading a bit. I still need to figure out how ta handle a case where I select a directory and the directory is a parent directory of two or more directories, that already got parsed.

One thing that would be good for this to work properly, is some kind of list with the selected directories in the settings/library page. Just a simple list with the selected directories, with an x icon or something to remove each directory. This would bring the Player also more close to other "traditional" music players.

minht11 commented 3 years ago

I really like the approach of saving directoryHandle together with the file. Maybe directoryHandle.resolve() could be used to check if one directory is parent of the other?. That should be relatively cheap since there shouldn't be that many user added directories.

One thing that would be good for this to work properly, is some kind of list with the selected directories in the settings/library page.

I had that at the beginning, when moving from Electron to web app. I thought it was bit confusing since you could only can show directory names, since paths are unavailable, so what if you add folders with same name:

  • Music
  • Music

How do you know which is which? Maybe showing track count in each folder could help. I wish web supported open this folder api, but besides that I am in favor on adding this.

lennyanders commented 3 years ago

I really like the approach of saving directoryHandle together with the file. Maybe directoryHandle.resolve() could be used to check if one directory is parent of the other?. That should be relatively cheap since there shouldn't be that many user added directories.

When you look into my repo, you can see that I already use directoryHandle.resolve() (thanks for finding that cool function btw). i only need to check the case where a new directory is a parent of multiple directories. I have it working for one. I also added directory handle to each file so I could search with that for an album coverm named cover.jp or cover.png, in the directory the tracks of an album are in. (at least in my player at sometime, but I guess it's in general a good functionality provided by many players)

I had that at the beginning, when moving from Electron to web app. I thought it was bit confusing since you could only can show directory names, since paths are unavailable, so what if you add folders with same name:

Brilliant how you had the same thoughts as me, just way earlier than me.

How do you know which is which? Maybe showing track count in each folder could help.

This is also the only thing I had in mind and I think it's better than not having a list of directories and not having the ability to re-scan the selected directories.

minht11 commented 3 years ago

Opps didn't mean to close this. While performance is good enough for now it still can be improved.

Sorry I only briefly looked at your repo before. The whole approach looks good.

The cover art thing would be a separate discussion at the later time, but yeah your after these changes it would be hard to scan for artwork files.. I wonder even if app itself should store artwork as hidden files on the filesystem instead of in indexedDB.

lennyanders commented 3 years ago

I updated my repo now, so if you select multiple directories on one layer and then select a parent directory of these, the old filenames get updated and than everything gets scanned again and I can check which files are new and also which are removed.

minht11 commented 3 years ago

For the most part everything looks good and I think it's worth doing it. Storing relative paths as ids is great idea.

Right now I have pending moderate code and ui design changes which impact a lot places. So maybe next week, after that is done, we can start looking into how to make it work here.

lennyanders commented 1 year ago

Hey,

I finally looked a bit more into this, and implemented checking for changed files too. It was pretty easy too, I just save the lastModified date in the indexed db, and after a new check I check for all files if it has changed. This can be seen here: https://github.com/lennyanders/euphonium/blob/4a6adc1293ccbc5f5e069c52956b8ee1d4e49141/src/worker/files/diffFiles.ts#L15

Overall I think my current logic for getting and updating track data is pretty good (I think I can do some smaller optimizations and code reductions but nothing major) and you can look it up here: https://github.com/lennyanders/euphonium/blob/4a6adc1293ccbc5f5e069c52956b8ee1d4e49141/src/worker/library.ts#L78

I guess that would be everything from me. I think it would be really cool if you implement it, but yeah, I build my own thing now, so for me personally, it's not that important anymore.