vrtmrz / obsidian-livesync

MIT License

Issue with large MD files #415

Open firstunicorn opened 5 months ago

firstunicorn commented 5 months ago

Abstract

Sync takes too long and becomes buggy with large Markdown (.md) files.

Expected behavior

Syncing a large file should take almost the same time as syncing a small one.

Actually happened

Reproducing procedure

  1. Configure LiveSync as in the attached settings export: settttings-copy.md
  2. Create a 100-page file with many TODO items ("- [ ]"), ideally also with links, sub-tasks / sub-TODO items, some images, emoji, and so on (a small generator script is sketched after this list).
  3. Make minor changes to the file: add a few new TODOs and modify old ones.
  4. Wait for sync to begin automatically.
  5. Look at the logs. Sync will take a long time.
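As an aside, one quick way to create such a file for step 2 is a small Node/TypeScript script like the one below. This is illustrative only; the file name, section count, and task count are arbitrary.

```typescript
// generate-large-todo.ts - creates a large Markdown file full of TODO items
// to reproduce the slow-sync behaviour. All names and sizes here are arbitrary.
import { writeFileSync } from "node:fs";

const lines: string[] = [];
for (let section = 1; section <= 100; section++) {
  lines.push(`## Section ${section}`);
  for (let task = 1; task <= 50; task++) {
    lines.push(`- [ ] Task ${section}.${task} with a [[link]] and some text`);
    lines.push(`  - [ ] Sub-task ${section}.${task}.1`);
  }
  lines.push("");
}

writeFileSync("TODO.md", lines.join("\n"));
console.log(`Wrote TODO.md with ${lines.length} lines`);
```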

Frequency: constantly

Plug-in log

```
5/1/2024, 7:01:42 PM->OneShot Sync begin... (sync)
5/1/2024, 7:01:43 PM->Replication paused
5/1/2024, 7:01:43 PM->Replication paused
5/1/2024, 7:01:43 PM->Replication completed
5/1/2024, 7:01:49 PM->Chunks saved: doc: TODO.md ,chunks: 2655 (new:2, recycled:2607, cached:46)
5/1/2024, 7:01:54 PM->STORAGE -> DB (plain) TODO.md
5/1/2024, 7:01:54 PM->OneShot Sync begin... (sync)
5/1/2024, 7:01:54 PM->OneShot Sync begin... (sync)
5/1/2024, 7:01:55 PM->OneShot Sync begin... (sync)
5/1/2024, 7:01:57 PM->Replication paused
5/1/2024, 7:01:57 PM->Replication activated
5/1/2024, 7:01:58 PM->Replication paused
5/1/2024, 7:01:58 PM->Replication completed
```

Screenshots

https://i.imgur.com/zFnzCZf.png

Details and background

First of all, thank you so much for your work. This plugin is a life changer for me, and thanks to it I have moved all my note-taking and task management to Obsidian on all my devices. But recently I ran into a bug/drawback using CouchDB: when I have a file with a lot of text (around 100 pages; it is mostly just a long task/TODO list, here is an example: https://i.imgur.com/zFnzCZf.png), syncing becomes slow. By slow I mean really slow: it takes 10-30 seconds to upload any minor change and another 10-30 seconds to fetch it on another machine, and for changes that are more than minor (a few paragraphs) it can take minutes in some cases.

At around 20 pages the file became unusable in LiveSync mode (while you type, every character takes a few seconds to sync), so I switched to "Periodic and on Events" with a 4-second interval and batch syncing turned on, and for a while it worked almost like real-time sync. But after about 60 pages it slowed down again. Periodically removing the CouchDB database on the hoster and creating it again helps, but only for a short period of time. After about 100 pages it became totally unusable, because on top of the slow sync it slows down my laptop (Obsidian becomes super slow, the fans go crazy; when I turn off the Self-hosted LiveSync plugin everything goes back to normal). Not to mention it slows down syncing even for other, smaller files (not as badly, but there is still a huge difference: at the beginning, when there were just a few small files, changes synced almost immediately, and now even small files can take a few seconds, so the difference is hundreds of percent).

I use the AlwaysData hosting free tier to host CouchDB. I think the issue is the number of chunks and the architecture of CouchDB and/or the process of cleaning and optimizing the database, because on a fresh database everything syncs blazingly fast. Will switching to Cloudflare R2 storage solve this issue? Is it even worth trying?

vrtmrz commented 5 months ago

Thank you for your report! This problem is caused by what this part of the log shows: `(new:2, recycled:2607, cached:46)`

To reduce traffic, Self-hosted LiveSync splits documents into multiple chunks and transfers only the new ones. However, checking whether a chunk already exists in the local database is a bit heavy when there are many of them, and worse, it slows down as their number grows.

To avoid this, Self-hosted LiveSync can cache recently used chunks. The initial value is configured rather conservatively, but we should enlarge it in a situation like this.
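As a rough illustration of that trade-off (this is not the plugin's actual code; all names here are made up), the check before transferring a chunk can be thought of as a cache lookup that falls back to the local database, with the cache bounded both by item count and by total characters, matching the two settings mentioned below:

```typescript
// Illustrative sketch only; not the actual Self-hosted LiveSync implementation.
// An in-memory chunk cache bounded both by the number of entries and by the
// total number of cached characters.
class ChunkCache {
  private entries = new Map<string, string>(); // chunk id -> chunk content
  private totalChars = 0;

  constructor(private maxItems: number, private maxChars: number) {}

  has(id: string): boolean {
    return this.entries.has(id);
  }

  put(id: string, content: string): void {
    if (this.entries.has(id)) return; // keep the character count simple
    this.entries.set(id, content);
    this.totalChars += content.length;
    // Evict the oldest entries until both limits are satisfied again.
    while (this.entries.size > this.maxItems || this.totalChars > this.maxChars) {
      const oldestId = this.entries.keys().next().value as string;
      this.totalChars -= this.entries.get(oldestId)!.length;
      this.entries.delete(oldestId);
    }
  }
}

// One plausible reading of the log's counters (an assumption, not confirmed):
// - cached:   the chunk was found in the in-memory cache (cheap)
// - recycled: the chunk already existed in the local database (heavier lookup,
//             and the cost grows with the total number of stored chunks)
// - new:      the chunk did not exist anywhere and must be stored and sent
async function isKnownChunk(
  id: string,
  cache: ChunkCache,
  existsInLocalDb: (id: string) => Promise<boolean>, // hypothetical DB lookup
): Promise<boolean> {
  if (cache.has(id)) return true;
  return existsInLocalDb(id);
}
```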

Please try setting Memory cache size (by total items) to 300000 and Memory cache size (by total characters) to 100. (Yes, it can be controlled both by count and by total amount, for cases like this.)

I think your environment can handle these values.

firstunicorn commented 5 months ago

Thanks for the reply! It seems to help a bit; now it is about 10 seconds for one direction instead of 10-15 (or maybe it's just placebo, lol).

Also, I experimented with caching in the past and it felt like it made no difference with any combination of parameters. The only thing that really worked at first was switching to "Periodic and on Events" with a 4-second interval and batch syncing. Now the only thing that helps a bit is removing the CouchDB database on the hoster side and creating it again while discarding the local database, but that only works for a short period of time.

Also, I noticed that right now I have 32000 chunks (or docs?). If I delete the DB and reset everything, it shows just 6844 docs, which still seems like a lot considering I have only 250 files (if it helps, the total size of the vault is about 100MB with all attachments and images). Maybe I'm mixing up the terms docs/chunks, so here are some logs after deleting and resetting everything:

Logs

```
5/2/2024, 9:56:44 PM->Cache initialized 300 / 250000000000
5/2/2024, 9:56:44 PM->loading plugin
5/2/2024, 9:56:44 PM->Self-hosted LiveSync v0.23.3 0.23.3
5/2/2024, 9:56:44 PM->xxhash for plugin initialised
5/2/2024, 9:56:44 PM->Waiting for ready...
5/2/2024, 9:56:44 PM->Cache initialized 10 / 1000000000
5/2/2024, 9:56:44 PM->Cache initialized 300000 / 100000000
5/2/2024, 9:56:44 PM->Newer xxhash has been initialised
5/2/2024, 9:56:44 PM->Opening Database...
5/2/2024, 9:56:44 PM->Database info
5/2/2024, 9:56:44 PM->{ "doc_count": 6844, "update_seq": 6844, "db_name": "Obsidian Vault-anon-livesync-v2-indexeddb", "auto_compaction": false, "adapter": "indexeddb" }
5/2/2024, 9:56:44 PM->Database is now ready.
5/2/2024, 9:56:45 PM->Log window opened
5/2/2024, 9:56:46 PM->Initialize and checking database files
5/2/2024, 9:56:46 PM->Checking deleted files
5/2/2024, 9:56:46 PM->Checking expired file history
5/2/2024, 9:56:52 PM->There are no old documents
5/2/2024, 9:56:52 PM->Checking expired file history done
5/2/2024, 9:56:52 PM->Collecting local files on the storage
5/2/2024, 9:56:52 PM->Collecting local files on the DB
5/2/2024, 9:56:52 PM->Collecting local files on the DB: 25
5/2/2024, 9:56:52 PM->Collecting local files on the DB: 50
5/2/2024, 9:56:52 PM->Collecting local files on the DB: 75
5/2/2024, 9:56:52 PM->Collecting local files on the DB: 100
5/2/2024, 9:56:53 PM->Collecting local files on the DB: 125
5/2/2024, 9:56:53 PM->Collecting local files on the DB: 150
5/2/2024, 9:56:53 PM->Collecting local files on the DB: 175
5/2/2024, 9:56:53 PM->Collecting local files on the DB: 200
5/2/2024, 9:56:54 PM->Collecting local files on the DB: 225
5/2/2024, 9:56:54 PM->Collecting local files on the DB: 250
5/2/2024, 9:56:54 PM->Opening the key-value database
5/2/2024, 9:56:54 PM->Updating database by new files
5/2/2024, 9:56:54 PM->UPDATE DATABASE: Nothing to do
5/2/2024, 9:56:54 PM->UPDATE STORAGE: Nothing to do
5/2/2024, 9:56:57 PM->Initialized, NOW TRACKING!
5/2/2024, 9:56:57 PM->Cache initialized 300000 / 100000000
5/2/2024, 9:56:57 PM->Modifying callback of the save command
5/2/2024, 9:56:57 PM->Additional safety scan..
5/2/2024, 9:56:57 PM->There are no conflicted files
5/2/2024, 9:56:57 PM->Additional safety scan done
5/2/2024, 9:57:19 PM->OneShot Sync begin... (sync)
5/2/2024, 9:57:20 PM->Looking for the point last synchronized point.
5/2/2024, 9:57:21 PM->Replication paused
5/2/2024, 9:57:22 PM->Replication activated
5/2/2024, 9:57:26 PM->↑0 ↓2 (6446)
5/2/2024, 9:57:27 PM->The request may have failed. The reason sent by the server: 404: Object Not Found
5/2/2024, 9:57:27 PM->{"error":"not_found","reason":"missing"}
5/2/2024, 9:58:25 PM->Replication closed
```

In summary, the fewer chunks/docs there are, the faster everything works, especially for large .md files. Also, after some experiments, I noticed that syncing is currently much faster when I sync manually with "Disable all automatic" turned on. So weird...

firstunicorn commented 5 months ago

Also, after increasing the cache size I noticed that the sync process gets stuck less often (which used to happen sometimes), but it doesn't seem to really affect the sync speed when it is not stuck.

vrtmrz commented 4 months ago

Thank you for confirming this! The situation is still a little strange. Another factor is that if the chunk size is not the same on all devices, the cache may not work properly and a large amount of space may be consumed. However, the latest version can check this automatically.

Changing the configuration clears the cache. So the problem might be more fundamental: there are simply too many chunks. In early versions we could configure the chunk size, but that setting is hidden now.

Would you please try increasing minimumChunkSize in data.json? The default is 20, but it can be 200 for files like these. (If we bring this setting back to the front again, we should implement an automatic adjustment feature. And if it works, I would love to.)
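For illustration, the relevant part of the edit might look like the excerpt below. `minimumChunkSize` and the value 200 come from the comment above; every other key in data.json (which normally sits in the plugin's folder under the vault's `.obsidian/plugins` directory) is omitted here:

```json
{
  "minimumChunkSize": 200
}
```

Since a mismatched chunk size between devices can keep the cache from working properly (as noted above), the same value should presumably be set on every device.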

> Also, I noticed that right now I have 32000 chunks (or docs?). If I delete the DB and reset everything, it shows just 6844 docs, which still seems like a lot considering I have only 250 files (if it helps, the total size of the vault is about 100MB with all attachments and images). Maybe I'm mixing up the terms docs/chunks, so here are some logs after deleting and resetting everything:

Self-hosted LiveSync transfers files in small chunks, so these files are split into many of them. (minimumChunkSize defines the size of the smallest chunk.) It was also designed never to delete chunks once created, so that they can be reused if they already exist. This is efficient at reducing incoming traffic, but on the other hand it leaves a lot of unnecessary chunks behind. Therefore, in v0.23.4, I have implemented the feature Incubate Chunks in Document. It might also be effective here.
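To see why a larger minimumChunkSize means fewer chunks, here is an illustrative sketch (not the plugin's actual splitting algorithm): text is cut at line boundaries, but never into pieces smaller than the minimum, so raising the minimum merges many short lines into one chunk.

```typescript
// Illustrative sketch only; not the plugin's actual chunking algorithm.
// Text is split at line boundaries, but a chunk is never emitted while it is
// shorter than `minimumChunkSize` characters, so a larger minimum produces
// fewer, larger chunks for the same file.
function splitIntoChunks(text: string, minimumChunkSize: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const line of text.split("\n")) {
    current += line + "\n";
    if (current.length >= minimumChunkSize) {
      chunks.push(current);
      current = "";
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// A TODO-list-like file made of many short lines:
const sample = Array.from({ length: 1000 }, (_, i) => `- [ ] Task ${i}`).join("\n");
console.log(splitIntoChunks(sample, 20).length);  // hundreds of small chunks
console.log(splitIntoChunks(sample, 200).length); // far fewer, larger chunks
```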

Would you mind verifying these two points?