urbit / vere

An implementation of the Urbit runtime
MIT License
55 stars 37 forks source link

Ridiculous amounts of disk writes every 2 minutes #429

Open Rudd-O opened 1 year ago

Rudd-O commented 1 year ago

Every exactly 2 minutes, Urbit's vere process writes hundreds and hundreds of megabytes of data to disk, complete with fdatasync() and all. Here is a biotop trace:

08:36:17 loadavg: 0.27 0.13 0.06 3/182 1194

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
103    kworker/u4:1     W 202 16  xvdb      6241 213316.0   0.74
454    jbd2/xvdb-8      W 202 16  xvdb         1     4.0   0.15
1193   biotop           R 202 0   xvda         1     4.0   4.43
0                       R 202 16  xvdb         2     0.0 171.88

08:36:18 loadavg: 0.27 0.13 0.06 1/182 1194

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
782    .run             W 202 16  xvdb      2112 125232.0   0.96
103    kworker/u4:1     W 202 16  xvdb      1874 77356.0   1.59
454    jbd2/xvdb-8      W 202 16  xvdb         1    24.0   0.76

08:36:19 loadavg: 0.27 0.13 0.06 1/182 1194

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms

08:36:20 loadavg: 0.27 0.13 0.06 1/182 1194

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
454    jbd2/xvdb-8      W 202 16  xvdb         1     4.0   0.55
0                       R 202 16  xvdb         1     0.0 2305.72

08:36:21 loadavg: 0.33 0.15 0.07 1/182 1194

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
103    kworker/u4:1     R 202 16  xvdb        29 416304.0   0.17
759    .run             W 202 16  xvdb       320  1344.0   0.16
454    jbd2/xvdb-8      W 202 16  xvdb         2   112.0   2.13
197    jbd2/xvda3-8     W 202 0   xvda         2    36.0   0.59
0                       R 202 0   xvda         2     0.0   1.01
0                       R 202 16  xvdb        97     0.0   7.14

detailed biosnoop writes.txt

I am using the latest vere, but this has been like that for at least 6 months since I started collecting node exporter data using Prometheus. I run 35 VMs on this system, some sporting quite write-heavy workloads, yet Urbit alone is responsible for over 90% of all disk writes at any point in time. Note that the Urbit log itself does not increase in size over time — it is stable at around 4 GB on my system.

What could vere possibly doing to warrant half a gigabyte of writes every two seconds? Is it rewriting a quarter of the log just for shits and giggles?

Attached in comment below, a detailed biosnoop log of the event, and evidence of the absurd volume of Urbit disk writes over time.

Rudd-O commented 1 year ago

detailed biosnoop writes.txt

urbit is crazy

matthew-levan commented 1 year ago

vere-v2.7 is running a snapshot validation function every time it takes a snapshot, which is most likely the culprit here. We are working on resolving this issue in #428.

ashelkovnykov commented 1 year ago

@Rudd-O

but this has been like that for at least 6 months since I started collecting node exporter data using Prometheus

This part is confusing me - are you saying that the issue you're describing has been happening for 6 months? As @matthew-levan says, there was a bug introduced in Vere 2.7 which caused the exact symptoms you're describing, but it was release only this past Tuesday.

Rudd-O commented 1 year ago

I'm actually running vere 2.5. This problem has existed for many months. The exact start date was January 5. At the time I was running vere v1.13. You can see in the graph that things "normalize" (still to an absurd degree of disk writes) around February 7-8.

The chart shows the total daily writes per VM, where Urbit is by far the most dominant one. Please note the volume of writes — the Y axis is denominated in megabytes. During the past four months, Urbit has written 360 gigabytes on average every day. This is mostly rewriting content in place, it seems, since the log file is clearly not being added significantly to.

urbit crazy writes again

file times

mopfel-winrux commented 1 year ago

Vere writes a snapshot every 2 minutes which is probably what youre seeing. There is a flag --snap-time you can use to set the snapshotting rate in minutes.

Rudd-O commented 1 year ago

Yes I am seeing the snapshot happen, but why does vere need to write hundreds of megabytes during the snapshot?

mopfel-winrux commented 1 year ago

A snapshot is your loom(Urbit's memory arena) snapshot. This is done so you don't have to replay multiple events if vere didn't gracefully close. The 2 minute is the default time but you can change it to a longer time if youre hosting on a stable platform, or if you don't mind a longer replay on a crash.

Rudd-O commented 1 year ago

I don't think we're making progress here, and this is getting to be increasingly frustrating for me. I know what the snapshot is. I am familiar with Urbit and I have been running it for close to three years now.

My question, which I repeat once again here, is this:

What possible excuse is there for vere to write hundreds of megabytes every two minutes during the snapshot, for a grand total of 360 gigabytes per day? My ship is not a galaxy; it is not even a star. I am not seeing ginormous levels of activity either. I am subscribed to perhaps two dozen groups. What can otherwise be happening over the last 4 months that leads to this pathological behavior which will destroy my SSDs?

If you could please give me some information to go on and debug the problem, that would be phenomenal. Thanks in advance.

matthew-levan commented 1 year ago

@Rudd-O just make set the snapshot interval to something higher than 2 minutes with urbit --snap-time 60 or something. Note that the value is in minutes there. That should do the trick. That will reduce your total daily writes from ~360GB to ~12GB.

Re: your question, it is normal for Urbit to write hundreds of megabytes every two minutes, as the default snapshot interval is 2. Maybe we should increase that default to something like 30 or 60. Urbit has written snapshots to disk every 2 minutes as long as I can remember, which is longer than 4 months, so I'm not sure about timing but I can say definitively that the snapshot interval is the reason why you're experiencing high daily write volume.

Rudd-O commented 1 year ago

it is normal for Urbit to write hundreds of megabytes every two minutes, as the default snapshot interval is 2.

I don't get why it would be normal to write hundreds of megabytes on the log for a ship that literally spends idle 99.9% of the time. Given the lifetime of consumer SSDs vis a vis writes, I would not call that "normal", I would call that criminal. I'm running a full Bitcoin client on the same machine, there's a 680 GB or so of blockchain in its disk, yet it writes one fiftieth of the data volume.

Am I to understand that pretty much every Urbit user out there experiences this exact behavior? I refuse to believe this is normal, and my charts clearly show that it wasn't happening four months ago.

belisarius222 commented 1 year ago

This is indeed a lot of data to write for snapshots. You can tune your snapshot interval as described above, but we might want to increase the default snapshot interval.

For development, it's nice to have a quick snapshot interval so that if a ship crashes, the average number of events to rerun is low enough that restarting your pier is quick. Real ships that aren't used for development, though, aren't usually restarted often at all. I propose keeping the 2-minute default snapshot interval for fake ships but increasing it to 60 minutes for real ships.

@Rudd-O I'd be curious to see the amount of disk writing your ship does at 10, 30, and 60 minute snapshot intervals, and you probably want 60 for your use-case. Our guess is that each disk write won't increase much from the 125MB that you're seeing despite the much larger number of events in each interval. The reason is that 125MB is likely already much higher than the amount of new data added to the system during that time -- there's likely some write amplification from free list management, home road fragmentation, and reference count changes, similar to what you might see in a database.

Running |meld could also reduce the write size for each snapshot for a while, so you might want to try that.

Rudd-O commented 1 year ago

Will first try meld and then report.

Rudd-O commented 1 year ago

|meld seems to have vastly reduced the amount of writes that happen every two minutes. Went from 60-120 MB/s for several seconds to ~4 MB/s.

joemfb commented 1 year ago

Worth noting that |mass and |pack also dirty every page of the snapshot (due to "intrusive" metadata accounting), so running those frequently will result in significant write amplification.

Rudd-O commented 7 months ago

Should be reopened. Meld only temporarily fixes the issue. A few weeks later, the increase in writes is evident.

Rudd-O commented 7 months ago

Spot the melds:

Screenshot_20240127_172259

(December 7 and Jan 26.)

It's not acceptable to be writing 8 megabytes per second on a fully idle Urbit ship. I may just have to shut it down for good.