Subtypes of Special Devices (metadata and special_small_blocks)

jmmitc06 commented 9 months ago

Describe the feature you would like to see added to OpenZFS

The recently added special device is a fantastic addition; however, if special_small_blocks is used, the small blocks must be stored on the same device as the metadata. This is problematic for small special devices such as Optane NVMe drives, or for use cases where the number of small blocks is very large. I believe it would be helpful to have the option to store the small blocks on a different device than the metadata. In essence, this would allow for sub-types of special devices: some for metadata only and some for small blocks only.

How will this feature improve OpenZFS?

This would allow storage devices to be used in their optimal use cases. For instance, my workload has a large number of small files as well as very large files. The small blocks are too large in total to fit on my 58 GB Optane modules, so I use mirrored SATA SSDs for special. The SSD pool is fast, but the QD1 (low queue depth) performance of Optane would be better for metadata; however, it is a bit of a waste to have all the small records stored on the Optane devices. Thus I'm left either using the suboptimal SSDs for special or giving up special_small_blocks to use the Optane. If I could use the Optane devices as a metadata-only special and the SSD pool as a special_small_blocks-only special, the devices would be used more optimally.
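The capacity trade-off described above can be estimated with a short calculation. This is a rough sketch, not an OpenZFS tool: the function names, the 32 KiB threshold, the 75% headroom, and the sample workload are all assumptions made for illustration. It also simplifies by treating every file no larger than the threshold as one small block and ignoring compression, metadata, and allocation overhead.

```python
# Rough capacity check: would the small blocks fit on a given special vdev?
# Simplifying assumptions (not from the post): every file no larger than the
# threshold counts as one small block; compression, metadata and allocation
# overhead are ignored.

GIB = 1024 ** 3
KIB = 1024


def small_block_bytes(file_sizes, threshold):
    """Sum the sizes of all files that are at or below the threshold."""
    return sum(size for size in file_sizes if size <= threshold)


def fits(file_sizes, threshold, device_bytes, headroom=0.75):
    """True if the small blocks use no more than `headroom` of the device."""
    return small_block_bytes(file_sizes, threshold) <= device_bytes * headroom


if __name__ == "__main__":
    # Made-up workload: 4 million 16 KiB files plus 200 files of 8 GiB each.
    sizes = [16 * KIB] * 4_000_000 + [8 * GIB] * 200
    threshold = 32 * KIB      # hypothetical special_small_blocks value
    optane = 58 * 10 ** 9     # 58 GB module, as in the post
    total = small_block_bytes(sizes, threshold)
    print(f"small blocks: {total / GIB:.1f} GiB")
    print("fits on 58 GB Optane:", fits(sizes, threshold, optane))
```

On a real pool the file sizes would come from a scan of the dataset rather than a synthetic list, and the result should be read as a lower bound on the space needed.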

Additional context

This is admittedly an edge case; however, I believe it would allow more flexibility in pool architecture. Separating metadata from small blocks would allow hybrid pools such as the one I want: a small but very fast metadata vdev, a larger but slower SSD vdev for small files, and large HDD vdevs for the large files.
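To make the request concrete, the placement policy being asked for could be sketched roughly as follows. This models the proposed behavior only; it is not how OpenZFS allocates blocks today, and `Tier`, `choose_tier`, and the parameter names are hypothetical. The one detail borrowed from existing behavior is that a special_small_blocks value of 0 means no small data blocks are redirected.

```python
from enum import Enum


class Tier(Enum):
    """Hypothetical destinations in the proposed hybrid pool."""
    METADATA_SPECIAL = "metadata-only special (e.g. Optane mirror)"
    SMALL_BLOCK_SPECIAL = "small-block-only special (e.g. SATA SSD mirror)"
    NORMAL = "normal vdevs (e.g. HDDs)"


def choose_tier(is_metadata, block_size, small_block_threshold):
    """Pick a destination for one block under the proposed split.

    A threshold of 0 disables small-block redirection, so all data
    blocks go to the normal vdevs.
    """
    if is_metadata:
        return Tier.METADATA_SPECIAL
    if small_block_threshold > 0 and block_size <= small_block_threshold:
        return Tier.SMALL_BLOCK_SPECIAL
    return Tier.NORMAL


if __name__ == "__main__":
    threshold = 32 * 1024  # hypothetical special_small_blocks value
    for is_meta, size in [(True, 4096), (False, 16 * 1024), (False, 1024 * 1024)]:
        tier = choose_tier(is_meta, size, threshold)
        print(f"metadata={is_meta!s:<5} size={size:>8} -> {tier.value}")
```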

tonyhutter commented 9 months ago

Not quite an answer to your question, but have you considered using the Optane device as a cache device in your pool?

jmmitc06 commented 9 months ago

Actually that's what I am doing currently. It works pretty well, but I wish my writes would happen at Optane speeds, which isn't possible with cache (not that it really matters, given my networking is slower than my pool).

tonyhutter commented 9 months ago

> I wish my writes would happen at optane speeds which isn't possible with cache

If that's what you're after, you could try using it as a log device. I know people have used Optane drives in the past for that.

jmmitc06 commented 9 months ago

Yep, I have 8 GB of the 58 GB Optane devices in a mirror for log and the rest as cache.

amotin commented 9 months ago

From my experience, the low-capacity Optanes, while having nice write latency, just do not have sufficient write throughput. So their usability as SLOG is limited, unless you are building some basic ~1Gbps-scale NAS. Later ones, like the 480GB 905p, can do a more decent 2.2GB/s, but that still only gets you to ~10Gbps scale. Even the later Optane RAM, aka DCPMM, while able to read as fast as RAM and extremely fast on short writes, also gets too slow on large writes to be a SLOG. It has much better write endurance (TBW) than flash, but to get decent throughput it needs much more capacity than a SLOG requires. So I would use a small Optane either as a special vdev or as L2ARC. In the latter case you can at least tune the ZFS L2ARC write speed higher for faster L2ARC refill without worrying about wearing the device out too soon. A couple of larger 480GB Optanes I use by themselves as primary storage in my development server, where I don't need more capacity.
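The throughput-versus-network comparison above comes down to a unit conversion. The sketch below is illustrative only: the function names and the list of link speeds are assumptions, and it takes the worst case where every incoming byte is a synchronous write that passes through the log device, with protocol overhead ignored.

```python
# Compare a log device's sustained write throughput with network line rates.
# Assumption (not from the thread): every incoming byte is a synchronous
# write that must pass through the SLOG; protocol overhead is ignored.


def gbytes_to_gbits(gbytes_per_s):
    """Convert GB/s to Gb/s (8 bits per byte, decimal units)."""
    return gbytes_per_s * 8


def slog_keeps_up(device_gbytes_per_s, link_gbits_per_s):
    """True if the device can absorb a fully saturated link of sync writes."""
    return gbytes_to_gbits(device_gbytes_per_s) >= link_gbits_per_s


if __name__ == "__main__":
    device = 2.2  # GB/s, the 905p figure quoted in the comment
    for link in (1, 10, 25, 40):
        verdict = "ok" if slog_keeps_up(device, link) else "too slow"
        print(f"{link:>3} Gb/s link: {verdict}")
```

At 2.2GB/s the raw figure is 17.6 Gb/s, which covers a 10 Gb/s link but falls short of 25 Gb/s, consistent with the "~10Gbps scale" estimate.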
