pmem / ndctl

A "device memory" enabling project encompassing tools and libraries for CXL, NVDIMMs, DAX, memory tiering and other platform memory device topics.

incorrect and inconsistent namespace sizing with --map argument #167

Open krisiasty opened 3 years ago

krisiasty commented 3 years ago

When a namespace of type fsdax (or devdax) is created in the default map mode (--map=dev), the space for the necessary metadata is taken out of the namespace itself, reducing its usable size, instead of being allocated from the available space in the region.

Although this behaviour is documented, it is not well understood and is counter-intuitive. The --size argument specifies the requested namespace size, not the underlying space required. The result should therefore be consistent: the resulting namespace should provide exactly the space requested, no matter where the metadata is located (on the device itself or in memory).

The current behaviour has several repercussions. For example, the pmem-csi driver in direct mode needs to manually compensate the requested pmem volume size (represented by a namespace) in order to create a volume of the requested size, as the CSI specification requires (which, by the way, it currently fails to do). Even though the general algorithm for how much space is required is available, there is probably some additional alignment involved, which makes the calculation more complicated. This also forces external applications to know the internal workings of ndctl/libndctl and how to calculate this overhead, which should be avoided. Imagine what would happen if a future version needs to change the implementation.
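To illustrate the kind of compensation an external application ends up doing, here is a minimal sketch in Python. It assumes the figure cited later in this thread (64 bytes of metadata per 4096-byte page) and further assumes that the metadata area is rounded up to the namespace alignment. That rounding rule and all function names are illustrative assumptions, not the ndctl/libndctl implementation, and measurements reported in this thread suggest the real overhead is larger.

```python
PAGE_SIZE = 4096   # bytes per page, figure quoted in the thread
ENTRY_SIZE = 64    # bytes of map metadata per page, figure quoted in the thread
MIB = 1 << 20
GIB = 1 << 30


def round_up(value, align):
    """Round value up to the next multiple of align."""
    return -(-value // align) * align


def estimate_map_overhead(raw_size, align=2 * MIB):
    """Naive estimate of the --map=dev metadata size for a raw allocation.

    Assumption: the metadata area is padded to the alignment.
    """
    pages = raw_size // PAGE_SIZE
    return round_up(pages * ENTRY_SIZE, align)


def usable_size(raw_size, align=2 * MIB):
    """Estimated usable size left after the metadata is subtracted."""
    return raw_size - estimate_map_overhead(raw_size, align)


def raw_size_for(usable, align=2 * MIB):
    """Smallest aligned raw size whose estimated usable size covers the request."""
    raw = round_up(usable, align)
    while usable_size(raw, align) < usable:
        raw += align
    return raw


if __name__ == "__main__":
    request = 120 * GIB
    raw = raw_size_for(request)
    print("request:", request, "raw:", raw, "overhead:", raw - request)
```

Under these assumptions the estimate depends on the alignment as well as the size, which is one reason a caller cannot simply add a fixed percentage to the request.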

I understand that, for backward compatibility with existing software, you cannot easily change the current behaviour. One possible solution would be to introduce an additional argument (switch) that instructs the tool/library to allocate the map outside the namespace when set, and from within the namespace (the current behaviour) when not set.

djbw commented 3 years ago

Fair critique, we'll take a look.

pohly commented 3 years ago

For PMEM-CSI, the functionality to create a namespace of exactly the requested size would have to be in libndctl. That's what we are using, not the code layered on top of the library in ndctl.

I recently tried to reimplement ndctl create-namespace --reconfig based on libndctl and eventually gave up in favor of invoking the binary because quite a lot of code was only found in the binary and would have to be repeated.

okartau commented 3 years ago

Having the logic turned around so that what we ask for is what we get (i.e. the metadata size gets added internally instead of being subtracted from the usable size) would be best from the pmem-csi point of view. If that's not possible, can you show a formula to estimate the overhead so that pmem-csi can increase the size before making the request? The formula available now suggests 64 bytes per 4096-byte block, but in practice that formula does not work: trials show we need more. The overhead also varies significantly depending on the selected alignment value. A few fsdax-mode allocation trials (sizes 120GiB, 240GiB, 1500GiB) show that the real overhead is:

pohly commented 3 years ago

FYI, in PMEM-CSI we started using the same approach as the current ndctl create-namespace --size: the size refers to how much PMEM is consumed from the region, not the size of the resulting block device. This may surprise users, but it is consistent and predictable.

Perhaps ndctl list could be enhanced to show the "raw size" (not sure what to call it) of a namespace? That would make it more obvious to users that they got a 1GiB namespace when asking for one with --size 1G. Right now it is confusing because only the smaller size of the block device is shown.

As discussed on IRC, the information is available (ndctl_pfn_get_resource(pfn) - ndctl_namespace_get_resource(ndctl_pfn_get_namespace(pfn))).
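As a rough illustration of that expression, the sketch below does the same subtraction in Python on plain integers. The function names and the example addresses are hypothetical; in a real program the two resource values would come from the libndctl calls named above. It also assumes that nothing is trimmed from the end of the namespace, so that the raw size is the reported size plus the offset of the data area.

```python
MIB = 1 << 20
GIB = 1 << 30


def metadata_offset(pfn_resource, namespace_resource):
    """Distance from the start of the namespace to the start of the data area.

    Mirrors: ndctl_pfn_get_resource(pfn) - ndctl_namespace_get_resource(ns)
    """
    if pfn_resource < namespace_resource:
        raise ValueError("data area cannot start before the namespace")
    return pfn_resource - namespace_resource


def raw_size(reported_size, pfn_resource, namespace_resource):
    """Reported (block device) size plus the metadata offset.

    Assumption: no additional space is trimmed at the end of the namespace.
    """
    return reported_size + metadata_offset(pfn_resource, namespace_resource)


if __name__ == "__main__":
    # Hypothetical values: a namespace at 4 GiB whose data starts 16 MiB in.
    ns_start = 4 * GIB
    data_start = ns_start + 16 * MIB
    reported = GIB - 16 * MIB
    print(raw_size(reported, data_start, ns_start) == GIB)
```

A "raw size" field in the ndctl list output could be derived the same way, alongside the existing block device size.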