pmem / pmdk

Persistent Memory Development Kit
https://pmem.io

[pmempool create] should support auto growing poolsets (directories) #4223

Closed sscargal closed 1 year ago

sscargal commented 5 years ago

FEAT: 'pmempool create' should support creating auto-growing poolsets

Rationale

This feature request is different to the pmempool resize feature pmem/pmdk#4170 in that this one allows the user to create a poolset from the start that auto grows on demand to the limit of available space within the filesystem. The pmempool resize feature allows users to grow an existing single pool.

At creation time, it is not always known how large a pool should be. The amount of data plus space to grow is a good starting point. Poolsets support the ability to dynamically grow (in directory mode) by adding small 128MB pools to an existing poolset. This requires manual administration to create the initial DIRECTORY based poolset. Currently, pmempool create does not support this.

Description

It would be nice if pmempool create allowed the user to specify a base directory, initial size, optional max size, and growth chunk size.

From poolset(5) - http://pmem.io/pmdk/manpages/linux/master/poolset/poolset.5

DIRECTORIES
Providing a directory as a part’s pathname allows the pool to dynamically create files and consequently removes the user-imposed limit on the size of the pool.

The size argument of a part in a directory poolset becomes the size of the address space reservation required for the pool. In other words, the size argument is the maximum theoretical size of the mapping. This value can be freely increased between instances of the application, but decreasing it below the real required space will result in an error when attempting to open the pool.

The directory must NOT contain user created files with extension .pmem, otherwise the behavior is undefined. If a file created by the library within the directory is in any way altered (resized, renamed) the behavior is undefined.

A directory poolset must exclusively use directories to specify paths - combining files and directories will result in an error. A single replica can consist of one or more directories. If there are multiple directories, the address space reservation is equal to the sum of the sizes.

The order in which the files are created is unspecified, but the library will try to maintain equal usage of the directories.

By default pools grow in 128 megabyte increments.

Only poolsets with the SINGLEHDR option can safely use directories.
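
For reference, a directory poolset of the kind the man page describes can be written by hand. The sketch below only renders the poolset text. The helper name `render_directory_poolset` is hypothetical and not part of PMDK; the layout (a `PMEMPOOLSET` header, `OPTION SINGLEHDR`, then one `<size> <directory>` line per part) follows the poolset examples later in this thread.

```python
def render_directory_poolset(parts):
    """Render the text of a directory-based poolset file.

    `parts` is a list of (size, directory) pairs, for example
    [("10GiB", "/pmemfs0/")]. Each size is the address space
    reservation for that directory, not an initial allocation.
    """
    if not parts:
        raise ValueError("at least one directory part is required")
    # Directory poolsets need the SINGLEHDR option to be safe.
    lines = ["PMEMPOOLSET", "OPTION SINGLEHDR"]
    for size, directory in parts:
        lines.append(f"{size} {directory}")
    return "\n".join(lines) + "\n"


print(render_directory_poolset([("10GiB", "/pmemfs0/")]), end="")
```

The resulting file would then be passed to `pmempool create` in place of a pool file.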

API Changes

pmempool needs to support files and directories as input arguments. Currently it assumes just a file.

Internally, no API changes should be needed. The feature is already integrated into PMDK; we just need a user interface to set it up. The only caveat may be if we want to support multiple directories or file systems to store the poolset parts in case one fills up.

Implementation details

The proposal would provide the following user command options and extend the use of existing ones (--size and --max-size):

pmempool create [<options>] [<type>] [<bsize>] <file|directory>

Available options:
       -a, --autogrow

       Create an auto growing poolset

       -g, --growby <size>

       Auto growing poolsets will automatically grow by the given increment size.  Defaults to 128MiB.

       -M, --max-size <size>

       Set the maximum size of auto growing pools if a value is provided, or set the size of pool to available space of underlying file system if no value is provided.

       -s, --size <size>

       Size of pool file or initial size of auto growing pools.

Examples:

Example 1:
Create an auto growing pool within the mounted /mnt/pmemfs file system with an initial size of 1GiB. This pool will grow until the file system runs out of space.

    pmempool create --autogrow --size=1GiB /mnt/pmemfs

Example 2:
Create an auto growing pool within the mounted /mnt/pmemfs file system with an initial size of 1GiB and a maximum size of 10GiB. The pool will grow in 1GiB increments.

    pmempool create --autogrow --size=1GiB --max-size=10GiB --growby 1GiB /mnt/pmemfs

If we wanted to support concatenating or striping auto growing pools across multiple file systems, we should also allow this syntax:

pmempool create [<options>] [<type>] [<bsize>] <file|directory> ...

Example 3:
Create an auto growing pool across the mounted /mnt/pmemfs0, /mnt/pmemfs1, and /mnt/pmemfs2 file systems with an initial size of 1GiB. The pool will grow in 1GiB increments.

    pmempool create --autogrow --size=1GiB --growby 1GiB /mnt/pmemfs0 /mnt/pmemfs1 /mnt/pmemfs2
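
To make the proposed options concrete, here is a rough Python sketch of how they could be parsed and what Example 2 implies arithmetically. None of this exists in pmempool: the parser, `parse_size`, and `growth_steps` are hypothetical names, and only binary suffixes such as `MiB` and `GiB` are handled. Both spellings of the maximum-size option are accepted.

```python
import argparse
import re

UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def parse_size(text):
    """Convert a string such as '1GiB' or '128MiB' to a byte count."""
    match = re.fullmatch(r"(\d+)(KiB|MiB|GiB|TiB)?", text)
    if match is None:
        raise argparse.ArgumentTypeError(f"invalid size: {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS.get(unit, 1)


def build_parser():
    """Build a parser for the options proposed in this issue (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pmempool-create-sketch")
    parser.add_argument("-a", "--autogrow", action="store_true")
    parser.add_argument("-g", "--growby", type=parse_size,
                        default=parse_size("128MiB"))
    # Accept both spellings of the maximum-size option.
    parser.add_argument("-M", "--max-size", "--maxsize", dest="max_size",
                        type=parse_size, default=None)
    parser.add_argument("-s", "--size", type=parse_size, required=True)
    parser.add_argument("paths", nargs="+")
    return parser


def growth_steps(args):
    """Whole increments between initial and maximum size, or None if unbounded."""
    if args.max_size is None:
        return None
    return (args.max_size - args.size) // args.growby


args = build_parser().parse_args(
    ["--autogrow", "--size=1GiB", "--maxsize=10GiB", "--growby", "1GiB",
     "/mnt/pmemfs"])
print(growth_steps(args))  # 9 increments of 1GiB between 1GiB and 10GiB
```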

Meta

pbalcer commented 5 years ago

That's a good idea, and should be relatively simple to implement. 👍

seghcder commented 5 years ago

Coming from an operations perspective, can we also include shrinking pmem files? I can imagine it's harder, as we'd need to move the live data back within the smaller target size, like a defrag.

Another feature/option might be to allow adding (and removing) a pool file to/from an existing poolset. (Edit: dynamic adding is supported, based on the man page.)

Shrink might be better as its own FEAT, as I agree the ability to autogrow is a nice feature.

Is there also a way to get the "current utilisation" of a pool from a function call within libpmemobj while running?

pbalcer commented 5 years ago

We might implement pool shrinking after we implement pmem/pmdk#4187. Right now it's just not realistic. And yes, you should be able to add parts to an existing poolset.

How would you define "current utilization"? % of occupied space?

seghcder commented 5 years ago

In general, yes. However, with fragmentation it might be misleading too, so it can only be taken as an indicator.

E.g., if someone looks at their pool and it reports 10% free of a 100GB pool, they might expect 10GB of contiguous free space. Then a 5GB allocation fails because the largest contiguous space is only 4GB. Or they are confused about why an autogrow was triggered when it seemed like there was sufficient space.
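
The scenario above can be illustrated with a few lines of Python. The free extent sizes are invented for the example, and `free_stats` is a hypothetical helper, not a libpmemobj call.

```python
GIB = 2**30


def free_stats(free_extents):
    """Return (total free bytes, largest contiguous free extent in bytes)."""
    return sum(free_extents), max(free_extents, default=0)


# Made-up free extents: 10 GiB free in total, but at most 4 GiB contiguous.
free_extents = [4 * GIB, 3 * GIB, 2 * GIB, 1 * GIB]
total_free, largest = free_stats(free_extents)

request = 5 * GIB
print(total_free >= request)  # True: the total looks sufficient
print(largest >= request)     # False: no single extent can hold the request
```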

Another stat might be "largest contiguous space"... though that might be hard or expensive to track?

Ideally we could get the info from "pmempool info" through the API from within an app with the pool open. I understand pmempool info only works on offline pools at present. However, in future an app may want to provide an SNMP interface and/or send traps based on info from those stats.

Alternatively, could pmempool info (and even check with no repair) be adapted to run in read-only mode against an open pool? That way, monitoring tools (Nagios, SCOM, SolarWinds, etc.) could write an agent for any pool.

pbalcer commented 5 years ago

I've been very reluctant to introduce any generic interfaces around space utilization for the reasons you list. Right now there's an API to retrieve the total size of allocated objects, but that might not be very useful.

Tracking largest free contiguous space - we could probably implement a rough estimate, but tracking this accurately would negatively impact scalability and overall performance. One statistic I can reasonably and efficiently expose is number of free/used/run chunks in the pool. This can be used to approximately deduce utilization for the pool. We could also track allocated/freed objects at the allocation class level.
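
As a rough illustration of how chunk counts could be turned into a utilization estimate: the sketch below assumes the three counters are available as plain integers, which is not an existing API, and `estimate_utilization` is a hypothetical name. Because run chunks hold many small allocations and may be only partly filled, it returns a range rather than a single number.

```python
def estimate_utilization(free_chunks, used_chunks, run_chunks):
    """Return (lower, upper) bounds on the fraction of the pool in use.

    lower: run chunks counted as empty
    upper: run chunks counted as completely full
    """
    total = free_chunks + used_chunks + run_chunks
    if total == 0:
        return 0.0, 0.0
    lower = used_chunks / total
    upper = (used_chunks + run_chunks) / total
    return lower, upper


low, high = estimate_utilization(free_chunks=50, used_chunks=30, run_chunks=20)
print(f"between {low:.0%} and {high:.0%} of chunks in use")
```

Per-class allocation statistics would be needed to narrow the range further.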

sscargal commented 5 years ago

Within this feature enhancement, we should support the AUTO value for size within the poolset file when using FSDAX. Currently, directory support uses the size argument as the maximum size.

From poolset(5):

The size argument of a part in a directory poolset becomes the size of the address space reservation required for the pool. In other words, the size argument is the maximum theoretical size of the mapping. This value can be freely increased between instances of the application, but decreasing it below the real required space will result in an error when attempting to open the pool.

AUTO works for devdax only.

Pools created on Device DAX have additional options and restrictions:

The size may be set to “AUTO”, in which case the size of the device will be automatically resolved at pool creation time.

In other words, the following myautogrowpool.set configuration fails:

    PMEMPOOLSET
    OPTION SINGLEHDR
    AUTO /pmemfs0/

    # pmempool create --layout="mylayout" obj myautogrowpool.set
    error: 'myautogrowpool.set' -- directory based pools are not supported for poolsets with headers (without SINGLEHDR option)
    error: creating pool file failed

But this works:

    PMEMPOOLSET
    OPTION SINGLEHDR
    10GiB /pmemfs0/

    # pmempool create --layout="mylayout" obj myautogrowpool.set
    #
    # ls -lh /pmemfs0/*.pmem
    -rw-rw-r--. 1 root root 8.0M Aug  5 05:08 /pmemfs0/000000.pmem

So the proposed -M, --max-size <size> option should default to AUTO if no value is provided, and we can then determine the available capacity within the FSDAX file system at the time of pool creation. If the file system fills up, we'll return ENOSPC when trying to grow.
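
A minimal sketch of that defaulting rule, using Python's standard `shutil.disk_usage` to query free space. The function name `resolve_max_size` is hypothetical, and real tooling would also have to decide how much headroom to leave for file system metadata.

```python
import shutil


def resolve_max_size(directory, requested=None):
    """Return the requested maximum size in bytes.

    When no value is given, fall back to the free space currently
    reported by the file system that holds `directory`.
    """
    if requested is not None:
        return requested
    return shutil.disk_usage(directory).free


# An explicit value wins; otherwise the answer depends on the file system.
print(resolve_max_size(".", requested=10 * 2**30))
print(resolve_max_size("."))
```

The value is only a snapshot taken at creation time, so a later grow can still fail with ENOSPC if other files consume the space.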

lplewa commented 5 years ago

@pbalcer How is this feature supposed to work?

As far as I know, autogrow is only supported for poolset files. Should we automatically create a poolset in the given directory?

If yes, how will I open this pool? Should I use the directory path as my pool, or use the poolset file created by pmempool (and how will a user find it)?

pbalcer commented 5 years ago

Not in the directory, but where the user specified. I think it should look like this:

pmempool create [<options>] [<type>] [<bsize>] <file>

Available options:
       -a, --autogrow [directories ...]

       Create an auto growing poolset

where the file is the poolset, so:

./pmempool create obj --autogrow /mnt/pmem --size=1GiB --max-size=10GiB poolset.file

marcinslusarz commented 5 years ago

pmempool doesn't create poolset files by design. Autogrowing poolsets are no different than other types of poolsets, so I don't see why we should implement this particular feature, but leave other types out.

So the fundamental question is - what problem are we trying to solve here? The number of characters you have to type to create an autogrowing poolset by hand would be similar to the number of characters needed to use this new feature, so it can't be just that...

pbalcer commented 5 years ago

I think the problem is the discoverability of this feature. Most users probably skip over the directories section of the poolset man page. Maybe we should add a command to create a poolset?

janekmi commented 1 year ago

This improvement is not considered vital at the moment. So, we do not have the resources to fulfil your request. Sorry.