hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License
162 stars 64 forks source link

[Request] add support for project id (i.e. lsattr -p) #553

Open markmoe19 opened 10 months ago

markmoe19 commented 10 months ago

dwalk supports group and gid. Could it also support project and project id? See "/usr/bin/lsattr -p" for project id linux attribute on a file. We make use of gid for access and project id is often the same but sometimes different and used more for quota control. Thank you.

adammoody commented 9 months ago

I haven't used project ids much myself.

Do you know if there is a C API to access that info?

Are those store as extended attributes maybe, and if so, do you know the attribute name?

As with the dwalk formatting change, we could hack something up to get you a quick solution here. If there is a C API, we could likely just call that at the point where the text lines are formatted. If there is not a C API, we could try to invoke and parse the lsattr command through a popen() call instead, though I don't know how well popen() calls are supported in MPI environments (forking procs sometimes causes problems).

adammoody commented 9 months ago

For reference, I see the lsattr man page points to sourceforge

lsattr is part of the e2fsprogs package and is available from http://e2fsprogs.sourceforge.net.

which then points to this git repo:

https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/tree/lib

Looks like this is the code that reads the file project id:

https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/tree/lib/e2p/fgetproject.c#n42

the fsxattr struct and FS_IOC_FSGETXATTR constant are defined here:

https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/tree/lib/e2p/project.h

adammoody commented 9 months ago

Another reference to project id info on Lustre: https://jira.whamcloud.com/browse/LU-4017

@ofaaland , do you know offhand if there is an API to read a file project id (lsattr -p) in Lustre?

adilger commented 9 months ago

@adammoody there is an ioctl(FS_IOC_FSGETXATTR) interface to get the projid of a file/directory, and similarly ioctl(FS_IOC_FSSETXATTR) to set the projid on files/dirs and FS_FL_PROJINHERIT flag on directories.

This same ioctl is implemented for Lustre, XFS (original), and other filesystems that support project IDs (possibly ZFS also). See, for example lustre/utils/lfs_project.c::project_get_fsxattr() in the Lustre tree. The struct and constants are already declared in the Linux kernel <Linux/fs.h> header.

I definitely would not recommend to do an lsattr system call to get this information, as that is a ton of overhead, and makes the code dependent on the installation of e2fsprogs and the exact syntax and output of that tool. While it is unlikely to change, that adds needless complexity.

adammoody commented 9 months ago

Thank you, @adilger ! That's perfect.

@markmoe19 , for a quick fix, you could hack this in to the text formatting function that you modified in https://github.com/hpc/mpifileutils/issues/555.

Here is some example code I whipped up based on the tips above that I think will work:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <stdint.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#include <sys/ioctl.h>
#include <linux/fs.h>

static int get_projid(const char* path, uint32_t* id)
{
  int ret = 0;
  int fd = open(path, O_RDONLY);
  if (fd >= 0) {
    struct fsxattr attr;
    int rc = ioctl(fd, FS_IOC_FSGETXATTR, &attr);
    if (rc == 0) {
      *id = (uint32_t) attr.fsx_projid;
    } else {
      printf("ioctl failed %s %s (errno=%d)\n", path, strerror(errno), errno);
      ret = -1;
    }
    close(fd);
  } else {
    printf("open failed %s %s (errno=%d)\n", path, strerror(errno), errno);
    ret = -1;
  }
  return ret;
}

int main(int argc, char* argv[])
{
  char path[] = "/p/lustre/user123/foo.dat";

  uint32_t id;
  int rc = get_projid(path, &id);
  if (rc == 0) {
    printf("%s %llu\n", path, (unsigned long long)id);
  }

  return 0;
}

You can add the definition for get_projid as above to src/common/mfu_flist_io.c and then call get_projid(file, &id) for each file within print_file_text().

That will execute twice for each file, once to fetch the value when computing the number of characters needed, and once more to fetch the value for actual writing to the file. That obviously could be improved by caching the value, but at least it would work to just do it twice.

Since this is an uint32_t type, I suppose it'd be best to print with "%"PRIu32.

adilger commented 9 months ago

Opening and closing the file twice is not free, even in newer Lustre releases that have automatic open cache on the client. Ideally, you would open an fd on the path once at the starts, and pass the fd to the various functions. That also avoids TOCTOU race when doing pathname resolution multiple times.

adammoody commented 9 months ago

Reducing the open calls is a good point, especially if you are dealing with many files, @markmoe19 . If that's a problem, the most immediate improvement would be to pull the project id lookup up into mfu_flist_write_text() for all files and then pass those project id values into print_file_text(). I'd go ahead and try the double-open hack first at least to check that it does what you need. Then we can optimize that if needed.

adammoody commented 9 months ago

In case you want to jump straight to the single-open version, here is a patch:

diff --git a/src/common/mfu_flist_io.c b/src/common/mfu_flist_io.c
index 0b0a2e5..7acb60d 100644
--- a/src/common/mfu_flist_io.c
+++ b/src/common/mfu_flist_io.c
@@ -22,6 +22,10 @@
 #include <errno.h>
 #include <string.h>

+/* to read file project id via ioctl */
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+
 #include "dtcmp.h"
 #include "mfu.h"
 #include "mfu_flist_internal.h"
@@ -1591,9 +1595,29 @@ void mfu_flist_write_cache(
     return;
 }

+static int get_projid(const char* path, uint32_t* id)
+{
+    int ret = -1;
+    int fd = open(path, O_RDONLY);
+    if (fd >= 0) {
+        struct fsxattr attr;
+        int rc = ioctl(fd, FS_IOC_FSGETXATTR, &attr);
+        if (rc == 0) {
+            *id = (uint32_t) attr.fsx_projid;
+            ret = 0;
+        } else {
+            MFU_LOG(MFU_LOG_ERR, "ioctl failed: `%s' errno=%d (%s)", path, errno, strerror(errno));
+        }
+        close(fd);
+    } else {
+        MFU_LOG(MFU_LOG_ERR, "open failed: `%s' errno=%d (%s)", path, errno, strerror(errno));
+    }
+    return ret;
+}
+
 /* TODO: move this somewhere or modify existing print_file */
 /* print information about a file given the index and rank (used in print_files) */
-static size_t print_file_text(mfu_flist flist, uint64_t idx, char* buffer, size_t bufsize)
+static size_t print_file_text(mfu_flist flist, uint64_t idx, const uint32_t* projids, char* buffer, size_t bufsize)
 {
     size_t numbytes = 0;

@@ -1640,9 +1664,11 @@ static size_t print_file_text(mfu_flist flist, uint64_t idx, char* buffer, size_
         const char* size_units;
         mfu_format_bytes(size, &size_tmp, &size_units);

-        numbytes = snprintf(buffer, bufsize, "%s %s %s %7.3f %3s %s %s\n",
+        uint32_t projid = projids[idx];
+
+        numbytes = snprintf(buffer, bufsize, "%s %s %s %7.3f %3s %s %lu %s\n",
             mode_format, username, groupname,
-            size_tmp, size_units, modify_s, file
+            size_tmp, size_units, modify_s, (unsigned long)projid, file
         );
     }
     else {
@@ -1690,12 +1716,23 @@ void mfu_flist_write_text(
         MFU_LOG(MFU_LOG_INFO, "Writing to output file: %s", name);
     }

-    /* compute size of buffer needed to hold all data */
-    size_t bufsize = 0;
     uint64_t idx;
     uint64_t size = mfu_flist_size(flist);
+
+    /* lookup file project id values */
+    uint32_t* projids = (uint32_t*) MFU_MALLOC(size * sizeof(uint32_t));
+    for (idx = 0; idx < size; idx++) {
+        const char* file = mfu_flist_file_get_name(flist, idx);
+        uint32_t projid;
+        if (get_projid(file, &projid) == 0) {
+            projids[idx] = projid;
+        }
+    }
+
+    /* compute size of buffer needed to hold all data */
+    size_t bufsize = 0;
     for (idx = 0; idx < size; idx++) {
-        size_t count = print_file_text(flist, idx, NULL, 0);
+        size_t count = print_file_text(flist, idx, projids, NULL, 0);
         bufsize += count + 1;
     }

@@ -1706,7 +1743,7 @@ void mfu_flist_write_text(
     char* ptr = buf;
     size_t total = 0;
     for (idx = 0; idx < size; idx++) {
-        size_t count = print_file_text(flist, idx, ptr, bufsize - total);
+        size_t count = print_file_text(flist, idx, projids, ptr, bufsize - total);
         total += count;
         ptr += count;
     }
@@ -1811,6 +1848,9 @@ void mfu_flist_write_text(
     /* free buffer */
     mfu_free(&buf);

+    /* free buffer of project ids */
+    mfu_free(&projids);
+
     /* end timer */
     double end_write = MPI_Wtime();

I can push up a branch if you'd like.

This code doesn't properly handle the case when get_projid() returns an error. It should at least print an error message to stdout, but it will use an undefined value as the file's project id. To better handle that error case, we should also be tracking whether each entry in the projids array is actually valid.

Perhaps change this to use a signed datatype and then use -1 to represent an invalid project id? Or otherwise use a second array or an array of structs or something to denote whether the corresponding entry is valid...

Also, when the project id has not been explicitly set (or inherited) on a file, does the project id default to 0? Anyone know?

adammoody commented 9 months ago

Went ahead and pushed this a branch to make it easier to try:

https://github.com/hpc/mpifileutils/tree/projid

and here are the actual changes:

https://github.com/hpc/mpifileutils/compare/main...projid

markmoe19 commented 9 months ago

That works! I get a lot of projid='0', but project id is not set for all the files so that makes sense. I do get 10001, 10002, and 20034 as some other example project-ids.

markmoe19 commented 9 months ago

questions:

1) what happens when the .mfu file doesn't contain the projid data? Or is that not part of the .mfu file yet? Since projid might be a new addition to the .mfu file, could it just be assumed to be value "0" for any .mfu files that don't contain the project id?

2) sometimes I get an error like this:

[2023-10-09T12:45:00] [44] [/project/selene-admin/mpifileutils_debug/mpifileutils-v0.11.1/mpifileutils/src/common/mfu_flist_io.c:1613] ERROR: ioctl failed: `/lustre/fsw/.../system/rc.service' errno=25 (Inappropriate ioctl for device)

I noticed that in those cases, "lsattr -p" also gives an error like this:

root# lsattr -p /lustre/fsw/.../system/rc.service lsattr: Operation not supported While reading flags on /lustre/fsw/.../system/rc.service root#

Could that error just be ignored and also end up with project id value of '0'?

Thanks,

adammoody commented 9 months ago

Right, we don't have a field for the project id within the internal flist structure that gets stored to the binary .mfu file. This means that it's not recording the project id value within the binary file.

The above hack can still work when reading a binary .mfu file as input, since it executes the project id lookup at the point it writes the text version of the list. However, you'd need to run from a system that has access to the file system(s) that were walked when generating the original .mfu file.

Adding the project id to the flist element would take some effort, because that structure needs to be updated in multiple places throughout the code base. It's also a bit disruptive since it forces a change to the .mfu file format.

As a future feature, I'd like to see whether we can enable users to attach arbitrary data to each file element. This data would need to be serializable to be stored to a binary .mfu file and/or sent between MPI processes. One could then use this user-defined value to store the project id, for example, or any other bit of additional data. However, that similarly requires some extended development effort.

Regarding the ioctl error, yes, that could be changed to use 0 with a one line fix.

projids[idx] = 0; // <--- initialize to 0, stays as 0 if get_projid returns an error
if (get_projid(file, &projid) == 0) {
    projids[idx] = projid;
}

However, since 0 seems to be the default project id for many files, I thought it might be useful to use another value to represent those error cases. For example, if we change the datatype from uint32_t to int32_t, we could initialize to -1 instead of 0. However, that assumes that one doesn't use project ids above what an int32_t can represent (int32 overflow).

For that matter, we could leave the type as uint32_t and use the max uint32_t value assuming that one is not using that as a valid project id.

To really handle that cleanly, it might require using a second field to indicate whether the corresponding project id is valid.

It's your call on which value you want to initialize here.

adilger commented 9 months ago

The projid needs to be a uint32 field, but the -1 value is invalid (there is a named constant for this, like PROJID_INVALID or something).

I suspect from the name that the .../system/rc.service file is a symlink to a file on a filesystem without projid support?

I think it would be most useful to not store a projid for such cases, rather than trying to store an invalid projid for the file.

markmoe19 commented 9 months ago

Right, it was a link to a file on an NFS file system. Probably storing nothing for projid in that case is the best response, I agree.

markmoe19 commented 9 months ago

I'm a bit torn on how to handle projid when using .mfu file. We have kept some .mfu files around as it provides a sort of snap-shot of the metadata that can be queried without the filesystem around (or even if the filesystem is there, it is no longer the same as some time as passed). Maybe, until the .mfu file can store the projid, the only valid way to get projid in the text output file is if you are going directly from file system scan to text file (and not keeping a .mfu file)? The good news is that now I have a way to capture project-id for all the files quickly, thank you! :)