openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Direct IO #224

Closed: behlendorf closed this issue 6 years ago

behlendorf commented 13 years ago

The direct IO handlers have not yet been implemented. Supporting direct IO would have been a problem a few years back because of how ZFS copies everything into the ARC cache. However, ZFS recently gained a zero-copy interface which we may be able to leverage for direct IO support.

ghost commented 12 years ago

Hmm, why not do it this way: let O_DIRECT always succeed? Does it matter that ZFS copies everything into the ARC cache? Just fake it a bit toward the OS; it shouldn't hurt that much... oh, and that's just my crazy idea.

uejji commented 11 years ago

I'm unable to start mysqld with InnoDB databases living on a ZFS dataset. Is this related to this issue?

Using ppa:zfs-native/stable on Precise with the Quantal kernel.

Here is the system and dataset info, followed by a log excerpt from /var/log/syslog:

root@HumanFish:/# uname -a
Linux HumanFish.net 3.5.0-17-generic #28-Ubuntu SMP Tue Oct 9 19:31:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

root@HumanFish:/# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.1 LTS
Release:        12.04
Codename:       precise

root@HumanFish:/# zfs get all zpool/mysql
NAME         PROPERTY              VALUE                 SOURCE
zpool/mysql  type                  filesystem            -
zpool/mysql  creation              Mon Oct 29 9:44 2012  -
zpool/mysql  used                  71M                   -
zpool/mysql  available             204G                  -
zpool/mysql  referenced            71M                   -
zpool/mysql  compressratio         3.38x                 -
zpool/mysql  mounted               yes                   -
zpool/mysql  quota                 none                  default
zpool/mysql  reservation           none                  default
zpool/mysql  recordsize            128K                  default
zpool/mysql  mountpoint            /var/lib/mysql        local
zpool/mysql  sharenfs              off                   default
zpool/mysql  checksum              on                    default
zpool/mysql  compression           lzjb                  local
zpool/mysql  atime                 on                    default
zpool/mysql  devices               on                    default
zpool/mysql  exec                  on                    default
zpool/mysql  setuid                on                    default
zpool/mysql  readonly              off                   default
zpool/mysql  zoned                 off                   default
zpool/mysql  snapdir               hidden                default
zpool/mysql  aclinherit            restricted            default
zpool/mysql  canmount              on                    default
zpool/mysql  xattr                 on                    default
zpool/mysql  copies                2                     local
zpool/mysql  version               5                     -
zpool/mysql  utf8only              off                   -
zpool/mysql  normalization         none                  -
zpool/mysql  casesensitivity       sensitive             -
zpool/mysql  vscan                 off                   default
zpool/mysql  nbmand                off                   default
zpool/mysql  sharesmb              off                   default
zpool/mysql  refquota              none                  default
zpool/mysql  refreservation        none                  default
zpool/mysql  primarycache          all                   default
zpool/mysql  secondarycache        all                   default
zpool/mysql  usedbysnapshots       0                     -
zpool/mysql  usedbydataset         71M                   -
zpool/mysql  usedbychildren        0                     -
zpool/mysql  usedbyrefreservation  0                     -
zpool/mysql  logbias               latency               default
zpool/mysql  dedup                 off                   default
zpool/mysql  mlslabel              none                  default
zpool/mysql  sync                  standard              default
zpool/mysql  refcompressratio      3.38x                 -
zpool/mysql  written               71M                   -

Oct 29 09:45:37 HumanFish mysqld_safe: Starting mysqld daemon with databases from /var/lib/mysql
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: The InnoDB memory heap is disabled
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Mutexes and rw_locks use GCC atomic builtins
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Compressed tables use zlib 1.2.3.4
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Using Linux native AIO
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Initializing buffer pool, size = 256.0M
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Completed initialization of buffer pool
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Failed to set O_DIRECT on file ./ibdata1: OPEN: Invalid argument, continuing anyway
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: O_DIRECT is known to result in 'Invalid argument' on Linux on tmpfs, see MySQL Bug#26662
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Failed to set O_DIRECT on file ./ibdata1: OPEN: Invalid argument, continuing anyway
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: O_DIRECT is known to result in 'Invalid argument' on Linux on tmpfs, see MySQL Bug#26662
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: highest supported file format is Barracuda.
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Operating system error number 22 in a file operation.
Oct 29 09:45:37 HumanFish mysqld: InnoDB: Error number 22 means 'Invalid argument'.
Oct 29 09:45:37 HumanFish mysqld: InnoDB: Some operating system error numbers are described at
Oct 29 09:45:37 HumanFish mysqld: InnoDB: http://dev.mysql.com/doc/refman/5.5/en/operating-system-error-codes.html
Oct 29 09:45:37 HumanFish mysqld: InnoDB: File name ./ib_logfile0
Oct 29 09:45:37 HumanFish mysqld: InnoDB: File operation call: 'aio write'.
Oct 29 09:45:37 HumanFish mysqld: InnoDB: Cannot continue operation.
Oct 29 09:45:37 HumanFish mysqld_safe: mysqld from pid file /var/run/mysqld/mysqld.pid ended
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: 0 processes alive and '/usr/bin/mysqladmin --defaults-file=/etc/mysql/debian.cnf ping' resulted in
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: #007/usr/bin/mysqladmin: connect to server at 'localhost' failed
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: error: 'Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)'
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: Check that mysqld is running and that the socket: '/var/run/mysqld/mysqld.sock' exists!

behlendorf commented 11 years ago

@uejji I'm no mysql expert, but this is more related to #223. We don't yet support AIO; most applications in this case fall back to the normal I/O syscalls.

uejji commented 11 years ago

@behlendorf I see. The errors about O_DIRECT in the log led me here through a Google search. I'll watch that issue in the meantime.

Thanks.

behlendorf commented 11 years ago

@uejji See http://forum.percona.com/index.php?t=msg&goto=7577&S=0d0bff59d914393490d494ffaa9205a5 for a workaround to the aio issue.

uejji commented 11 years ago

@behlendorf The innodb_use_native_aio option didn't exist by default in my.cnf, but adding it manually worked fine.
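
For anyone else who lands here, the change amounts to something like the following in my.cnf (a minimal sketch; the section name and file path may differ on your distribution):

[mysqld]
# Disable Linux native AIO so InnoDB falls back to synchronous I/O,
# which works on ZFS datasets that don't support AIO/O_DIRECT yet.
innodb_use_native_aio = 0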

Thanks for locating the workaround for me. I guess the eventual goal will be that it's no longer necessary.

pavel-odintsov commented 10 years ago

Any news about O_DIRECT support?

pquan commented 10 years ago

It can't be honored, as ZFS double-buffers. O_DIRECT makes little sense here anyway; O_SYNC is a better way.

pruiz commented 10 years ago

Well, then a flag which allows ignoring O_DIRECT requests (without failing) could be a plus in some situations.

I know this could be dangerous in some situations, but there are others where it is an acceptable trade-off; also, non-expert users could be notified by emitting some kind of warning when such a flag is set.

pruiz commented 10 years ago

Another option would be providing a flag with three options (ignore, dsync, sync), which would mean: ignore O_DIRECT entirely, treat it as O_DSYNC, or treat it as O_SYNC, respectively.

Greets

behlendorf commented 10 years ago

Making the behavior of O_DIRECT configurable with a property sounds like it may be a reasonable approach. However, we should be careful not to muddle the meaning of O_DIRECT.

The O_DIRECT flag only indicates that kernel caching should be bypassed: data should be transferred directly between the user space process and the physical device. Unlike O_SYNC, it makes no guarantees about the durability of the data on disk.

Given those requirements I could see a property which allows the following behavior:

pruiz commented 10 years ago

That sounds pretty neat, and would allow some scenarios not supported right now, even with their own tradeoffs. ;)

maci0 commented 10 years ago

Newer versions of virt-manager want to use cache=none as the default for QEMU virtual disk images, which in turn means QEMU tries to use O_DIRECT and libvirt throws errors. The error messages will confuse most users who aren't aware that ZoL doesn't support O_DIRECT yet. +1 for any kind of solution.

mgancarzdsi commented 9 years ago

:+1: For this. I've been experimenting with oVirt as a virtualization manager and I'd love to use ZFS for its data stores, but as far as I understand, I can't add it as a local data store due to this issue.

maci0 commented 9 years ago

the solution in illumos kvm is rather crude too: https://github.com/joyent/illumos-kvm-cmd/blob/master/block/raw-posix.c#L97

pavel-odintsov commented 9 years ago

It's still better than silently ignoring O_DIRECT.

behlendorf commented 9 years ago

After investigating what it will take to support this, I'm bumping this functionality from the 0.6.4 tag. To add this functionality we must implement the address_space_operations.direct_IO callback for the ZPL. This will allow us to pin in memory the pages passed by the application for I/O; the I/O can then be performed directly against those pages. This will also require adding an interface to the DMU which accepts a struct iov_iter. While this work isn't particularly difficult, it's also not critical functionality and we don't want it to hold up the next release.
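
For the curious, a rough, untested sketch of the shape this callback could take (it assumes a recent kernel's direct_IO signature, and the DMU call is only a commented-out placeholder since that interface does not exist yet):

/*
 * Hypothetical sketch only. zfs_dmu_rw_iter() below is an imaginary
 * placeholder for the new DMU interface that accepts a struct iov_iter.
 */
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

static ssize_t
zpl_direct_IO_sketch(struct kiocb *kiocb, struct iov_iter *iter)
{
        struct page *pages[16];
        size_t start;
        ssize_t bytes;
        int i, npages;

        /* Pin a batch of the caller's pages so they stay resident during the I/O. */
        bytes = iov_iter_get_pages(iter, pages, 16 * PAGE_SIZE, 16, &start);
        if (bytes <= 0)
                return (bytes);
        npages = DIV_ROUND_UP(start + bytes, PAGE_SIZE);

        /*
         * TODO: perform the I/O directly against the pinned pages through
         * a DMU interface that accepts the iov_iter, e.g.:
         * bytes = zfs_dmu_rw_iter(kiocb, iter, pages, npages, start, bytes);
         */

        /* Drop the page references taken by iov_iter_get_pages(). */
        for (i = 0; i < npages; i++)
                put_page(pages[i]);

        return (bytes);
}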

ryao commented 9 years ago

@behlendorf We can not just pin the user pages. We also need to mark them CoW so that userland cannot modify them as they are being read. Otherwise, we risk writing incorrect checksums. In the case of compression, userland modification of the pages while the compression algorithm is run would result in undefined behavior and might pose a security risk.

That said, I have a commit that implements O_DIRECT by mapping it to userspace here:

https://github.com/zfsonlinux/zfs/commit/a08c76a8ad63c28384ead72b53a3d7ef73f39357

It was written after a user asked for the patch and it is not meant to be merged, but the commit message has a discussion of what O_DIRECT actually means that I will reproduce below:

DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs be aligned to backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS because there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encounter another hurdle.
Specifically, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior, while for
checksums, this would imply we write incorrect checksums to disk. It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations. However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Mac OS X does not
implement O_DIRECT, but it does implement F_NOCACHE, which is similar
to #2 in that it prevents new data from being cached. AIX relaxes #3 by
only committing the file data to disk. Metadata updates required should
the operations make the file larger are asynchronous unless O_DSYNC is
specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need be
finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Bronek commented 9 years ago

@ryao thanks for the writeup. I use ZFS zvols as backing storage for qemu, and am also vaguely familiar with how databases perform IO. Mapping direct_IO to AIO is definitely a good first step, but it would be great if both 2 and 3 (eventually) received attention as well.

Regarding 2: CoW definitely seems like a good direction. I would also expect lower memory utilisation and possibly other performance gains from O_DIRECT if (and only if) compression is not enabled.

Regarding 3: that's an interesting one. One can rather trivially (although not cheaply) increase IO subsystem performance by attaching an NVMe PCIe-backed SLOG device; it would be great if the ZIL could be used (if configured so, an extra option would be needed) as the primary backing storage mapped to O_DIRECT rather than indirect logging. This would help preserve the benefits of a fast SLOG device, i.e. very low latency of synchronous writes, while at the same time guaranteeing data safety and low memory utilisation (the primary goals of O_DIRECT in the scenarios I am familiar with).

nigoroll commented 8 years ago

As a preliminary yet generic relief I have written an interposer to either map O_DIRECT to O_DSYNC or just ignore it. As most of the source infrastructure required was already there, I integrated it into https://code.uplex.de/liblongpath - ignoring, for the time being, that the main purpose of the liblongpath project was quite different. The relevant commit is https://code.uplex.de/liblongpath/liblongpath/commit/2e46a921ce2b6b1caa56d39cbd58be85c5988bd0 The commit message contains basic usage info, I have not (yet) added any other documentation for this feature.

ryao commented 8 years ago

@nigoroll Most Linux filesystem drivers, including ZoL, treat O_SYNC and O_DSYNC the same, so that will not make much difference here.

You can get the indirect logging on all I/Os (O_DIRECT or not) that I mentioned by setting logbias=throughput on the dataset.
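
For example (rpool/db is just a placeholder dataset name):

# Bias synchronous writes toward throughput (indirect logging) instead of latency.
zfs set logbias=throughput rpool/db
zfs get logbias rpool/db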

nigoroll commented 8 years ago

@ryao, I fail to understand how your comment relates to the interposer I have written. Its purpose is to provide relief where O_DIRECT cannot be avoided and open() calls returning with EINVAL break applications (which may even be closed source). Basically this implements what @pruiz suggested, but on the level of an interposer library.

ryao commented 8 years ago

@nigoroll I was thinking of this from the perspective of performance, where software using O_DIRECT almost always uses O_SYNC, so it does not improve things over the patch to ZoL to ignore O_DIRECT by mapping it to AIO. It makes more sense when thinking about software compatibility.

Thanks for writing that library.

azeemism commented 8 years ago

@ryao, what would you recommend as a best practice for now:

comment or remove the innodb_flush_method variable in /etc/mysql/my.cnf?

On MariaDB this would use fsync() to flush data and logs.

Or should O_DSYNC be used?

https://mariadb.com/kb/en/mariadb/xtradbinnodb-server-system-variables/#innodb_flush_method O_DSYNC - O_DSYNC is used to open and flush logs, and fsync() to flush the data files.

Values for this setting include:

O_DSYNC
O_Direct
fdatasync
O_DIRECT_NO_FSYNC

ryao commented 8 years ago

@azeemism fdatasync is the best option for MariaDB on ZFS right now. O_DSYNC is the equivalent of calling fdatasync after each and every write operation, while neither O_DIRECT nor O_DIRECT_NO_FSYNC should work on ZFS unless you patch it to implement the ->direct_IO VFS operation. Patching ZFS to add it would, at best, have no benefit in production over using fdatasync, and at worst would render MariaDB crash-unsafe.
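
In my.cnf terms that is roughly the following (a sketch; on versions that reject the explicit value, simply leave the variable unset, since fdatasync is the default):

[mysqld]
# fdatasync: flush data files with fsync()/fdatasync(), no O_DIRECT.
innodb_flush_method = fdatasync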

MarkGavalda commented 8 years ago

@azeemism @ryao MariaDB 10.1 throws "[ERROR] InnoDB: Unrecognized value fdatasync for innodb_flush_method" so no more fdatasync. Interestingly the documentation still mentions it: https://mariadb.com/kb/en/mariadb/xtradbinnodb-server-system-variables/#innodb_flush_method

rlaager commented 8 years ago

@ryao How much work would it take (on top of your commit earlier) to make O_DIRECT imply primarycache=metadata semantics?

FlorianHeigl commented 8 years ago

Can anyone comment on the first post in this issue? It looked into the double-buffering matter and hinted that it might not be a problem anymore after 2011. Like maci0 said, the other ways seem really crude, and I'd rather have it fail like it does now than have it mapped to fdatasync.

Application: "please don't buffer this at all, I'm trying to optimize here while keeping data safe" FS: "Sure, I'll not buffer this at all. I'll not slow you down and I'll not lie to you about when the data is on disk" FS turns around and says "yeah, we'll just flush when application does flushes, heh, it can't be worse than ext3, right?!"

nigoroll commented 8 years ago

@FlorianHeigl Many comments here relate to the initial note; just take the time to read them. Things are not that easy. If you want to understand why, I'd recommend @ryao's comment from 23 Jul 2015.

mcr-ksh commented 7 years ago

+1. libvirt won't run with those flags:

<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>

MyPod-zz commented 7 years ago

@mcr-ksh Although that combination of cache and io is optimal, you can still get libvirt to use volumes hosted on ZFS by either not defining cache and io, or by selecting io='threads' and an appropriate cache policy.
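
For example, a disk definition along these lines avoids O_DIRECT on the host entirely (the source path and target name are just placeholders):

<disk type='file' device='disk'>
  <!-- cache='writeback' plus io='threads' keeps qemu off O_DIRECT -->
  <driver name='qemu' type='raw' cache='writeback' io='threads'/>
  <source file='/tank/vms/guest01.raw'/>
  <target dev='vda' bus='virtio'/>
</disk>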

lnxbil commented 7 years ago

It would be really great if we could get this simple-sounding fix (creating a shim to AIO) from @ryao:

Given the lack of standardization and ZFS' heritage, one solution to provide compatibility with userland processes that expect DirectIO is to treat DirectIO as a hint that we ignore. This can be done trivially by implementing a shim that maps aops->direct_IO to AIO. There is also already code in ZoL for bypassing the page cache when O_DIRECT is specified, but it has been inert until now.

If this is only a simple hack that does not imply the drawbacks he mentioned in 1-4, could it be implemented and e.g. activated by a filesystem property of some sort?

Vringe commented 7 years ago

I also have problems adding the cache=none parameter to libvirt XML. Please add direct IO support.

Bronek commented 7 years ago

@kpande I can confirm it works. Performance is not great, though, compared to cache=writeback io=threads. However, I thought that cache=directsync io=native is the one to use for direct IO?

Vringe commented 7 years ago

@kpande Yes, but I'm not using a ZVOL. It's just a raw file stored in the dataset.

@Bronek According to the docs, directsync is described as follows: "This mode causes qemu-kvm to interact with the disk image file or block device with both O_DSYNC and O_DIRECT semantics, where writes are reported as completed only when the data has been committed to the storage device, and when it is also desirable to bypass the host page cache. Like cache=writethrough, it is helpful to guests that do not send flushes when needed. It was the last cache mode added, completing the possible combinations of caching and direct access semantics."

I think you're right. Seems like it is very similar to writeback, except for the performance impact.

lnxbil commented 7 years ago

@Vringe Don't store raw files on ZFS as files. There is no benefit, and it is not as fast as it could be. A ZVOL is perfect for that: you can still snapshot a VM at any time, and you get a constant (vol)block size.

Vringe commented 7 years ago

@lnxbil Each VM has its own dataset, so I can easily create snapshots. During my performance tests, I found that ZVOLs don't really perform better than datasets, and I also had some really weird problems with ZVOLs. The environment is running well; I just want to use swap on the guest machines instead of relying on the host's cache.

lnxbil commented 7 years ago

@Vringe Yet you cannot use TRIM, so thin provisioning is not as efficient. You can also get bad write amplification with a dataset if you have not set the recordsize properly or if the access pattern from your emulation layer differs; a zvol will always use its volblocksize.
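
For example, creating a sparse zvol with a volblocksize matched to the guest workload looks like this (names and sizes are placeholders):

# Sparse 32G zvol with a 16K volume block size.
zfs create -s -V 32G -o volblocksize=16k tank/vm/guest01
# The guest disk then appears as /dev/zvol/tank/vm/guest01.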

livelace commented 7 years ago

That is why I have not been able to use oVirt with the Local Storage configuration for years: it requires direct IO.

livelace commented 7 years ago

use zvol, O_DIRECT works fine there.

Do you use oVirt?

jumbi77 commented 7 years ago

Any news on real O_DIRECT support now that the scatter-gather list work got merged? Just curious, thanks.

cwedgwood commented 7 years ago

@jumbi77 I think we have to consider what O_DIRECT actually means for ZFS ZPL.

Historically, for other filesystems it more or less meant that data was transferred from the disks directly into userspace buffers without any intermediate buffering, but with newer storage stacks that mapping isn't possible (this is true for many non-ZFS cases too).

@behlendorf I suggest O_DIRECT really means O_SYNC with "as little buffering as possible".

rlaager commented 7 years ago

I think it means "as little buffering as possible". If I want O_SYNC, I'll say O_SYNC (instead of or in addition to O_DIRECT). The open(2) man page on Linux explicitly says that it doesn't guarantee the same semantics as O_SYNC and you need to pass both if you want both.

pk1234 commented 6 years ago

We are using ZFS on all of our production machines (mostly Solaris and Linux) and our backup strategy is based upon ZFS snapshots.

So far we have only used Oracle databases on Solaris machines, and Oracle runs just fine on ZFS. There's even a Sun whitepaper with information about the optimal ZFS configuration for Oracle databases.

Unfortunately Oracle fails under Linux if archive log files are stored on a ZFS filesystem, and I would not mention this here if direct IO weren't the culprit. Here's a single line from the strace output:

open("/var/oracle/diag/rdbms/b1/B1/metadata/ADR_INTERNAL.mif", O_RDONLY|O_DIRECT) = -1 EINVAL (Invalid argument)

I'm aware of the following 3 possible solutions: ignoring O_DIRECT within ZFS, adding direct IO support to ZFS, or ignoring O_DIRECT outside of ZFS (e.g. via an LD_PRELOAD wrapper). I tried each in turn.

I tried to ignore O_DIRECT within ZFS first, and my idea was to find the line of source where ZFS refuses to open a file with the O_DIRECT flag. But searching the ZFS source code for O_DIRECT turned up almost nothing. It seems the ZFS code does not reject open() calls with O_DIRECT itself; rather, the VFS layer knows that ZFS lacks direct IO support and therefore rejects such open() calls.

Is that correct? Can I patch my kernel so that the VFS ignores O_DIRECT for ZFS filesystems?
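
For what it's worth, the check I am talking about appears to live in the generic open path; in a 4.x kernel it looks roughly like this (quoted from memory, so treat it as approximate):

/* fs/open.c, do_dentry_open() -- approximate excerpt */
if (f->f_flags & O_DIRECT) {
        if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
                return -EINVAL;
}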

So I tried to add direct IO support to ZFS. Have a look at this patch from 2015. It does not work with current ZFS, and I doubt that it worked with former versions (there are unbalanced #if/#endif lines, and int rw = iov_iter_rw(iter); uses the undeclared variable iter).

But the idea should work: add a zpl_direct_IO() routine to zpl_file.c and register it in the zpl_address_space_operations structure.

I tried that, but I did not bother with the config macros that detect what kind of VFS API is in use with my 4.4.113 kernel. Here's what I added to zpl_file.c:

static ssize_t
zpl_direct_IO(struct kiocb *kiocb, struct iov_iter *from, loff_t offset)
{
        if (iov_iter_rw(from) == WRITE) {
                return (zpl_iter_write_common(kiocb, from->iov, from->nr_segs, kiocb->ki_nbytes));
        }
        return (zpl_iter_read_common(kiocb, from->iov, from->nr_segs, kiocb->ki_nbytes));
}

const struct address_space_operations zpl_address_space_operations = {
        .readpages      = zpl_readpages,
        .readpage       = zpl_readpage,
        .writepage      = zpl_writepage,
        .writepages     = zpl_writepages,
        .direct_IO      = zpl_direct_IO
};

This does not work. The call to zpl_iter_write_common() was copied from the 2015 patch, but zpl_iter_write_common() takes 6 parameters now. To make this work I need some expert advice.

How do I add the missing parameters to zpl_iter_write_common() and zpl_iter_read_common()? Does that make sense at all?

Since adding direct IO support to my kernel with the above hack failed, I decided to ignore O_DIRECT outside of ZFS. LD_PRELOAD is your friend if you want to replace a libc call with something else. In my case I created the library libOpenWithoutDirectIO.so from the following source code:

/* compile this with
   gcc -Wall -fPIC -shared -o libOpenWithoutDirectIO.so thisfile.c
   and use
   export LD_PRELOAD=/path/to/libOpenWithoutDirectIO.so
*/
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>

int open(const char *path, int flags, ...){
        static int (*func)(const char *path, int flags, ...);

        if(!func) func=dlsym(RTLD_NEXT,"open");
        flags &= ~O_DIRECT;
        if(flags & O_CREAT){
                va_list a; mode_t mode;
                va_start(a,flags); mode=va_arg(a,mode_t); va_end(a);
                return func(path, flags, mode);
        }
        return func(path, flags);
}

int open64(const char *path, int flags, ...){
        static int (*func)(const char *path, int flags, ...);

        if(!func) func=dlsym(RTLD_NEXT,"open64");
        flags &= ~O_DIRECT;
        if(flags & O_CREAT){
                va_list a; mode_t mode;
                va_start(a,flags); mode=va_arg(a,mode_t); va_end(a);
                return func(path, flags, mode);
        }
        return func(path, flags);
}

This removes O_DIRECT from every open()/open64() call. I don't like this, because O_DIRECT should be removed if and only if the path points to a file located on a ZFS filesystem; with the above hack it is stripped unconditionally.

But Oracle is running now on top of ZFS.

Any comments?

Peter

nigoroll commented 6 years ago

@pk1234 see https://github.com/zfsonlinux/zfs/issues/224#issuecomment-160126586 https://code.uplex.de/liblongpath/liblongpath/commit/2e46a921ce2b6b1caa56d39cbd58be85c5988bd0

Here we try the original flags and only remove O_DIRECT if they fail.
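
In rough C (this is only a sketch of the idea, not the actual liblongpath code):

/* Sketch: try the caller's flags first, retry without O_DIRECT on EINVAL. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stdarg.h>

int open(const char *path, int flags, ...)
{
        static int (*real_open)(const char *, int, ...);
        mode_t mode = 0;
        int fd;

        if (!real_open)
                real_open = dlsym(RTLD_NEXT, "open");

        if (flags & O_CREAT) {
                va_list ap;
                va_start(ap, flags);
                mode = va_arg(ap, mode_t);
                va_end(ap);
        }

        /* First attempt with the original flags. */
        fd = real_open(path, flags, mode);

        /* Only if the kernel rejects O_DIRECT do we strip it and retry. */
        if (fd < 0 && errno == EINVAL && (flags & O_DIRECT))
                fd = real_open(path, flags & ~O_DIRECT, mode);

        return fd;
}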

lnxbil commented 6 years ago

Hi Peter,

This removes O_DIRECT from every open()/open64() call. I don't like this, because O_DIRECT should be removed if and only if the path points to a file located on a ZFS filesystem; with the above hack it is stripped unconditionally.

I tried a similar thing over a year ago, and my listener was not able to work in XML mode, only plaintext mode. I could not patch that, yet the database was working. I have to search for the writeup of my work at home and post it here. It's funny to see that we both tried to solve the problem similarly. I also went one step further and tried to patch glibc to apply this everywhere, transparently, but that did not work in the limited time I had. I stopped my investigation after spending over 10 hours on the topic.

nigoroll commented 6 years ago

@lnxbil I can tell you that my interposer does the job.

au-phiware commented 6 years ago

@behlendorf acf0ade seems unrelated to Direct IO... did you mean to close this one?

behlendorf commented 6 years ago

@au-phiware Whoops, no I did not. It was accidentally caused by merging the SPL and its history into the ZFS repository. See PR #7556; we'll probably have a few more of these.

pkramme commented 6 years ago

What is the progress on this? I tried installing oVirt, but since oVirt needs direct IO, the installation failed.

My workaround is to use a ZVOL with XFS in it.