haskell / core-libraries-committee

95 stars 16 forks source link

Cache file seekability in `Handle__` to avoid unnecessary stat #270

Closed MaciejWas closed 1 month ago

MaciejWas commented 4 months ago

In this proposal I suggest a solution to the issue #24220 (hSeek and hTell do unnecessary stat) raised by @nh2. It has been observed that these two functions perform unnecessary stat() system call and slow down IO-based code.

Considered solutions

Implementation

I believe the third solution is the most viable. I implemented it in Store file seekability in Handle__

Impact assessment

Since this is a breaking change (definition of a datatype is changed), an impact assessment is due. I will perform it in the next couple of days and update this issue.

nh2 commented 4 months ago

Just checking:

If a type named Handle__ in a module labelled Stability: internal is not an implementation detail that GHC can freely change, what would GHC need to do to be able have implementation details that it can freely change?

Bodigrim commented 4 months ago

@nh2 any part of the public interface of base needs a CLC proposal to be changed.

nh2 commented 4 months ago

@Bodigrim My question was what would count as not "part of the public interface".

What is the criterion for that? How would I have to name or place a data type in order for it to count as "internal implementation detail, free to change"?

In the future, we may want to create such places so that one can make changes to syscall implementation behaviour without a proposal.

hasufell commented 4 months ago

I'm not sure Handle is the right abstraction for it.


For reference, the implementation is this:

 data FD = FD {
  fdFD :: {-# UNPACK #-} !CInt,
#if defined(mingw32_HOST_OS)
  -- | On Windows, a socket file descriptor needs to be read and written
  -- using different functions (send/recv).
  fdIsSocket_ :: {-# UNPACK #-} !Int
#else
  -- | On Unix we need to know whether this 'FD' has @O_NONBLOCK@ set.
  -- If it has, then we can use more efficient routines (namely, unsafe FFI)
  -- to read/write to it. Otherwise safe FFI is used.
  --
  -- @O_NONBLOCK@ has no effect on regular files and block devices at the moment,
  -- thus this flag should be off for them. While reading from a file cannot
  -- block indefinitely (as opposed to reading from a socket or a pipe), it can block
  -- the entire runtime for a "brief" moment of time: you cannot read a file from
  -- a floppy drive or network share without delay.
  fdIsNonBlocking :: {-# UNPACK #-} !Int
#endif
 }

isSeekable :: FD -> IO Bool
isSeekable fd = do
  t <- devType fd
  return (t == RegularFile || t == RawDevice)

So an alternative not mentioned here is:

That seems like a more generic solution and everyone relying on device type doesn't need to use stat anymore.

I'm not sure if there are obscure edge cases where people construct the FD type manually (without opening one).

Bodigrim commented 4 months ago

@Bodigrim My question was what would count as not "part of the public interface".

What is the criterion for that? How would I have to name or place a data type in order for it to count as "internal implementation detail, free to change"?

Anything which is exposed from a package constitutes its public interface.

In the future, we may want to create such places so that one can make changes to syscall implementation behaviour without a proposal.

You can put stuff into ghc-internal package and do not re-export it from or use in base.

@nh2 please raise a separate issue, if you have further questions, let's not digress here.

nh2 commented 4 months ago

Intuitively I agree with @hasufell.

Unfortunately I couldn't find any design documentation that says where the boundary between Handle and FD should be.

A Handle carries a hadDevice :: dev, which is an instance of IODevice.

This looks like Handles were designed to contain, and abstract over, multiple types of backend devices, one of them being FD.

However, FD is indeed the only IODevice instance, so we don't get much info on what other devices could exist, to determine where seekability should live.


Also note that Handle docs say

A handle has at least the following properties:

... whether the object is seekable;

but those docs are wrong:

The word "seek" appears nowhere else on that page, and seekability is a property of the IODevice, not a Handle, as @hasufell showed above.


Overall I like @hasufell's suggested implementation. Here's another slight alternative:

Aside: Symlinks

hasufell commented 4 months ago

Store the devType in FD, but as Maybe IODeviceType. This allows constructing FD manually from an int FD we got from the system, without knowing anything about it.

Yes, I thought about that. We could then rename the constructor and provide a bidirectional pattern synonym, similar to how e.g. bytestring tries to keep the internal API "stable". The compat synonym would then fill in Nothing for the device type.

But it seemed a bit overkill. Given that you already have to use CPP to pattern match/construct, I think it is fairly obvious that these are low level internals.

My stance these days is that:

But YMMV. I think using Maybe here is the most backwards compatible way.

MaciejWas commented 4 months ago

Thank you for sharing your thoughts!

This looks like Handles were designed to contain, and abstract over, multiple types of backend devices, one of them being FD.

This is also why I focused on Handle (and therefore Handle__).

While caching DeviceType in FD would indeed solve the issue, I don't think this is the right approach since there is nothing wrong with FD. The problem is the unnecessary check performed by hSeek which operates on a Handle.

Seeking is implemented for IODevice and the only instance is FD, that also means people using plain FD can't benefit from this

This proposal (and the issue as well, if I understood it correctly) is not about caching the device type so that we can avoid a stat syscall. This proposal is about not performing stat every time before performing seek, which is a behavior of Handle. FD doesn't have this problem.

I think it makes perfect sense that functions defined in instance IODevice perform a syscall or other IO operation. However, I think it does not make sense for hSeek to do unnecessary checks.

phadej commented 4 months ago

@nh2

In the future, we may want to create such places so that one can make changes to syscall implementation behaviour without a proposal.

In relation to this proposal. Handle__, IODevice are all GHC internals. They should not be exposed through base (anymore). (whole GHC.IO.Handle can be considered ghc-internal only. Most people should be fine with System.IO worth of Handle abstractions).

phadej commented 4 months ago

However, FD is indeed the only IODevice instance, so we don't get much info on what other devices could exist, to determine where seekability should live.

Not true. On windows there are

-- -----------------------------------------------------------------------------
-- The Windows IO device implementation

-- | @since base-4.11.0.0
instance GHC.IO.Device.IODevice (Io NativeHandle) where

-- | @since base-4.11.0.0
instance GHC.IO.Device.IODevice (Io ConsoleHandle) where

which wrap Windows own HANDLE (i.e. not POSIX file descriptor)

Bodigrim commented 4 months ago
  • step 1: create a proposal to remove an "internal" module from base; people ought to use ghc-internal. (This will take some time to execute with all the deprecation periods. EDIT: CLC should correct me, but there's also a maybe considerable cost of patch writing requirement: migrating whole Stackage to use ghc-internal - providing some patches).

Non-experimental packages are actively discouraged to use ghc-internal, there are no stability guarantees whatsoever.

CLC retains transitive control over parts of ghc-internal used by base, even if not re-exported explicitly.

But please let's not digress.

hasufell commented 4 months ago

This proposal (and the issue as well, if I understood it correctly) is not about caching the device type so that we can avoid a stat syscall. This proposal is about not performing stat every time before performing seek, which is a behavior of Handle. FD doesn't have this problem.

I'm a bit confused. The GHC issue you mentioned specifically talks about linux and stat.

What behavior are you expecting on windows? Why do you touch other platforms?

phadej commented 4 months ago

Non-experimental packages are actively discouraged to use ghc-internal

Yes, so the needed functionality, once identified, can be re-exported from ghc-experimental, so base can stay away from GHC specifics.

Bodigrim commented 4 months ago

@MaciejWas is there a consensus about next steps? Are you looking for further feedback? (Sorry, at the moment I don't have time to dive into details above, just asking about the status)

MaciejWas commented 4 months ago

Unfortunately the status is mostly as it was when I opened this proposal. I'm still committed to resolving it but I haven't had much time lately, unfortunately.

I still need to process the alternate approach proposed by @hasufell. It still seems to me that it is a much more complex change which only resolves a subset of the issue, since FD is not the only instance of an IODevice.

Bodigrim commented 3 months ago

@MaciejWas I very much appreciate that this is a difficult issue to tackle, many thanks for working on it. Nevertheless we strive to keep the list of open proposals actionable, so unless you make progress one way or another (e. g., it's your right to request a vote on the MR as is), I'll close as abandoned by the end of August. This is not a sign of malice, we can reopen the proposal once there are resources to work on it again.

Bodigrim commented 1 month ago

Closing as abandoned, feel free to reopen when there are resources to continue.