hhvm / hsl-experimental

Experimental features for the Hack Standard Library
MIT License
23 stars 10 forks source link

Add Filesystem\unlink(), and similar operations that aren't related to handles #29

Open fredemmott opened 5 years ago

fredemmott commented 5 years ago

Current thinking:

azjezz commented 5 years ago

File\ will be reserved for actual file handle/fd operations

agree

Filesystem\ is a potential place for things that operate on the filesystem but do not involve a specific fd, or require that they are operating on a plain file - e.g. Filesystem\rm_rf (not actually calling a function that though...)

agree

It is probably necessary for some behaviors to expose the posix behavior thinly, e.g. OS\readdir returning dirent as a shape

how would the dirent shape look like ? match the linux c structure ? i.e :

struct dirent {
    ino_t          d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[256];
};

however, if we do that, where do convenience functions go? OS\readdir (POSIX) and OS\readdir_recursive? (convenience function) OS\readdir and Filesystem\readdir_recursive?

OS\

should Filesystem\ functions return types return the OS\ thin-wrappers-over-posix or an abstraction?

an abstraction, let's keep OS\ for low level things and the you don't need when developing an average website ( which is what hack is mostly used for, even if that's not the goal of hack )

see : https://github.com/nuxed/filesystem

say they return an abstraction, with Filesystem\readdir returning a FileInfo object shared with Filesystem\stat - what does FileInfo::isExecutable() do if it was created by a readdir, as the dirent does not include that information?

silent stat() syscall? throw an exception? have separate DirFileInfo vs FileInfo with a common base interface? Leaky abstraction

silent stat() syscall. no need to leak implementation details. if it is possible to determine if the node is executable, we do so.

probably best to just implement OS\ as a thin posix wrapper and leave Filesystem\ for third-party projects; review what third-party projects build and reconsider later.

agree, IMHO i don't think Filesystem\ is urgent, OS\ is more important at the moment ( i would also like to stop using php stdlib in https://github.com/nuxed/filesystem ) .

jjergus commented 5 years ago

I don't like the distinction between File and Filesystem, why not File\unlink? "Everything in File must work with File\Handles" feels like an unnecessary arbitrary restriction to me.

silent stat() syscall

I don't know -- in general I don't think we should hide potentially expensive operations that can fail inside what looks like a getter method, but I don't know if there's a better option here. It might be OK depending on the exact API (e.g. if it clearly doesn't look like a getter method).

If possible, I would maybe try to do the opposite: Have a function that clearly looks like an expensive operation that can fail, but sometimes it returns a value without actually doing the operation if the value is somehow already available.


I am starting to be a bit confused about the namespaces, OS vs IO vs File/TCP/etc. I'm not sure if it makes sense to simplify that or if that's necessary, but some random ideas:

Would it be crazy to merge OS and IO? We already settled on (I think) doing that for exceptions. Are there any OS operations that could not be reasonably described as IO? If not, then IO might be the preferable name, it seems more common than OS.

As a user of the language, I would really rather not end up in a situation where there are 4 different namespaces in which a function like "unlink" could reasonably exist (OS, IO, Filesystem, File).

fredemmott commented 5 years ago

how would the dirent shape look like ? match the linux c structure ?

Well-typed common subset of (macos & linux)

however, if we do that, where do convenience functions go? OS\readdir (POSIX) and OS\readdir_recursive? (convenience function) OS\readdir and Filesystem\readdir_recursive?

That'd probably be the best for user expectations, but it does mean that we'd have non-OS-defined functions in OS\; not sure that's right

an abstraction, let's keep OS\ for low level things and the you don't need when developing an average website ( which is what hack is mostly used for, even if that's not the goal of hack )

While I agree with the conclusion, and while current usage must be covered adequately, we are very definitely targeting the long-term goals with this design, even if it ends up having some negative effects for current uses.

silent stat() syscall. no need to leak implementation details. if it is possible to determine if the node is executable, we do so.

Very strongly against this - it is necessary to leak this particular one as "I accidentally called stat a million times" is a reasonably common performance problem. It should be explicit.


I don't like the distinction between File and Filesystem, why not File\unlink? "Everything in File must work with File\Handles" feels like an unnecessary arbitrary restriction to me.

File\chmod($directory) feels strange to me; I'm also worried that we might end up needing to put File back into classnames.


Would it be crazy to merge OS and IO? We already settled on (I think) doing that for exceptions.

Not crazy - though I feel that in a way OS\ is more well-defined: thin wrappers. I'd expect IO functions in OS\ to be a thinner layer over FDs, and not to provide functions like writeAsync(), requiring exlpicit C/async instead.

Perhaps that idea of OS\ might be better under _Private or something.

Also, if OS\ will contain non-IO stuff, it would be good to keep the mess of IO\ interfaces out of there for readability.

Are there any OS operations that could not be reasonably described as IO? If not, then IO might be the preferable name, it seems more common than OS.

Python's OS module seems pretty much the same thing (low-level wrappers of POSIX stuff); some examples:

There's some debatable cases like getcwd, exec, fork, kill

As a user of the language, I would really rather not end up in a situation where there are 4 different namespaces in which a function like "unlink" could reasonably exist (OS, IO, Filesystem, File).

Definitely agree. Unless we merge IO\ and OS\, I think it'd be good to strongly discourage using the stuff in OS\ except where necessary - but recognize it is necessary for completeness.


Quoting myself here:

but recognize it is necessary for completeness.

This isn't necessarily true if we ship C FFI - but the HSL won't be able to use that, and we should expect other deployments to disable it in libraries out of security concerns

jjergus commented 5 years ago

Perhaps that idea of OS\ might be better under _Private or something.

That sounds reasonable to me. What would be the disadvantages? Are there legitimate use-cases for people using low-level functions for anything where we provide a higher-level abstraction? (e.g. performance?)

File\chmod($directory) feels strange to me

I think I'd rather have a few slightly strange-feeling functions like this, than a library where people have to remember which functions are in File vs Filesystem based on how strange each one felt to the library authors :)

fredemmott commented 5 years ago

That sounds reasonable to me. What would be the disadvantages? Are there legitimate use-cases for people using low-level functions for anything where we provide a higher-level abstraction? (e.g. performance?)

Depends on if we tradeoff usability for completeness; for example, PHP and OCaml's readdir and scandir functions just return a list of filenames instead of dirent - if you want filenames and types, you need to do a function-call-per-file to find out; I've got a diff up where it looks like I'm making find in hh_server ~2x faster by switching to the posix call.

Also, if we want to be able to cover systems programming, we probably need to expose the POSIX-like behaviors (and there are non-technical restrictions on what we can name that namespace which is why errno is in OS); they're very specifically defined in annoying ways, but sometimes you need that (e.g. I needed that when implementing async socket connections); I think if we aim to be sufficiently complete in a higher-level abstraction, we're going to fail, and have a lot of hard-to-debug problems.

I think I'd rather have a few slightly strange-feeling functions like this, than a library where people have to remember which functions are in File vs Filesystem based on how strange each one felt to the library authors :)

Yeah... I need to figure out if there's an underlying reason that can be consistently applied, or if I'm just being weird about that :)

jjergus commented 5 years ago

Yeah I guess if one of the namespaces (OS?) is very explicitly for low-level things that almost no one needs, and anything that is not strictly equivalent to a low-level call goes to a different namespace, that seems pretty reasonable. "If you have to ask than this is not the namespace you want" is an easy enough rule for people to remember :)

fredemmott commented 5 years ago

Does "almost no one ever handles IO failures" count as "almost no one needs OS\Errno"? 😅

fredemmott commented 5 years ago

So, are we leaning towards:


Okay, actually a little bit of bikeshedding about readdir: I could go a step thinner:

using $dir = OS\opendir($path) {
  while ($dirent = OS\readdir($dir)) {
    // ...
  }
}

This would be very slightly better if you have a huge directory and want to exit early. It would be worse otherwise - many more C<->Hack calls.

IMO we should generally consider the 'returns each in turn then returns nullptr' C pattern to be 'the block as a whole returns a vec', and if there is a case where it turns out to matter, add both forms.

fredemmott commented 5 years ago

Actually that's another case of my fuzzy feelings rather than consistent principle. Let's go for as thin as practical, and add the slightly-less-thin versions if needed after profiling.

so, OS\opendir($path), OS\readdir($dir), add OS\readdir_vec($path) or something later if it turns out to actually be a problem.

jjergus commented 5 years ago

Sounds reasonable. Also eventually the OS namespace should probably be built-in, not part of HSL, right? We don't want wrappers like

namespace OS {
  function opendir($path) {
    return \opendir($path);
  }
}

in the long term (might be needed temporarily).

azjezz commented 5 years ago

Sounds reasonable. Also eventually the OS namespace should probably be built-in, not part of HSL, right?

IMO, IO\, File\, Network\, TCP\, OS\ ... etc, all should be built-in instead of built in top of php stdlib

fredemmott commented 5 years ago

built-in

IMO ideally the whole HSL should be built-in :)

I suspect what I'll end up doing to 'release' HSL IO is to add a new _Private\Native\FileDescriptor type, and add HH\Lib\OS\fwrite(FileDescriptor, ...) etc, with none of the PHP behaviors. Won't directly use ints, as that would allow requests to access each other's FDs.

jjergus commented 5 years ago

IMO ideally the whole HSL should be built-in :)

I meant "built-in" as in written in C++ with Hack code only in .hhi files. I don't think we need that for the whole HSL, but definitely the whole HSL should be "built-in" as in shipped with HHVM.

fredemmott commented 5 years ago

I dream of a world where the PHP stdlib needs a runtime option to be available at all - and the HSL doesn't depend on it :)

azjezz commented 5 years ago

I dream of a world where the PHP stdlib is not available at all - and the HSL doesn't depend on it :)*

azjezz commented 3 years ago

i have implemented this in PSL, it is built in top of PHP stdlib, but could provide a reference for naming.. etc. https://github.com/azjezz/psl/tree/1.8.x/src/Psl/Filesystem