nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

Compressed packages #1278

Closed piranna closed 8 years ago

piranna commented 9 years ago

(Originally posted at Node.js)

Allow require() to load compressed packages as dependencies inside the node_modules folder, the same way Python does. The idea is to use less disk space and make transferring projects faster, since only one file would be moved instead of a set of them. It would also allow checksumming them. Obviously, in this case the compressed package would not contain its own dependencies (unless shrinkwrap or bundledDependencies are being used), but the Node.js package resolution algorithm (search for a node_modules folder in the current directory and all its parents up to /) would be able to resolve this.

To do so, the only change needed would be that when a compressed file (.zip?) is found inside the node_modules folder, it is expanded dynamically and its package.json is processed the same way it would be for a folder.
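
For illustration only, a minimal sketch of what that lookup change could look like, assuming a zipped package sits where its folder would normally be (node_modules/foo.zip instead of node_modules/foo/); the readPackageJsonFromZip() helper is hypothetical and stubbed:

```js
'use strict';
const fs = require('fs');
const path = require('path');

// Hypothetical helper: read package.json out of the archive in memory,
// e.g. with a userland zip reader. Stubbed here.
function readPackageJsonFromZip(zipPath) {
  throw new Error('not implemented: read package.json inside ' + zipPath);
}

// Sketch of the proposed lookup: prefer the usual folder, fall back to a
// compressed sibling, otherwise continue up the node_modules chain as today.
function resolvePackage(nodeModulesDir, name) {
  const dir = path.join(nodeModulesDir, name);
  if (fs.existsSync(dir) && fs.statSync(dir).isDirectory()) {
    return { type: 'directory', packageJson: path.join(dir, 'package.json') };
  }
  const zip = dir + '.zip';
  if (fs.existsSync(zip)) {
    return { type: 'zip', packageJson: readPackageJsonFromZip(zip) };
  }
  return null;
}
```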

meandmycode commented 9 years ago

Well, as far as I understand, packages are transferred from npm as archives that are then expanded.

Also, I'm not sure this would actually use less disk space (perhaps in source control, but ideally you shouldn't be checking in node_modules anyway). The archive will still need to be uncompressed somewhere when require()'d, so even in the best case where you only ever unpack an archive once (which I feel is more complex than it seems), you'll end up with the unpacked size plus the archived size anyway.

silverwind commented 9 years ago

Modules inside such an archive would need fs access to their folder, which leaves uncompressing the whole thing as the only option, which in turn would add startup delay.

I'm thinking a userspace module that wraps require() is probably the better option.

piranna commented 9 years ago

Well, as far as I understand, packages are transferred from npm as archives that are then expanded.

Correct, and they are cached in the ~/.npm folder. They could be used directly, just by copying them to node_modules instead of extracting them there.

The archive will still need to be uncompressed somewhere when require()'d, so even in the best case where you only ever unpack an archive once (which I feel is more complex than it seems), you'll end up with the unpacked size plus the archived size anyway.

This can be done temporarily in memory when calling require() to import the JavaScript code into the Node.js runtime, and the data can be discarded afterwards.

piranna commented 9 years ago

I'm thinking a userspace module that wraps require() is probably the better option.

require.extensions would be a good solution for this use case, but it's deprecated... :-(
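
For the record, the shape such a hook would take (using the deprecated API, with the zip reading left as a hypothetical stub) is roughly:

```js
'use strict';

// Hypothetical helper: return the source text of the package's main file
// from inside the archive (e.g. via a userland zip reader). Stubbed here.
function readMainSourceFromZip(zipPath) {
  throw new Error('not implemented: extract main source from ' + zipPath);
}

// Deprecated API, shown only to illustrate the idea. Once an extension is
// registered, require('foo') would also consider node_modules/foo.zip.
require.extensions['.zip'] = function (module, filename) {
  module._compile(readMainSourceFromZip(filename), filename);
};
```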

meandmycode commented 9 years ago

I'd say that in general memory is more at a premium than disk space, so for most cases that trade-off probably isn't worth it. Giving the user the choice probably isn't a bad thing, but I'm not sure it warrants the effort, potential breakage and maintenance cost; for example, I'm not sure how things like __dirname would then work (at least on Windows) for an in-memory file system.

Just my 2c

piranna commented 9 years ago

for an in-memory file system.

Oh no, that's not necessary; once the JavaScript objects are loaded, the files are not needed anymore. It would be a different matter if a package ships extra files like images; in that case a filesystem would be needed anyway, but that's not related to how require() works, it's a non-standard usage of node_modules. So in that case it's not something Node.js/io.js should worry about, but rather the developer and/or the package manager (npm), by not using a compressed package for it.

Fishrock123 commented 9 years ago

Maybe cc @bmeck

bmeck commented 9 years ago

Videos

Empire Node: https://www.youtube.com/watch?v=k5r0kQlsDgU

Node Summit: http://nodesummit.com/media/universal-distribution-via-node-archives/

@piranna please check out bmeck/noda-loader and bmeck/resource-shim. There are multiple reasons we want it to be a zip file and not a tarball.

@silverwind modules using archives etc. should move to a resource-loading approach. We tried to implement fs virtualization and found leaky abstractions to be a problem (look at jxcore's issue list about all of this). Use something like the links above to get read-only resources that do not attempt to preserve stat() compatibility / fs case sensitivity / disable half of the fs module because you can't write / etc.

I have been doing a fair amount of testing this in userland and there are ways to do it well, but it does require some code changes. All the virtual fs implementations I have seen (jxcore, atom, nexe, http://www.py2exe.org/) have hard-to-detect and hard-to-solve issues with their fake fs. I'd rather have explicit changes that are blatant and whose issues are somewhat easy to fix, by introducing a resource API. As an added bonus, you can start moving away from archive-based resources and introduce things like a generic way to load resources from the internet / purely from memory (basically how the loader works right now) / etc.
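
To make the distinction concrete, here is a rough sketch of what an explicit resource API could look like; this is not noda-loader's actual interface, just the general shape:

```js
'use strict';
const fs = require('fs');
const path = require('path');

// Hypothetical resource API: read-only, addressed relative to the module,
// and never pretending to be the real filesystem. A folder-based package
// can back it with fs; an archive-based one would read from the open zip
// file descriptor instead.
const resources = {
  readFile(relativePath) {
    return fs.readFileSync(path.join(__dirname, relativePath));
  },
};

// Module code asks for its resources explicitly instead of using fs + __dirname:
// const template = resources.readFile('src/template.html');
```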

bmeck commented 9 years ago

@piranna also, I forgot: you do need to keep the zip archive in memory, otherwise you can get into mutation issues. Luckily we can just open the file rather than read it entirely.

bmeck commented 9 years ago

The 2 notable packages that have problems with this are node-school's workshopper and npm's bundled node-gyp due to trying to execute processes inside of their directory; the solution is to extract the executables to disk or to install them as siblings rather than inside of a module itself.

llafuente commented 9 years ago

How about new ways to do things...

require("xxx")

That leads to the following problems (exposed above)

In the end we need to keep legacy code working, so I would add something to package.json, something like "compress": true (false by default). This way the developer takes care of the necessary steps needed to adapt the module.

Hope "node://" idea solve many problems.

piranna commented 9 years ago

@piranna also, I forgot: you do need to keep the zip archive in memory, otherwise you can get into mutation issues.

I don't think this is necessary for JavaScript... Or are you talking about resources (not exported JavaScript code)?

Luckily we can just open the file rather than read it entirely.

Yeah, keeping an open file descriptor to the compressed file is a nice idea; this way it would be prevented from being changed on the fly :-)

piranna commented 9 years ago

or to install them as siblings rather than inside of a module itself

This is intended for npm 3 or 4, and as I pointed out on the npm issue, with a flat tree hierarchy all dependencies would be compressed; they would only be expanded, with their internal node_modules folder used, when there is a version conflict between dependencies.

piranna commented 9 years ago

if everything fails -> read node_modules.zip (That must be UNIQUE)

What advantage does this have over node_modules/xxx.zip? I think the latter is cleaner, and it doesn't require re-packing (compressing) the dependencies when they are installed, only copying them...

bmeck commented 9 years ago

@piranna I was referring to mutation of resources (including partial mutation of module code).

Right now this is possible, for example, with lazy requires: if you require part of a module that lazily requires a file $a, you can change $a prior to the lazy require. This can and should be prevented in archives:

  1. sanity
  2. in order to allow code signing
  3. ensuring module level resources are read only

@llafuente I strongly urge not to use a virtual file system:

  1. case sensitivity varies per FS
  2. file permissions cannot be correctly emulated
  3. some systems do odd things to files internally such as path expansion or http://en.wikipedia.org/wiki/8.3_filename
  4. treating resources as filenames gets into logical problems like... why can't I use exec(cat $filename) when fs.readFile works?
  5. you cannot write to an in memory archive well / it is misleading since you start taking up RAM (when you might be trying to relieve memory pressure even!).

The major point is... changing fs to not refer to the system fs is a bad idea, and leads to implementation issues that are not solvable easily, logical issues that are not solvable easily, and misleading usage if you do full emulation.

Having all fs operations fail inside of your archive I find preferable to having a sometimes it works approach to the fs module. It makes it fairly fast to fail, and also fast to find what to fix (you get big angry warnings).

If you do want to continue at this route:

  1. the root __dirname should be the root of the module, a la /path/to/node_modules/x.zip; this avoids collisions by placing them inside an actual file, which filesystems do not allow (except macOS dmg expansion).
  2. native modules, as stated in my videos, must be extracted to the filesystem and loaded from there. SmartOS still has a way to load shared libraries from memory, but other OSes have removed that feature (or never had it...).
  3. porting to the new protocol should be more generic: since you are specifying a VFS, something like "zip://" would make more sense than "node://"
  4. you would need a way for submodules to be swapped into a node module mode if you use a protocol. This means code changes for submodules, which is difficult. I might actually side with not having a protocol at all, and just being able to detect zip files transparently.

I still strongly urge you to go look at the node/io.js/npm issues arising from filesystem differences (there are several), then go take a look at the projects I listed above for packaging issues related to the filesystem.

silverwind commented 9 years ago

Why not extract to os.tmpdir() and if the module touches any files there (detected by fs hooks), repackage on process termination? I think this packaging should work transparently and without introducing new API.
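
A minimal sketch of the extract-then-require half of that idea, assuming an `unzip` binary on PATH and leaving out the fs hooks and the repackage-on-exit step:

```js
'use strict';
const os = require('os');
const fs = require('fs');
const path = require('path');
const { execFileSync } = require('child_process');

// Unpack the archive into a fresh temp directory and load it like a normal
// package folder (the directory is expected to contain package.json).
function requireCompressed(zipPath) {
  const dest = fs.mkdtempSync(path.join(os.tmpdir(), 'pkg-'));
  execFileSync('unzip', ['-o', zipPath, '-d', dest]);
  return require(dest);
}
```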

bmeck commented 9 years ago

@silverwind it depends on the purpose of compressed packages, for example:

llafuente commented 9 years ago

Hi @bmeck,

I feel that there is no single solution that would truly work for all use cases; that's why I propose a clear path that I believe covers most of them.

About the protocol problems you mention:

case sensitivity varies per FS

It's a new protocol, so it's our decision whether it's case sensitive or not... (and it should be, because of Linux)

file permissions cannot be correctly emulated

Don't ""really"" needed, permissions are at zip level (I hope that should be enough)

treating resources as filenames gets into logical problems like... why can't I use exec(cat $filename) when fs.readFile works?

"node://" isn't available at OS level, if you needed... I don't see any solution for this use case rather than extracting your (and maybe all) module on the fly in any implementation. spawn a process directly from the zip file or memory... i really don't know if it's even possible, even with the protocol...

you cannot write to an in memory archive well

I don't think anybody writes data inside their modules; write to a temp folder instead, and again, just don't flag your module.

the root __dirname should be the root of the module, a la /path/to/node_modules/x.zip [...]

Are you sure fs.readFileSync(__dirname + 'src/template.html') makes sense? x.zip could be a folder... zip:// is not enough, because you can treat the zip file itself as the "root device", something like zip://x.zip/index.js. It's possible but strange, and you can only open zips from the real machine filesystem and not over http... I don't find it clean; I hope you can provide a usage example with the current fs API.

Unique file (zip) vs one zip per module

I propose a new protocol that reads one file, for simplicity/efficiency/ease of implementation. Managing multiple files will be a mess. When somebody needs to read a file inside their module, you have to give them something. If there are 50 modules, you will have 50 fds open... instead of one.

The real problem I see here is that both solutions must coexist, and that leads to many edge cases, well exposed here; we can't find solutions for all of them, because I think there are none.

bmeck commented 9 years ago

@llafuente

Add a flag to package.json: "my module is ready to be compressed or not"

This smells to me, as it does not involve actually changing your code to avoid problems when problems could arise. It's like telling someone "trust me", which doesn't prevent problems if it only mostly works.

Don't ""really"" needed, permissions are at zip level (I hope that should be enough)

If you use fs and fs.stat is not perfect: problems. If fs permissions are not perfect: problems. If fs limitations for modules that detect your OS are not emulated... it goes on.

It is not about being good enough; it is about being perfect or being a liar.

Are you sure fs.readFileSync(__dirname + 'src/template.html') makes sense? x.zip could be a folder...

This is fine: we can't assume that any .zip in your file path is a file, you always must run a check that it is a zip file, the same way we can have folders with .js on the end. The point is that you can't have both a file and a folder at the same path on your filesystem. Once you have a zip file there, we know that inner paths cannot be real files, and we can create our VFS root at that point.

zip:// is not enough, because you can treat the zip file itself as the "root device"

I am unclear on what you mean by "root device".

I don't find it clean; I hope you can provide a usage example with the current fs API.

I avoid this in bmeck/noda-loader, but I do not use the fs API because of compatibility problems that are not readily apparent (see the projects' issue lists above). I could write the zip loader to emulate the fs API, but I have no desire to, and see only hard-to-find problems arising if I do so.

If there are 50 modules, you will have 50 fds open... instead of one.

You only need to keep one fd open per zip file; once again, look at noda.

The real problem I see here is that both solutions must coexist, and that leads to many edge cases, well exposed here; we can't find solutions for all of them, because I think there are none.

I am unsure they must coexist, but I do agree that they each have trade offs. I side with explicit changes mainly because I don't like my APIs to work differently (slightly, sometimes, not exactly) based upon context; this is true even if the API works well enough for most cases.

llafuente commented 9 years ago

I will list all the problems I see, and we can then find a solution that solves all (or most) of them.

@bmeck you have experience, expose more crazy things!


Most of the problems exposed are academic, but there they are.

I will try to study what Python does and its limitations; I'm sure there are many.

bmeck commented 9 years ago

Going to just state that I actively want the fs API to break with bundled resources in compressed modules. As such, we need to tag the issues according to whether they affect the VFS approach or the new Resource API approach.

GitHub is bad for comparison views.

I will put up a big comparison over the next day or two in a Google spreadsheet that anyone can edit.

bmeck commented 9 years ago

also see https://github.com/node-forward/discussions/issues/10 and https://github.com/node-forward/discussions/issues/10#issuecomment-59823518

Re: Virtualized File System

Not going to happen. Has far too large a footprint on core code, too many unknowns to deal with and an overall general PITA. I think @indutny properly stated it on IRC:

oh god

no no no

piranna commented 9 years ago

@bmeck:

@piranna I was referring to mutation of resources (including partial mutation of module code).

Ideally packages should be read-only and somewhat black boxes, and most of them are, so package mutation should not be a problem, and compressed packages would encourage this. The only "valid" case where a module would need to write files inside itself is compiled modules; for those it's easier to leave them decompressed, or to compress them again after the module's compilation, and they can easily be identified because they have an "install" entry in the "scripts" field of package.json or a binding.gyp file.

@llafuente:

zip:// is not enough, because you can treat the zipfile itself as "root device" some like: zip://x.zip/index.js it's possible but strange and you can only open zips from real machine filesystem and not http... i don't find it clean, i hope you can provide an usage example with the current fs API.

URIs are stackable, so what you propose could be somewhat doable with a file://zip:x.zip/index.js URI (sorry, I can't find the RFC saying this, but XBMC implemented it aggressively... playing videos directly from inside zip files over SMB is crazy :-P ). Anyway, the problem here would arise when directly accessing non-JavaScript files stored in the package (which the resources API from @bmeck would help to fix and clean up) or when importing JavaScript files that are not directly or indirectly exported on module.exports (and I think these are fewer every day). I think require() of relative paths can be restricted to only work inside the package itself, and this would help to fix the packages that are not fully modular.
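
A quick way to express the heuristic mentioned earlier (compiled packages, which may need to stay uncompressed, are recognizable by an "install" script or a gyp file) could be, roughly:

```js
'use strict';
const fs = require('fs');
const path = require('path');

// Heuristic sketch: a package that compiles native code usually declares an
// "install" script or ships a binding.gyp at its root, so it should be left
// uncompressed (or recompressed only after its build step).
function needsRealDirectory(pkgDir) {
  const pkg = JSON.parse(fs.readFileSync(path.join(pkgDir, 'package.json'), 'utf8'));
  const hasInstallScript = Boolean(pkg.scripts && pkg.scripts.install);
  return hasInstallScript || fs.existsSync(path.join(pkgDir, 'binding.gyp'));
}
```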

nubs commented 9 years ago

I also want to add that this could help a lot with reducing inode usage. It's not uncommon for node_modules to have a lot of files in them. These tend to multiply and result in hundreds of thousands of inodes being created. Most filesystems have a limited number of inodes at creation-time and this limit can be easily reached by node projects.

chrisdickinson commented 9 years ago

It might be worth moving this discussion over to iojs/NG.

piranna commented 9 years ago

For reference, Python has had this feature, implemented by the zipimport module, since Python 2.3 :-)

Fishrock123 commented 9 years ago

@bmeck I read your spreadsheet over at https://docs.google.com/spreadsheets/d/1PEDalZnMcpMeyKeR1GyiUhOfxTzQ0D05uq6W9E9AVYo/edit#gid=0, and the resource API seems most reasonable to me. What's involved in implementing it? Any caveats?

bmeck commented 9 years ago

@Fishrock123 https://github.com/bmeck/noda-loader + https://github.com/bmeck/resource-shim have it all set up; it would need a route to get it into core and code coverage for it.

bnoordhuis commented 8 years ago

Closing due to inactivity. No one followed up with working code so far and the intrusive changes needed to the module loader would probably make it a tough sell. I don't envy whoever wants to tackle this.

piranna commented 8 years ago

I'm still interested in this, I only left it for discussion. I don't think it would be difficult to do by registering a new extension for require(), but since that way of doing things seems to be deprecated, that's why I didn't write any code. If you think that registering a require() extension would keep this moving forward, I can do it really quickly...

bmeck commented 8 years ago

@piranna noda works fine for now, but I am not attempting to push it into core. The main purpose of my use case was superseded by using Docker.

piranna commented 8 years ago

@piranna noda works fine for now, but I am not attempting to push it into core. The main purpose of my use case was superseded by using Docker.

It's a shame :-( I'm too busy at the moment with work and with NodeOS, otherwise I would try to push it into core myself; now with npm's flattened hierarchy it would be really easy to do... :-(

Namek commented 8 years ago

I can't believe this topic died so easily! Being productive in the Node environment is really hard, and this makes it a lot harder. npm should support compressing whole packages from the ground up; if not server-side then client-side during download. Transferring a project between computers always takes so long that I almost forget what I'm doing while waiting.

piranna commented 8 years ago

Maybe you could implement it? I think it shouldn't be too difficult to do a proof of concept, at least as a require() extension registration, and see if it gets some traction... I would suggest registering the .tgz extension, since it's the one used by 'npm pack'.

Namek commented 8 years ago

I think it shouldn't be too difficult to do a proof of concept

I don't really see it as a small task. Since I have some vision of it, I'll share it:

  1. npm install should download everything to RAM, not disk; that by itself could speed things up a lot.
  2. then there is a step where versions of various modules are resolved against multiple package.json dependency constraints; this could still be managed in RAM
  3. then modules are compressed
  4. every call of require() first looks for .tgz files, with the standard way as a fallback

The dilemma I see here is how exactly to package modules into files, because as I said, a simple vue.js webpack template project has about 770 dependencies (node_modules contains that many subfolders). To improve it even more, the whole node_modules could be packaged/compressed into one big file.

If I misunderstood something, it's probably due to my shallow knowledge of the Node environment.

Maybe you could implement it?

I don't feel like doing this. There are people with more time who are still developing Node/npm; I'm not one of them.

I'm just curious why people aren't discussing this for the sake of the performance of their work. npm install itself takes so long with some configurations that I can't stop wondering how the maintainers of npm can be OK with this. Maybe it's just me, or maybe people don't realize it doesn't have to be this way. Remember development in C++: when I needed some libraries I would just copy a few DLLs. And here? For a simple vue.js webpack template it's over 18k files, and I won't even mention Angular 2.

piranna commented 8 years ago

You are confusing Node.js with npm. For Node.js it's just a matter of registering a .tgz extension, and the packages will be picked up automatically from node_modules, that's all. Once this is working, npm could simply copy or symlink the dependencies from its cache, and it will work as is. The only problems would be version conflicts that force a non-flat node_modules folder (in which case npm would use decompressed packages, as it has always done) and packages that mutate their content after install, which is bad design by the developer. Definitely not too difficult a task in any case.
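
As a rough proof-of-concept of what "registering a .tgz extension" could look like (require.extensions is deprecated, a `tar` binary on PATH is assumed, and --strip-components=1 accounts for the package/ prefix that `npm pack` tarballs use):

```js
'use strict';
const os = require('os');
const fs = require('fs');
const path = require('path');
const { execFileSync } = require('child_process');

// With this hook in place, require('foo') could also pick up
// node_modules/foo.tgz; version conflicts would still fall back to the
// usual expanded folders, as described above.
require.extensions['.tgz'] = function (module, filename) {
  const dest = fs.mkdtempSync(path.join(os.tmpdir(), 'tgz-'));
  execFileSync('tar', ['-xzf', filename, '-C', dest, '--strip-components=1']);
  module.exports = require(dest);
};
```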

bmeck commented 8 years ago

https://github.com/bmeck/noda-loader has a .noda file extension loader that expects the file to be a ZIP file. I would avoid TAR files since they don't support random access, and must be read sequentially in order to find files.
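
For context on the random-access point: a ZIP archive ends with an "end of central directory" record that points at a listing of every entry, so a loader can seek straight to any file. A minimal sketch of locating that record, assuming the archive has no trailing comment:

```js
'use strict';
const fs = require('fs');

// Read the 22-byte end-of-central-directory record at the tail of the file
// and return where the central directory (the index of all entries) starts.
function readCentralDirectoryInfo(zipPath) {
  const fd = fs.openSync(zipPath, 'r');
  const size = fs.fstatSync(fd).size;
  const eocd = Buffer.alloc(22);
  fs.readSync(fd, eocd, 0, 22, size - 22);
  fs.closeSync(fd);
  if (eocd.readUInt32LE(0) !== 0x06054b50) {
    throw new Error('EOCD signature not found (archive may have a comment)');
  }
  return {
    totalEntries: eocd.readUInt16LE(10),
    centralDirectoryOffset: eocd.readUInt32LE(16),
  };
}
```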

piranna commented 8 years ago

That's because tar doesn't have a central directory, but it's possible to build one on the fly while reading it...

bmeck commented 8 years ago

If you want to read all of it, note that node_modules dirs get particularly large when you are putting dependencies inside of them.

piranna commented 8 years ago

I don't want to read them, just walk over them. Only the offset of each file is needed, and later they can be accessed with mmap; the problem would be with compression...
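
What "building one on the fly" could look like for an uncompressed .tar: one pass over the 512-byte headers, recording each entry's data offset and size so later reads can seek directly (gzip compression would defeat this, as noted):

```js
'use strict';
const fs = require('fs');

// Build a name -> { offset, size } index from the tar headers. Each header
// is a 512-byte block: the name sits at offset 0 (100 bytes) and the size
// at offset 124 (12 bytes, octal); entry data is padded to 512-byte blocks.
function indexTar(tarPath) {
  const fd = fs.openSync(tarPath, 'r');
  const header = Buffer.alloc(512);
  const index = {};
  let pos = 0;
  while (fs.readSync(fd, header, 0, 512, pos) === 512) {
    const name = header.toString('utf8', 0, 100).replace(/\0.*$/, '');
    if (!name) break; // empty blocks mark the end of the archive
    const size = parseInt(header.toString('utf8', 124, 136).trim(), 8) || 0;
    index[name] = { offset: pos + 512, size };
    pos += 512 + Math.ceil(size / 512) * 512;
  }
  fs.closeSync(fd);
  return index;
}
```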

ghost commented 7 years ago

Did anyone solve this?

Does this help??

https://github.com/webpack/memory-fs/issues/43

darkguy2008 commented 6 years ago

This does not have enough likes! I'm tired of 5-minute transfers of a simple Angular 6 CLI project with node_modules installed, between an HDD and an SSD. Even on SSDs, the file count is huge. Any npm project is a pain to compress, move, copy, whatever.

Any progress on this?

bmeck commented 6 years ago

@darkguy2008 see https://github.com/WICG/webpackage, which should be compatible with Node, but isn't coming in the short term since it still needs to be ratified.

arcanis commented 4 years ago

I sometimes feel like a __halt_compiler directive is all that's needed. Then userland could develop its own unpacking strategies, whether it's zip, tar, or something else entirely 🤔

skyshore2001 commented 4 years ago

Thanks for linking my feature request here. I hope the topic will be reopened later.