pillarjs / send

Streaming static file server with Range and conditional-GET support

feat: added transform option #195

Open · wesleytodd opened 4 years ago

wesleytodd commented 4 years ago

Prior work:

Adds a transform option, allowing users to transform in-flight requests.

Things to note:

  1. As discussed in previous threads, using the transform option can invalidate the ETag and Last-Modified headers. I added documentation to that effect and set those defaults to false when transform is passed.
  2. Range requests, again as discussed previously, will behave oddly if you are not aware that your stream might not get the complete file. This is also documented, with a warning to set acceptRanges: false if you cannot write a transform which handles this kind of request.
  3. Compared to previous PRs, I added res and opts to the function; this should make it easier to implement transforms which support range and other types of requests (see the sketch below).
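For illustration, a rough sketch of how the option might be used; the exact signature (the file stream plus res and opts) is assumed from the notes above rather than final:

// Hypothetical usage sketch; the transform signature shown here is an
// assumption based on the description above, not a settled API.
// etag / last-modified default to false when transform is passed (point 1),
// and acceptRanges is disabled per the warning in point 2.
const http = require('http')
const send = require('send')

http.createServer(function (req, res) {
  send(req, req.url, {
    root: '/www/public',
    acceptRanges: false,
    transform: function (stream, res, opts) {
      // rewrite the body on the fly before it reaches the client;
      // myRewrite() is a placeholder for a user-supplied transform stream
      return stream.pipe(myRewrite())
    }
  }).pipe(res)
}).listen(3000)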

I think this resolves all the open conversations on the topic I saw from before. Let me know what you all think!

dougwilson commented 4 years ago

Thanks @wesleytodd for helping push this forward. I am generally skeptical that this type of option is truly worth all the complexity it adds, ultimately. Did you have a specific use case you are trying to achieve with this feature that you can describe?

wesleytodd commented 4 years ago

I do have a specific use case. I have an npm registry proxy; it caches the packuments and, for now, re-writes the tarball URLs before it writes them to disk. This is for robust npm client testing, so I often open the registry on port 0. That means every time I restart I have to blow away the cache, because it points to the wrong port. I would prefer to re-write on the other end, before sending from the cache, which would let me re-use the cache. I was using serve-static for this, but without this feature in send I would have to extract my own whole file server. With this, I can just modify the JSON on the fly. Does that make sense?
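To make that concrete, a rough sketch of the kind of rewrite involved (illustrative only; the field names follow the npm packument shape, and the port handling here is just an example):

// Illustrative only: walk a cached packument and point each version's tarball
// URL at the currently bound port, so the cache survives restarts on port 0.
function rewriteTarballUrls (packument, port) {
  for (const version of Object.values(packument.versions || {})) {
    if (version.dist && version.dist.tarball) {
      version.dist.tarball = version.dist.tarball.replace(/:\d+\//, ':' + port + '/')
    }
  }
  return packument
}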

wesleytodd commented 4 years ago

These are not all my comments yet; it seems that when you force-push while I am in the middle of a review, all my pending comments are erased... I assumed it was ready to review because of the review assignment.

GitHub really needs to fix that. Sorry, I was fixing the build failures. I won't push any more for a bit, to be safe.

dougwilson commented 4 years ago

How does this interact with an HTTP HEAD request, by the way? We want to be sure that, even if this is an advanced feature, it doesn't end up violating HTTP itself. Ideally we should prevent the user from making such violations if possible.

wesleytodd commented 4 years ago

How does this interact with an HTTP HEAD request, by the way?

Great question, and one I did not test or think of. Now that you say it, I am confident this would be broken if you did a transform which modified anything represented in the header values. Would it make sense to, when transform is passed, run the transform and read the file but not send it?

EDIT: also only do that when it is a file request

dougwilson commented 4 years ago

Yea, that could probably work. It is of course hard to know what the transform is going to do, and the transform (as it is currently) cannot take any shortcuts to simplify the HEAD case, which may be useful. I assume that a transform could change the Content-Type and other aspects, is that right? Ultimately, all HEAD needs to do is return the same headers the same request would have returned if it were a GET (including Content-Length, if GET would set that).
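To illustrate that last requirement, a minimal sketch (not send's internals; the handler here is just for illustration):

// Minimal illustration of the HEAD rule being described: compute exactly the
// headers a GET would produce, then omit the body when the method is HEAD.
function respond (req, res, body, type) {
  res.setHeader('Content-Type', type)
  res.setHeader('Content-Length', Buffer.byteLength(body))
  if (req.method === 'HEAD') return res.end() // same headers, no body
  res.end(body)
}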

wesleytodd commented 4 years ago

I assume that a transform could change the Content-Type and other aspects

Yeah, it could for sure do that. With your idea above of an fs wrapper I think we could get the behavior you are describing. To be honest, though, we can also get it from this implementation.

If the only requirement is to be technically correct, then I think I prefer the transform interface from here over the extra complexity a user would take on by building their own fs.createReadStream and fs.stat.

dougwilson commented 4 years ago

If the only requirement is to be technically correct

I mean, I would say that I think the "requirement" is that the user can implement a transform without sacrificing all the other features in this module by default. As in, they should be able to easily implement a transform that "just works" with the module and its feature set that makes this module actually more valuable than fs.createReadStream(...).pipe(res).

then I think I prefer the transform interface from here over the extra complexity a user would take on by building their own fs.createReadStream and fs.stat.

I tried to clarify this in a comment above; I'm not suggesting the user copy the interface of those two fs methods, as they do more than is necessary for this module. The two most popular requests for this module are transform and a virtual fs. I think we can actually solve both at the same time, and to me, that is worth it just based on what users seem to be wanting.

dougwilson commented 4 years ago

I do have a specific use case. I have an npm registry proxy; it caches the packuments and, for now, re-writes the tarball URLs before it writes them to disk. This is for robust npm client testing, so I often open the registry on port 0. That means every time I restart I have to blow away the cache, because it points to the wrong port. I would prefer to re-write on the other end, before sending from the cache, which would let me re-use the cache. I was using serve-static for this, but without this feature in send I would have to extract my own whole file server. With this, I can just modify the JSON on the fly. Does that make sense?

I am definitely not following what your use case means and how, exactly, the transform of this module actually helps in this case, haha. Are you saying that you have a bunch of files on disk you want to serve up, and some of those are JSON, in which case you want to read the JSON, modify it, and then return that as the response? If so, I can show you a solution that does all of that very easily without this functionality even being necessary. But I think I'm just not understanding the underlying ask, perhaps, not sure. Because unless these JSON files are gigabytes, doing what you are describing as a transform stream seems unnecessarily complicated compared with what is already possible with Express today.

wesleytodd commented 4 years ago

If so, I can show you a solution that does all of that very easily without this functionality even being necessary.

Yes, but remember I currently have serve-static in between. I know I can use this module directly to do this, but I could not find a way to also have all of the serve-static logic and the transform.

I also could add this functionality at the serve-static layer for my use case. I only did it here because it looked like there was already some agreed upon progress. If we think it would be better to move this up one level, that would also work for me.

If not, I will take a stab at doing an abstraction over the fs calls tomorrow and see what I come up with.

I think your concerns are valid; I just worry about overcomplicating things for less common use cases. For example, writing a transform stream over a PDF: while I totally understand that there are many specific use cases for range requests, those are not commonly the use cases where you would use a transform stream.

Lastly, I fully agree we should never break spec, and should also "do the right thing" by default. But we also cannot let perfect get in the way of shipping things that are useful to end users.

wesleytodd commented 4 years ago

Actually, without support here, wouldn't that just move all of the complexity we are discussing into that layer? I guess that might also not be a perfect solution.

dougwilson commented 4 years ago

Yes, but remember I currently have serve-static in between. I know I can use this module directly to do this, but I could not find a way to also have all of the serve-static logic and the transform.

Adding caveats doesn't help :) I'm sure you can keep adding them all day until you eventually describe what you're trying to achieve. Is it possible to share the code at all? My understanding above would still work even with your caveat, though.

Actually, without support here, wouldn't that just move all of the complexity we are discussing into that layer? I guess that might also not be a perfect solution.

No, but also I cannot say for sure unless you can share what you are trying to achieve so we can actually work together on a solution. Right now, there is not enough information unless I just accept your solution here for what it is. I want to help make a good, workable solution, but to do this, it needs to start from the goal of what is trying to be accomplished in a meaningful way, so alternative solutions can be thought of and measured against that high-level goal.

wesleytodd commented 4 years ago

I added you on the private repo where I am working on this. One thing to remember, I took up this PR because it was existing work and it also solved my need. There are other ways I can solve my problem, but I wanted to take this opportunity to also move forward the more generic work here. So if we can use my direct use case as an example, but think about how a more generic solution can be added here for others to use, that would be awesome.

wesleytodd commented 4 years ago

Also an aside, this is where having a good real time chat for the team would be nice 😄. Slack even has pretty decent video chat and screen share.

dougwilson commented 4 years ago

The reason those did not land is due to the issues mentioned in those PRs, and this PR doesn't actually resolve those issues. On top of that, I ultimately think it is the wrong solution to the underlying goal; the initial PRs simply did what looked easy, but it turns out it is much more complex, and those PRs just didn't want to solve that complexity. Landing something "just because" feels very wrong to me. I think we need to provide good solutions to the issues, and not just a "hey, add these 1-2 lines to your source and then it works, but with all these caveats that are ultimately going to make it hard to use in practice" :)

wesleytodd commented 4 years ago

Yeah, I agree completely that if it is hard to use in practice it is not worth landing. My point above is: what is "use in practice"? Because that usually doesn't mean every feature working together all of the time. That said, I take your feedback and would rather move on to seeing what an fs wrapper would look like, because that can solve all of the problems well.

dougwilson commented 4 years ago

Yeah, I agree completely that if it is hard to use in practice it is not worth landing. My point above is: what is "use in practice"? Because that usually doesn't mean every feature working together all of the time. That said, I take your feedback and would rather move on to seeing what an fs wrapper would look like, because that can solve all of the problems well.

Yea, sorry, I am trying to articulate myself over this forum, so please bear with me if it is not quite coming through :) I think the point I am trying to make is that ideally a system does not have "two modes of operation", for example -- if using a transform stream by default will cut off a large amount of the functionality this module is providing, at what point is it just a different module? For example, you can certainly get by with just fs.createReadStream(...).pipe(res). It seems that the transform option, as implemented, basically turns off mostly everything in this module that makes it more than that snippet, essentially creating a single API (the send stream) that operates very differently depending on a single option.

But apart from that, I think we can actually implement transform in a way that doesn't require turning off all those features, which is really what I'm trying to drive towards. Of course, users can still turn off features they do not want, but those features add a lot of value, and I think if transform is provided by the platform, it should feel like it's part of the platform, letting you take advantage of all the features the platform is providing (etags for caching, range requests for partial downloads / download resumption, etc.). And from the two main feature requests from this module (transform + virtual fs), I think we can provide an interface that can be both easy to use and solve both things, which would be the biggest win for users and ourselves.

dougwilson commented 4 years ago

I added you on the private repo where I am working on this.

Ok, so you are not using Express, I see. That would make the solution more complicated, because of the crutch added by trying to re-implement everything Express provides, which seems a bit odd, really. I see quite a few things in there that are already Express sugar, but the bolt-ons actually introduce spec footguns that Express takes away :)

Regardless, where is the spot in the code in which you would need to place this functionality, exactly? I assume at the one spot where the serve-static module is used?

dougwilson commented 4 years ago

Ok, I see where it is in there. The way you are using the router, you can implement this and reduce your overall complexity a lot. I think where you may have gotten mixed up is that, for your packuments, those are not actually "static" for your use case, but here you are, trying to use something created for static use cases (I mean, it's part of the name of the module you imported 😆). I see what you mean by wanting to move the transform. I assume I can make a PR to your repo? If so, I would be happy to throw one up tomorrow, if you are willing, to see what I mean. I think it would reduce a lot of the complexity in your implementation there, ultimately, and get what you are trying to achieve.

wesleytodd commented 4 years ago

Happy to have a PR, but I don't want you to spend your time if it is just to use send directly and do send(...).pipe(transform).pipe(res). I can do that version of it myself. The reason it is the way it is today is because I thought I could get away with it being purely static. When I realized I could not, I figured one way I could solve it was finishing this PR.

wesleytodd commented 4 years ago

Also, happy to hop on a call, because it might be faster. The problem here is that I shared a work in progress, hence the private repo, so it is messy.

Since it is clear this PR will not move forward, I think we should close all of these so that no one else makes the same mistake I did in trying to pick it up again. Then I can work on my project with just send and a new API to implement the transform as discussed above.

dougwilson commented 4 years ago

but I don't want you to spend your time if it is just to use send directly and do send(...).pipe(transform).pipe(res).

Gotcha. So I wasn't going to just change it to send(...).pipe(transform).pipe(res), haha. I don't think that would work anyway, as you cannot just pass any stream as the argument to this module's .pipe; this module is not actually a standard Node.js stream: it needs to write headers to the response and such for its features to work.

Also, happy to hop on a call, because it might be faster. The problem here is that I shared a work in progress, hence the private repo, so it is messy.

I am always happy to do so as well anytime; we don't have a "norm" for that, so would need to ad-hoc it somehow if you like; I think you have my phone number for example. While we were in the TC meeting yesterday, I actually opened for discussion the idea to have a "realtime" communications area, to see what folks think. If that gets going, that could be a method in the future (but for the time being, you can email me to set up something I suppose as a solution, lol).

I think we should close all of these so that no one else makes the same mistake I did in trying to pick it up again.

Hm... I guess that makes sense. I feel bad closing folks' PRs on them, but looking back at at least one (I didn't look at all of them before responding here), I guess they were stale in that I provided some feedback and the OP didn't respond back. I can see it from that POV. There is actually a top-level issue about transform that helps, as I think the idea of a transform is valid. Just looking at the age of at least one of those PRs, I think once the virtual fs requests came up is when I started to form the new idea of getting them both done together, as I think I tried to articulate above (again, sorry if it's not clear, please be gentle; it's hard to share a design this large over comments in GH, so I probably need to whiteboard it and/or write up a technical document to describe what I'm thinking of to convey it better).

wesleytodd commented 4 years ago

Ok, so I spent a bit of time this morning looking at how we could provide a "synthetic filesystem". The first challenge will be that we expose the underlying stat data in our API via the file event. We use stat.size, stat.mtime and stat.isDirectory, but then pass the full stat to self.emit('file', ...). So if we were to shoot for non-breaking changes, we would need an API which allows for replicating the entire stat object even though we only use part of it.
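To make that concrete, a minimal sketch of the stat-like surface a synthetic filesystem would have to replicate (illustrative only, not a proposed API):

// Illustrative only: the minimal shape a synthetic stat would need so that
// send's internal checks and the 'file' event keep working as they do today.
function syntheticStat ({ size, mtime }) {
  return {
    size,                     // used for Content-Length and range math
    mtime,                    // used for Last-Modified / conditional GET
    isDirectory: () => false, // send branches on file vs. directory
    isFile: () => true        // consumers of the 'file' event may expect more of fs.Stats
  }
}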

I think there are reasonable breaking changes we could make, but since I don't think this is nearly important enough to hold up express@5 for, it would take years to roll out across the board. I am not sure the trade-off would be worth it on that timescale, so I think we should shoot for an API which allows for backward compatibility. If you agree with this, @dougwilson, I will continue my experimentation with this in mind.

wesleytodd commented 4 years ago

OK! I did a bunch more experimentation and I found an interesting solution which enables what I want while also not requiring any changes in this lib. Posting here to see what you think, and if it seems fine to you, I will publish the work as a module and we can include an example of how to use it.

function transformResp (res, transform) {
  // Pipe the transform's output into the real response
  transform.pipe(res)
  // Hand back a proxy of the response that redirects body writes into the transform
  return new Proxy(res, {
    get: (target, prop) => {
      if (prop === 'write' || prop === 'end') {
        return (d, e, c) => transform[prop](d, e, c)
      }
      // Everything else (setHeader, headersSent, events, ...) goes straight to res
      return target[prop]
    }
  })
}

And here is an example of using it:

// Assumes this runs inside a request handler where `req` and `res` are in scope
const { Transform } = require('stream')

const buf = []
const s = new Transform({
  transform: (d, e, cb) => {
    // Buffer the whole file before transforming it
    buf.push(d)
    cb()
  },
  flush: (cb) => {
    const json = JSON.parse(Buffer.concat(buf).toString('utf8'))
    // transform json...
    const b = Buffer.from(JSON.stringify(json), 'utf8')
    res.setHeader('content-length', b.length)
    cb(null, b)
  }
})

// Send file
send(req, '/file.json').pipe(transformResp(res, s))

Obviously this is not quite complete; there are other properties you would need to proxy to get the full behavior of the stream, but it illustrates the general point. With this I am able to use the same transform stream for HEAD requests and when I proxy to an upstream server, which was my goal in adding the transform option.