nginxinc / nginx-s3-gateway

NGINX S3 Caching Gateway
Apache License 2.0
523 stars 129 forks source link

Gateway listing only 1000 first object of prefix/dir #150

Open anisimovdk opened 1 year ago

anisimovdk commented 1 year ago

Describe the bug Listing only 1000 first object of prefix/dir.

To Reproduce Steps to reproduce the behavior:

  1. Create more than 1000 object in bucket prefix/dir
  2. Try to list files tis nginx-s3-proxy
  3. You will see only first 1000 objects

Expected behavior List all files in current prefix/dir. Or display max files based on some settings w/wo pagination.

Your environment

dekobon commented 1 year ago

I just confirmed with the AWS documentation that the maximum number of results that can be returned is 1k. I imagine that the only way to solve this is to add paging support to the directory listing style sheet and list proxy javascript. Perhaps, the XSL stylesheet could grab the key name of the last key returned and pass it as part of a link on a query parameter, whereby nginx will then issue a request to S3 with that key name set as the marker.

What comes to mind is how to sanitize that input such that it does not become a vector for injection attacks.

This would be a great PR.

xquek commented 8 months ago

Hi folks, I am actually interested in this feature and is happy to give it a try with its implementation.

I was wondering what are your recommendation for santaizing the inputs?

Will storing the nextToken from the response from S3 in a cache and comparing it to the token in the request as well as request paramaters be sufficient?

EDITED: i just realized that keyval cache is only available for paid version of nginx.

4141done commented 8 months ago

Hi @xquek , thank you very much for your interest in helping out!

In terms of storing references to the nextToken or ContinuationToken cursor returned by s3, since this issue was filed, njs has introduced the [SharedDict](https://nginx.org/en/docs/njs/reference.html#ngx_shared) (more info here) object that can be used on both the paid and free version.

As a new(ish) maintainer I haven't dug too deeply into the xslt side of things but if we wanted "next" and "previous" functionality we could do something like pushing and popping the cursor into the shared dict. xslt_string_param may be helpful in getting things from javascript land into the xslt template.

As for sanitization, maybe @dekobon could comment more on their concerns. Populating the "next" and "previous" links in the template using values from the s3 object listing response seems safe enough to me but it's possible I'm missing something.

Happy to continue discussing as you get into the change but these are just my initial thoughts.

Ref: Another comment I made on a related discussion highlighting some files to look at: https://github.com/nginxinc/nginx-s3-gateway/discussions/207#discussioncomment-8550027

dekobon commented 8 months ago

My thoughts on sanitation is need to be sure that only your expected inputs make it to the eventual S3 URL. Otherwise, arbitrary access to the S3 bucket may be possible. I think characters like the following may be the most problematic: %@?+. As such, we need to make sure things are properly URI encoded and that we do not accept unencoded special characters.

As for storing the cursor for the next or previous token, it seems like you could write that to the HTML output generated by the XSLT page. I may be missing something, but I'm not sure you need to use a shared dictionary for that.

I imagine something like a link that goes: http://mys3gatewayhost.foo/dir1/dir2/index.html?next=marker If that link was written to the html output generated as part of an index page, you would have stored the state sufficiently between requests.

4141done commented 8 months ago

@dekobon thank you for your perspective. The assumption around a cache comes from the fact that s3 does not send any kind of cursor representing the previous page, only the next page. So if you wanted to enable "previous" button functionality you'd need to keep track of at least one token for the previous page.

dekobon commented 8 months ago

Couldn't the token for the previous link be passed as a URL parameter when you click the next button? This value could then be potentially passed to the XSL sheet.

4141done commented 8 months ago

Couldn't the token for the previous link be passed as a URL parameter when you click the next button? This value could then be potentially passed to the XSL sheet.

Ah right because the next link will go through the gateway as well so we'd be able to pull it out maybe using a js_set and template it back in using something like xslt_string_param? That makes sense to me. We'd have to test whether s3 will put up with us passing an unknown param along or whether we want to go to the trouble of massaging the querystring we pass on to s3.

4141done commented 8 months ago

Final update, I spoke with @dekobon offline and wanted to clarify a misunderstanding I had regarding sanitization:

Currently the s3 gateway strips any query parameters before passing requests on to S3 (this was the part I didn't know).

So in order to pass the continuationToken or nextToken we would need to create an allowlist of query parameters to be passed on, and properly uri-encode any links we generate that contain query parameters.

xquek commented 8 months ago

Thanks! @4141done , just wondering if you would like me to pick this up ! thank you!

4141done commented 8 months ago

Please do! Excited to see someone tackle this one!