open-guides / og-aws

📙 Amazon Web Services — a practical guide
Creative Commons Attribution 4.0 International

Bullet/links/gotcha emoji for limitations on s3fs-like tools #393

rdymek opened this issue 7 years ago

rdymek commented 7 years ago

S3 should never be mounted as a filesystem. It is a bad idea and will result in very poor performance. s3fs and any other tool like those mentioned here are attempting to use S3 in a way it should not be used.

S3 is not a file system; it is object storage, not block storage. If you wish to use it this way, there are methods that are considered best practice.

One option is to use the aws s3 sync command. This keeps a local location in sync with what's in S3 on a periodic (cron) basis; a rough sketch follows.
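
For illustration, a minimal sketch of what that periodic sync might look like (bucket name, paths, and schedule are placeholders):

```bash
# One-off sync of a local directory to S3; --delete also removes remote objects
# whose local counterparts are gone.
aws s3 sync /var/data s3://my-example-bucket/data --delete

# Example cron entry (crontab -e) to repeat the sync every 15 minutes:
# */15 * * * * /usr/local/bin/aws s3 sync /var/data s3://my-example-bucket/data --delete >> /var/log/s3sync.log 2>&1
```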

Additionally, AWS offers Storage Gateway and File Gateway, which are a far superior way to use S3 as a file system and are native AWS products. These services are designed to operate this way.

Any tool that mounts S3 in the way currently mentioned here goes against best practice.

bgdnlp commented 7 years ago

I don't think this is the place for such debates and I believe this issue should be closed. If you'd like to open a pull request to better explain the downsides of using S3 as a filesystem, sure. But my understanding is that you want to take down the whole section, and I disagree with that.

These tools clearly have their place, otherwise there wouldn't be a bunch of them in active use. Sometimes people don't care about performance as much as they care about something being simple and working a certain way without much fuss. There are pros and cons to anything.

Try using aws s3 sync on a bucket that is a few terabytes in size with a lot of small files.

Storage gateway would be great, except it was designed to compete with storage drives and is way too expensive. Look at the requirements: http://docs.aws.amazon.com/storagegateway/latest/userguide/Requirements.html . About $170 per month for the ec2 instance alone. Plus everything else on top. Compared to a NetApp it might be cheap. Compared to a $10 t2.micro that mounts an S3 bucket and provides SFTP access? If it works and that's all you need, the choice is obvious.
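
For context, the kind of setup being described is roughly the following, assuming the s3fs-fuse package and an instance profile attached to the t2.micro (bucket name and mount point are placeholders):

```bash
# Install s3fs-fuse (package name on Debian/Ubuntu; other distros differ).
sudo apt-get install -y s3fs

# Mount the bucket using the instance profile's credentials; allow_other lets
# other local users (e.g. the SFTP accounts) see the mount.
sudo mkdir -p /mnt/my-example-bucket
sudo s3fs my-example-bucket /mnt/my-example-bucket -o iam_role=auto -o allow_other

# Point the SFTP users' chroot/home at /mnt/my-example-bucket and they can treat
# the bucket "like a disk", with all the caveats discussed in this thread.
```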

I wish people would stop using the words "best practice" to mean "the only true way". "Common practice" is better suited for what it actually means. Yes, it's not common practice. Yes, you should know what you're getting into and what the limitations are, as always. That doesn't mean you shouldn't do it, ever, at all. Right tool for the job and all that.

rdymek commented 7 years ago

This is not a debate, as it comes directly from Amazon as bad practice. I am an Amazon trainer and consultant for many Fortune 500 companies. This is bad practice. The reason these tools exist is that many people do not know how to use S3 properly, and these tools are terrible for that purpose.

Agreed that the Storage Gateway option has a small cost to it, but keep in mind AWS is an enterprise product, not consumer grade. It is in no way meant to compete with GoDaddy, for example. These tools may meet a need, albeit very poorly. No enterprise should consider using these tools. What is being described here is more for the consumer using them to 'save money', even at the expense of doing it poorly.

So at the very least, I would add some commentary here on the cons of using these tools and note that they are not meant for the enterprise. An enterprise should have no problem creating a Storage/File Gateway for these purposes. I have seen some very big operations suffer majorly by using these types of tools. This is not a discussion about opinion; this is a discussion about what is considered right and wrong in AWS (directly from AWS). Neither I nor Amazon would ever recommend the use of these tools in an enterprise. For someone running a blog? Perhaps. But even then, the use of these tools tells me the architecture needs work.

The best design is to simply leave the data on S3 and let it serve the files directly rather than storing them on the local server. As soon as you sync to the local server, you lose scaling and high availability, you reduce file security (S3 is more secure than your EC2 instance), and you waste money on EBS storage to replicate something from S3, which costs a fraction as much.
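
As one hedged illustration of letting S3 serve files directly instead of copying them onto a server (names are placeholders):

```bash
# Upload the object once...
aws s3 cp report.pdf s3://my-example-bucket/reports/report.pdf

# ...then hand out a time-limited link instead of hosting the file on EC2.
# The URL below is valid for one hour.
aws s3 presign s3://my-example-bucket/reports/report.pdf --expires-in 3600
```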

rdymek commented 7 years ago

What's cheaper than an SFTP t2.micro instance mounting S3? Not running the t2.micro at all and letting S3 be the source, removing SFTP altogether. Simply send files directly to S3. That's part of the point: the use of these tools does nothing but highlight a more expensive, less efficient, less scalable design.

bgdnlp commented 7 years ago

S3 buckets are public. Some people/companies have a problem with that and want to limit access, so using S3 directly is not the right choice. Storage gateway can mitigate that problem, but not everyone is a Fortune 500 company and can shrug off hundreds of dollars every month on something that has "cheap" as one of the main arguments going for it (S3). Similarly, not everyone has Fortune 500 problems, requirements or resources. For some, an EC2 instance mounting a bucket is good enough for the purpose as long as all it does is upload/download, no file-like manipulation.

The guide starts the section about S3 file systems by explaining that S3 is not a file system and it shouldn't be used like that. So the commentary is already there.

Don't get me wrong, I'm just a random guy on the internet and my opinions are my own, don't represent the guide in any way. That said, your insight is welcome. You could join the Slack channel and discuss there. Pull requests are also welcome as far as I've seen.

FernandoMiguel commented 7 years ago

@bgdnlp S3 buckets aren't public by default (well, no more than any other element of the public cloud). Assuming you trust Amazon's trust model, you are in total control of permissions and access levels.

QuinnyPig commented 7 years ago

Completely agree that it's not a best practice, but it is a reality of how companies use S3.

The disclaimer that's there is a good middle of the road; condescendingly telling people "you're doing it wrong!" isn't going to win hearts and minds, and it decreases the utility of the guide for many real-world scenarios.

bgdnlp commented 7 years ago

S3 buckets aren't public by default (well, no more than any other element of the public cloud). Assuming you trust Amazon's trust model, you are in total control of permissions and access levels.

By public I mean that they are accessible from the public internet and are not subject to VPC restrictions. That means (1) an attacker would only need an authorized user's credentials to gain access, which goes against multi-layer security, and (2) it's an access point for an inside user to exfiltrate unauthorized data.

The only way I know of to protect a bucket on the network level is to whitelist an authorized set of IP addresses. Assuming that isn't practical, another way I know of is to set up some kind of gateway to S3 and allow access from that gateway only. Something the storage gateway would be great for, except for the price. Yet another way involves DNS hijacking, but I'm not going to go there.

Happy to be educated. So far no consultant was able to give me a better solution.

FernandoMiguel commented 7 years ago

@bgdnlp

The only way I know of to protect a bucket on the network level is to whitelist an authorized set of IP addresses. Assuming that isn't practical, another way I know of is to set up some kind of gateway to S3 and allow access from that gateway only. Something the storage gateway would be great for, except for the price. Yet another way involves DNS hijacking, but I'm not going to go there.

A bucket can be made accessible only from a VPC (s3 peering). Then all you need to do is have a jump host in that VPC (or a VPN) to access those files.
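
Presumably this refers to an S3 VPC endpoint plus a bucket policy that denies any request not arriving through it. A rough sketch (all IDs, the region, and the bucket name are placeholders; note that such a Deny also locks out console access from outside the VPC):

```bash
# Create a gateway VPC endpoint for S3 and attach it to the VPC's route table.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0

# Bucket policy: deny all S3 actions unless the request comes through that endpoint.
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyUnlessThroughVpcEndpoint",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-example-bucket",
                 "arn:aws:s3:::my-example-bucket/*"],
    "Condition": {"StringNotEquals": {"aws:sourceVpce": "vpce-0123456789abcdef0"}}
  }]
}
EOF
aws s3api put-bucket-policy --bucket my-example-bucket --policy file://policy.json
```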

DavidTPate commented 7 years ago

@rdymek Could you please provide us with a link and quote to where this is called a bad practice by Amazon? I think it would be helpful to include within the current document for people to take a look at.

This is not a debate, as it comes directly from Amazon as bad practice.

bgdnlp commented 7 years ago

A bucket can be made accessible only from a VPC (s3 peering). Then all you need to do is have a jump host in that VPC (or a VPN) to access those files.

That's what I said. I think. Gateway, proxy, jump host, whatever you want to call it. You upload the file to that host and the host uploads it to the bucket. Unless you mean something that does some kind of port forwarding, in which case still that's what I said, DNS hijacking. Because an S3 bucket name will always resolve to a public IP address, in which case the jump host will be ignored. Unless you make your DNS resolve the name to an internal IP or you get fancy with routing. Did I miss anything?

rdymek commented 7 years ago

1. S3 direct access is much more secure than s3fs (it meets just about any compliance requirement). By introducing s3fs you actually weaken the security tremendously.
2. IAM or bucket policies can control access with conditions restricting it to only certain addresses if wanted/required. In fact, conditions can go far more granular, even including time of day, etc.
3. When using s3fs you are killing your performance. S3 can handle far more throughput than a t2.micro could ever dream of, and there is a double performance penalty (write to the EC2 instance, then write to S3). You will be constrained by the bandwidth and write limitations of the t2.micro.
4. Using just S3 alone is cheaper than S3 plus a t2.micro.
5. Most FTP clients now allow direct communication with S3. Though not FTP itself, the clients typically support this natively.
6. Simply install the AWS CLI and do straight copies to S3 (see the sketch after this list). This performs better than S/FTP and has a simpler login/command structure: one command to copy a file to S3, versus multiple commands when using FTP.
7. Auditability. S3 is very easy to audit between S3 access logs and CloudTrail logs. Those capabilities are lost under s3fs.
8. Encryption. Though s3fs supports some encryption, it is quite frankly awful. S3 encryption allows for integration with KMS, CloudHSM, your own keys, Amazon-generated keys, etc., all with zero performance impact.
9. FTP is dead and convoluted. S3 can be accessed directly from anywhere it's needed, without VPNs or any scaling issues. Why on earth would I use FTP?
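
To make points 2, 6, and 8 concrete, a small sketch (bucket, key alias, and CIDR are placeholders):

```bash
# Points 6/8: one command to copy a file straight to S3, encrypted with a KMS key.
aws s3 cp backup.tar.gz s3://my-example-bucket/backups/ \
  --sse aws:kms --sse-kms-key-id alias/my-example-key

# Point 2: a bucket-policy condition can pin access to an address range
# (or a time window via aws:CurrentTime), e.g.:
#   "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}}
```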

Bottom line, there is ZERO reason to use these tools. And I could take your t2.micro instance, save you money, increase security, and increase performance.

My point is that simply suggesting these tools perpetuates bad architecture. I teach for AWS, and I always mention these tools. Then I say, "there may be very rare edge cases where this could make sense, but if you are tempted to use them, think twice, because it's usually a bad design that leads up to the use of these tools."

I will dig up references as I can. But I personally work directly with AWS engineers, and every one of them will say these are a terrible idea and never lead to good things. Test it for yourself; open an AWS ticket and ask them.

I will follow up with references.

Ryan

bgdnlp commented 7 years ago

Agree with about half your points, but none of them solve the problem I mentioned. Though it sounds like you didn't read my example, which is understandable. Specifically:

9) FTP is dead and convoluted. S3 can be accessed directly from anywhere it's needed, without VPNs or any scaling issues. Why on earth would I use FTP?

S3 access from anywhere is exactly what this use case wants to prevent. It's a point against S3, not for it.

I teach for AWS, and I always mention these tools. Then I say, "there may be very rare edge cases where this could make sense, but if you are tempted to use them, think twice, because it's usually a bad design that leads up to the use of these tools."

Isn't that basically what the guide is also doing? The language is not as strong, but it mentions the pitfalls of using these tools and then lists (some of) them. Since this is a community-driven guide, I think it's good that it takes a more neutral or mildly opinionated stance. It potentially avoids controversies like this one.

rdymek commented 7 years ago

The point about 9: S3 has the CAPABILITY of being accessed from anywhere, but it can be locked down to single IPs (a single line in an IAM condition), single times of day, heck, even just a single transaction. Please tell me WHY FTP would be better here?

But even if FTP is wanted, most FTP clients support direct S3 now. Again, why/how would FTP be a better option?

Ryan

bgdnlp commented 7 years ago

I already said why/how in my previous posts. Company wants employees to be able to access S3 on the road or at home, but without exposing S3 to the world on a network level. Or exposing the world to S3. That is normally solved with VPN. But that doesn't work with S3. IP whitelisting is not practical when it would mean whitelisting half the world, or even one country. Again, see my previous replies.

FTP isn't wanted. SFTP is one way of doing it. What is wanted is "internal-only" S3 access, which is not possible by definition, since S3 is only present in the public space.

FernandoMiguel commented 7 years ago

@bgdnlp FTP isn't wanted. SFTP is one way of doing it. What is wanted is "internal-only" S3 access, which is not possible by definition, since S3 is only present in the public space.

How is (S)FTP any more secure than S3 policies? It's also exposed to the world, and it will grant the user access to all the same files. S3 can be accessed from a VPC only, so you can whitelist a VPN endpoint.

bgdnlp commented 7 years ago

No, it isn't exposed. Yes, you can whitelist it, but it's useless, as I already said.

But never mind, that's been quite enough of that.

rdymek commented 7 years ago

S3, using the AWS CLI, with secret keys along with MFA. No reason to manage IPs, and it meets the customer's requirements. Exposed keys are a non-issue with MFA. One could take this a step further with Active Directory federation and the use of roles. The point is, these tools do not solve a problem; they create more problems than they solve. So why would they be suggested?
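
For instance, a sketch of MFA-backed CLI access (account ID, user name, and token code are placeholders):

```bash
# Exchange long-lived keys plus an MFA code for temporary session credentials.
aws sts get-session-token \
  --serial-number arn:aws:iam::123456789012:mfa/example-user \
  --token-code 123456 \
  --duration-seconds 3600

# Export the returned AccessKeyId/SecretAccessKey/SessionToken, and S3 requests
# will then satisfy an IAM condition such as:
#   "Bool": {"aws:MultiFactorAuthPresent": "true"}
```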

The FTP example is not unique (you aren't the first person/company who wants S/FTP), but again, it is not the most secure, not the cheapest, and is the poorest-performing model. And if anyone thinks SFTP is more secure than S3, even in the scenario offered here, they are sorely mistaken.

JCBarry commented 7 years ago

I've been following along here a bit, and having just reread the S3 section in question, I feel the guide adequately flags the inherent issues with this approach. However, in the spirit of educating people, it needs to be mentioned as a possible approach, and it obviously needs to be vetted for each individual use case.

Nothing at AWS is one-size-fits-all. This guide's mission is to educate folks on what's possible and how it's possible, so I think the section is fine as written with this context.

At most, I could see making a stronger statement that this isn't a great idea given the already-mentioned drawbacks, and I'd be happy to review a pull request with such language (as I think @bgdnlp already said).

jlevy commented 7 years ago

Lot of discussion here, which is fine, but to bring it back to being more actionable:

The current bullet "S3 as a filesystem" at https://github.com/open-guides/og-aws#s3-tips does mention limitations, but certainly could be improved.

Two actionable items that might arise from this:

minac commented 7 years ago

Let me start by saying that I agree that s3fuse is not how S3 is supposed to be used. S3 is an eventually consistent object store, not a block store, and s3fuse merely simulates a filesystem on top of it. That said, until very recently certain regions did not have access to EFS, and people needed/wanted to access the same data from multiple EC2 instances.

So here are the tips I found along the way, in case you still want to use s3fuse:

I hope this helps.