ipfs / infra

Tools and systems for the IPFS community
MIT License

Storage nodes #89

Closed jbenet closed 7 years ago

jbenet commented 8 years ago

This way I can help pin important things, and though it's not on the backbone for fast speeds, I can manually manage the disks.

davidar commented 8 years ago

:heart:

I'm just lucky to have awful outbound bandwidth so that I'm not tempted to do this myself :)

davidar commented 8 years ago

@jbenet After this, feel like building a petabox? ;)

jbenet commented 8 years ago

Have seen them! they're so cool. they heat up the archive's main hall.

whyrusleeping commented 8 years ago

@jbenet what kind of redundancy do you want? If you're doing 4 disks, I would recommend RAID10, which cuts your total storage in half, but if a disk fails you're fine. If you put 4 disks in RAID5 and you lose one, it's going to take 5-10 times longer to rebuild than RAID10 will (which is quite a long time when we're looking at 6TB disks). If you're doing 5 or more disks, you will want to do RAID6 (2x parity).

But if you don't care about redundancy, then press the RAID0 button and let's take this train to storage town!
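
For a rough sense of those trade-offs, here's a back-of-the-envelope sketch (illustrative only; real arrays lose a bit more to metadata and filesystem overhead):

```go
package main

import "fmt"

// usableTB returns the usable capacity for n equal disks of sizeTB each
// under a few common RAID levels. Illustrative approximation only.
func usableTB(level string, n int, sizeTB float64) float64 {
	switch level {
	case "raid0": // striping, no redundancy
		return float64(n) * sizeTB
	case "raid10": // mirrored stripes, half the raw capacity
		return float64(n) / 2 * sizeTB
	case "raid5": // one disk's worth of capacity lost to parity
		return float64(n-1) * sizeTB
	case "raid6": // two disks' worth lost to double parity
		return float64(n-2) * sizeTB
	}
	return 0
}

func main() {
	for _, lvl := range []string{"raid0", "raid10", "raid5", "raid6"} {
		fmt.Printf("%-7s with 4x6TB: %.0f TB usable\n", lvl, usableTB(lvl, 4, 6))
	}
}
```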

As for the drives themselves, I personally use WD Reds. According to the reviews, people have issues with them being DOA, but I'm 10 for 10 on getting good drives.

While I don't use them (mainly because they used to be a lot more expensive), Hitachi (now HGST) makes some of the best drives around, and you can't go wrong buying them. http://www.newegg.com/Product/Product.aspx?Item=N82E16822145973&ignorebbr=1&cm_re=ppssNASHDD-_-22-145-973-_-Product

whyrusleeping commented 8 years ago

I've been debating building another storage node for myself, but I think I'm going to wait for the SSDPocalypse.

cryptix commented 8 years ago

Personally I don't like all that plugged-in stuff lying around. We've had good experiences using HP ProLiant MicroServers with FreeNAS on them (ZFS <3)

ghost commented 8 years ago

@davidar for the short/medium term, could you recommend any hosting provider offering large HDDs? I'll get a node with them.

whyrusleeping commented 8 years ago

@cryptix mmmmm, those are pretty. Only four drive bays though... and it doesn't appear to have anywhere for a small boot drive, so unless we feel like being super brave and booting from ZFS, we could only get three drives into the RAID, which limits us on space.

I do agree with @cryptix though that having the drives in an external enclosure, plugged in via USB, seems super sketchy.

There is this guy: http://www.amazon.com/Synology-DiskStation-Diskless-Attached-DS1813/dp/B00CRB53CU

eight drive bays and looks to run linux. Super tasty looking :)

cryptix commented 8 years ago

@whyrusleeping We removed the top CD-Drive (spinning plastic, lol) and replaced it with the drive to boot from. I'm not sure if this is still possible on the gen8 models, as it looks like they have a smaller CD drive.

davidar commented 8 years ago

@lgierth I've heard about https://www.ovh.com/us/dedicated-servers/storage/ but can't vouch for them personally.

ghost commented 8 years ago

Hetzner now has 6x4TB boxes for 82 EUR/month, with 30TB outgoing traffic: https://www.hetzner.de/hosting/produkte_rootserver/sx61

I'd just get one of these and we'll have some headroom for a while. @whyrusleeping would like RAID6 on ZFS or btrfs.

kyledrake commented 8 years ago

I have experience here.

Hetzner is lower priced, but has low bandwidth caps (100Mbit). If you get DDoSed they null route your server. They've also been whining about the netscan stuff (you'll need to filter out local IP scans).

OVH FS is much better (500Mbit burstable to 1Gbit, with 160Gbps DDoS mitigation). I've had very good success with them for infrastructure, particularly their "enterprise" dedicated servers. The one annoying thing is that their billing is not automatic.

I use OVH as part of the Neocities infrastructure, and have a Hetzner server for backups.

jbenet commented 8 years ago

Sounds good, maybe let's try out OVH then?

kyledrake commented 8 years ago

Using OVH, with Hetzner as the "different centralized host" backup, might be a good approach here.

ghost commented 8 years ago

OVH has conditions for the bandwidth guarantee though, and I can see IPFS fitting several of them:

Bandwidth is no longer guaranteed when the server or servers are used for the following activities:

  • Anonymizing service (proxy), CDN service
  • Storage or file exchange platform (especially but not exclusively cyberlocker)
  • Streaming
  • Download platform
  • Service for bypassing limits imposed by the download platforms
  • VOD viewing platform (videos on demand)
  • Servers used for downloads and file sharing on P2P networks (especially non-exhaustive: Seedbox).

I'd ask support to clarify whether IPFS is affected by these -- or do you want to do that as an existing customer, @kyledrake?

ghost commented 8 years ago

Reopening this issue because OVH's UI is hell. Any other suggestions for dedicated storage providers?

Otherwise I'd just say let's go with S3? Is the S3 blockstore a thing?

kyledrake commented 8 years ago

@whyrusleeping and I poked at the S3 blockstore a few days ago, and ran into some issues. Apparently the version of the S3 library it was implemented against is now obsolete, but there's a newer fork that has caught up and improved on it.

@whyrusleeping was going to look into updating it, but then discovered the license was LGPL, which could be a problem. @jbenet I think you might be the decision maker on that one.

It would be really great to use the S3 blockstore, assuming we can get it to work fast enough (S3 is slooooow and not designed for data chunking; careful attention needs to be paid to performance, and caching of some sort may be required).

whyrusleeping commented 8 years ago

@kyledrake let's switch to using this lib: https://github.com/rlmcpherson/s3gof3r -- it meets my criteria of:

  1. not LGPL
  2. not 100 bajillion lines of code
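
For reference, here's roughly what streaming a block to and from S3 with s3gof3r looks like (a sketch from memory of its README -- the bucket and key names are placeholders, so double-check the signatures against the repo):

```go
package main

import (
	"bytes"
	"io"
	"log"
	"os"

	"github.com/rlmcpherson/s3gof3r"
)

func main() {
	// Credentials from the usual AWS_* environment variables
	// (assumption: EnvKeys is the helper for this, per the README).
	keys, err := s3gof3r.EnvKeys()
	if err != nil {
		log.Fatal(err)
	}

	s3 := s3gof3r.New("", keys)           // default S3 domain
	bucket := s3.Bucket("my-ipfs-blocks") // placeholder bucket name

	// Streaming upload of one block.
	w, err := bucket.PutWriter("blocks/QmExample", nil, nil)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(w, bytes.NewReader([]byte("block data"))); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}

	// Streaming download of the same block.
	r, _, err := bucket.GetReader("blocks/QmExample", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()
	io.Copy(os.Stdout, r)
}
```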

kyledrake commented 8 years ago

A bit off topic, but some thoughts to consider RE S3 performance:

As I understand it, S3 allows HTTP keepalive and parallel multipart uploads. It also only charges once an HTTP method is executed, not for the connections themselves (needs proof?).

In combination, I believe an S3 library could be built that has lower latency for our use case than the current models allow.

My idea is as such:

For keepalive, you could optimistically create a pool (a thread pool, or whatever the Go equivalent is) of active keepalive connections to S3 (testing would reveal a good number, but try one per core to start?), reuse them for I/O, and refresh the pool as necessary. This depends on those keepalive connections staying up for a sufficient amount of time without any activity, which is an S3 policy question I haven't found the answer to yet.
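
In Go, much of this falls out of the standard library's connection pooling. A minimal sketch (the endpoint and pool sizes are placeholders, not measured values):

```go
package main

import (
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"runtime"
	"time"
)

// newS3Client returns an http.Client whose Transport keeps idle (keepalive)
// connections to the S3 endpoint open for reuse, so most requests skip the
// TCP/TLS handshake. The numbers here are starting points, not tuned values.
func newS3Client() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        runtime.NumCPU() * 4,
			MaxIdleConnsPerHost: runtime.NumCPU(), // "one per core to start"
			IdleConnTimeout:     30 * time.Second, // depends on S3's idle policy
		},
		Timeout: 60 * time.Second,
	}
}

func main() {
	client := newS3Client()
	// Repeated requests to the same host reuse pooled connections.
	for i := 0; i < 3; i++ {
		resp, err := client.Get("https://s3-us-west-2.amazonaws.com/")
		if err != nil {
			log.Fatal(err)
		}
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()
		log.Println("status:", resp.Status)
	}
}
```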

Parallel upload could possibly be used to improve transfer performance as well, though at 256KB chunks it may not make a big difference (testing may be needed to tune this).
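
For what it's worth, the stock aws-sdk-go uploader already does parallel multipart; a sketch (bucket, key, and file names are placeholders) -- keeping in mind that S3 multipart parts have a 5 MB minimum, so it only helps for objects well above our 256KB block size:

```go
package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	f, err := os.Open("large-object.bin") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sess := session.Must(session.NewSession())

	// Uploader splits the object into parts and uploads them concurrently.
	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.PartSize = 8 * 1024 * 1024 // 8 MB parts (assumption: tune by testing)
		u.Concurrency = 4            // parallel part uploads
	})

	out, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-ipfs-blocks"), // placeholder bucket
		Key:    aws.String("archives/example.bin"),
		Body:   f,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("uploaded to", out.Location)
}
```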

Combine this with de-structuring the datastore (don't split chunks into directories based on hash; that's not necessary for S3 because it's a K/V store), and I think we'll start to see some more reasonable performance numbers for using IPFS with S3.
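
By way of illustration (hypothetical helpers, not the actual datastore code or sharding scheme), the difference is just in how a block's hash maps to an S3 object name:

```go
package main

import "fmt"

// shardedKey mimics a directory-sharded on-disk layout that nests blocks
// into prefix/suffix directories. Useful on filesystems, pointless on S3's
// flat keyspace. (Hypothetical helper for illustration only.)
func shardedKey(hash string) string {
	return "blocks/" + hash[len(hash)-3:] + "/" + hash
}

// flatKey stores the block directly under its hash; S3 is a K/V store,
// so there is no directory fan-out to worry about.
func flatKey(hash string) string {
	return "blocks/" + hash
}

func main() {
	h := "CIQOMBKARLB7PAITVSNH7VEGIQJRPL6J7FT2XYVKAXT4MQPXXPUYUNY"
	fmt.Println(shardedKey(h)) // blocks/UNY/CIQOMB...
	fmt.Println(flatKey(h))    // blocks/CIQOMB...
}
```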

Additionally, there may need to be some sort of local caching layer for the keys. We don't want IPFS to constantly hit S3 to check for hashes as P2P requests for content come in; that could add up to an expensive operation and cause performance issues. Unfortunately I don't believe there is an event-driven way for S3 to announce new data to the local node, so this wouldn't be ideal for a cluster of IPFS nodes using the same datastore.
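
A minimal sketch of what that caching layer could look like (the BlockStore interface and names here are made up for illustration, not the real go-ipfs types):

```go
package s3cache

import "sync"

// BlockStore is a stand-in for whatever interface the S3 blockstore exposes
// (hypothetical; not the actual go-ipfs interface).
type BlockStore interface {
	Has(key string) (bool, error)
	Put(key string, data []byte) error
}

// cachedStore remembers which keys are known to exist, so repeated Has()
// checks don't each turn into an S3 round-trip. Caveat from above: with
// several nodes sharing one bucket, another node's writes won't show up
// here until we miss locally and re-check S3.
type cachedStore struct {
	mu    sync.RWMutex
	known map[string]bool
	inner BlockStore
}

func NewCachedStore(inner BlockStore) *cachedStore {
	return &cachedStore{known: make(map[string]bool), inner: inner}
}

func (c *cachedStore) Has(key string) (bool, error) {
	c.mu.RLock()
	hit := c.known[key]
	c.mu.RUnlock()
	if hit {
		return true, nil // answered locally, no S3 request
	}
	ok, err := c.inner.Has(key) // fall through to S3
	if err == nil && ok {
		c.mu.Lock()
		c.known[key] = true
		c.mu.Unlock()
	}
	return ok, err
}

func (c *cachedStore) Put(key string, data []byte) error {
	if err := c.inner.Put(key, data); err != nil {
		return err
	}
	c.mu.Lock()
	c.known[key] = true
	c.mu.Unlock()
	return nil
}
```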

I'm going to do a PoC today with some high-level code for my "keep alive pool" idea and see how the AWS S3 API reacts to it. For now, definitely try s3gof3r! It seems to support parallel multipart uploads already.

https://dzone.com/articles/amazon-s3-parallel-multipart https://aws.amazon.com/s3/pricing/ https://aws.amazon.com/blogs/aws/amazon-s3-multipart-upload/

/CCing @rlmcpherson in case this is interesting to him. :smile:

kyledrake commented 8 years ago

I haven't been able to pin down the keepalive timing after headers are sent, but pre-headers it's been fairly consistent -- an idle connection gets dropped after about 23 seconds:

```
$ time nc s3-us-west-2.amazonaws.com 80

real  0m23.068s
user  0m0.000s
sys 0m0.000s
$ time nc s3-us-west-2.amazonaws.com 80

real  0m23.073s
user  0m0.000s
sys 0m0.000s
$ time nc s3-us-west-2.amazonaws.com 80

real  0m23.249s
user  0m0.000s
sys 0m0.000s
$ time nc s3-us-west-2.amazonaws.com 80

real  0m23.175s
user  0m0.004s
sys 0m0.000s
$ time nc s3-us-west-2.amazonaws.com 80

real  0m23.151s
user  0m0.000s
sys 0m0.000s
$ time nc s3-us-west-2.amazonaws.com 80

real  0m23.187s
user  0m0.000s
sys 0m0.000s
$ time nc s3-us-west-2.amazonaws.com 80

real  0m23.119s
user  0m0.000s
sys 0m0.000s
```

I'll stop spamming this ticket for now, let me know if you'd like me to file a ticket for this topic somewhere.

rlmcpherson commented 8 years ago

@kyledrake I haven't read over this entire issue in detail, but a couple thoughts regarding S3 performance and s3gof3r:

kyledrake commented 8 years ago

@rlmcpherson That's great. I'm looking forward to trying it out!

jbenet commented 8 years ago

Can we move the S3 discussion to some other issue? Otherwise we'll always be off topic here. I suggest a note in https://github.com/ipfs/notes/issues