storj-archived / sips

Storj Improvement Proposals.
GNU General Public License v3.0

Distribute Shards via Geo-Loc-IP #30

Open MeijeSibbel opened 7 years ago

MeijeSibbel commented 7 years ago

Kevin:

Storj is distributed data. To keep the data from being centralized under a data center that has the resources to run hundreds of thousands of nodes, we need a solution that distributes the data using a metric that doesn't rely on node ID closeness to the shard hash mixed with response time.

We (LittleSkunk and I) feel that the best solution would be to use the geo-location features of IP addresses to prevent shards from concentrating in one region/area. If a German data center decides to run 500,000 nodes, the Bridge would still distribute data evenly to each region: Germany gets 1, China gets 1, Saudi Arabia gets 1, Mexico gets 1, etc.
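The even-per-region idea above could be sketched roughly as follows. This is only an illustration in Python, not Bridge code: the function name `distribute_by_region`, the `(node_id, region)` tuples, and the assumption that each node's region has already been resolved from its IP are all hypothetical.

```python
from collections import defaultdict


def distribute_by_region(shards, nodes):
    """Assign shards to nodes by cycling over regions, not over nodes.

    `nodes` is an iterable of (node_id, region) pairs. The region is
    assumed to come from a GeoIP lookup done elsewhere (hypothetical).
    Returns a dict mapping shard -> (region, node_id).
    """
    by_region = defaultdict(list)
    for node_id, region in nodes:
        by_region[region].append(node_id)
    if not by_region:
        raise ValueError("no nodes available")

    regions = sorted(by_region)      # fixed order so results are repeatable
    next_index = defaultdict(int)    # per-region rotation over its nodes
    placement = {}
    for i, shard in enumerate(shards):
        region = regions[i % len(regions)]
        candidates = by_region[region]
        node_id = candidates[next_index[region] % len(candidates)]
        next_index[region] += 1
        placement[shard] = (region, node_id)
    return placement
```

Because the loop rotates over regions first, a region with 500,000 nodes receives the same number of shards as a region with a single node; the node count only affects which node inside the region is picked.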

I realize this adds some complexity. However, doing nothing to prevent this will eventually cause the data to centralize and make Storj pointless (other than the feels about supporting the little guy). It is better to prevent this now, as part of the architecture, so that centralization can't happen and the network stays distributed.

Plus, I think you guys are thinking about using regionalized IP for renters who have requirements to keep their data in a specific region. So this could simply be an outgrowth of that. If you're going to specifically aim shards at a region, you could also specifically diversify shards among regions, so as to prevent all the data going to the place that has the most nodes.

The goal here is to prevent data centralization. If you have a better way to do that, groovy. Let's do that then. But I think doing nothing is a mistake. Storj needs to stay distributed. That's all. Thanks!


Meije:

What Kevin describes above is especially important once farmers are selected based on performance metrics and/or geolocation. Say a renter lives in the Netherlands, is bound by national data storage laws, and has to store the data within the Netherlands. He needs gigabit farmers. The bridge would then try to select farmers within the Netherlands that meet the selection parameters. However, if there is a data center only a few km away with thousands of nodes, that data center would get all or almost all of his data, meaning that his data is now centralized. With the idea above, the bridge would check whether a shard from a specific file is already stored at a specific GeoIP location and, if so, ignore that location and select another location within the country. This ensures that the data is always stored in a decentralized fashion and prevents data loss if the data center goes offline. Geo-IP zoning, like that used for "no fly zones", would be a good option.
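A minimal sketch of that check, assuming each farmer record already carries a country, a coarse GeoIP location label, and a measured bandwidth. The `Farmer` class, its fields, and the fallback of allowing location reuse once every qualifying location holds a shard are assumptions made for illustration, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Farmer:
    node_id: str
    country: str
    location: str  # hypothetical GeoIP bucket, e.g. a city or zone label
    mbps: int


def pick_farmer(farmers, country, min_mbps, used_locations):
    """Return the first qualifying farmer whose location is still unused."""
    for farmer in farmers:
        if farmer.country != country or farmer.mbps < min_mbps:
            continue
        if farmer.location in used_locations:
            continue
        return farmer
    return None


def place_file(shard_ids, farmers, country, min_mbps):
    """Place each shard of one file at a different GeoIP location."""
    used_locations = set()
    placement = {}
    for shard_id in shard_ids:
        farmer = pick_farmer(farmers, country, min_mbps, used_locations)
        if farmer is None:
            # Assumption: once every qualifying location holds a shard,
            # start a new round instead of failing the upload.
            used_locations.clear()
            farmer = pick_farmer(farmers, country, min_mbps, used_locations)
        if farmer is None:
            raise LookupError("no farmer matches the selection parameters")
        placement[shard_id] = farmer
        used_locations.add(farmer.location)
    return placement
```

With this rule, a data center that exposes thousands of nodes at one location receives at most one shard of the file per round, however many nodes it runs.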

There is one concern with this technique: many ISPs provide IPs that point to one specific hub, so IP tracking of all nodes on that hub would just point to one geographical location.

braydonf commented 7 years ago

Some of this is detailed in SIP6 in the future section. Perhaps we can update that to reflect these thoughts.

AndreyNazarchuk commented 6 years ago

Why does it matter if all the data is in the same place if it's all split up and encrypted? I would think it makes it marginally less secure to have one database. A big data center gets hacked more often but also has better security; an individual farmer gets hacked less often but is logically less secure. Shards should get distributed based on ping time so that the user can access them quickly. This would obviously make more sense if there were more than one bridge, though.

tempestb commented 6 years ago

@AndreyNazarchuk There is some merit to having some nodes closer for latency reasons, but there are also reasons to have data distributed to different regions of the world for security. If all of your data is held close to your location and there is a disaster, you lose all your data. This is why major data centers provide Local (Data Center), Regional (your local region's Data Centers), and Geo (Data Centers across country borders, typically) options. The services get more expensive the further out you place your data.

Storj intends to (at some point) allow renters to choose regions they want to distribute their shards to. So if you want your data local, you'll have that option. Whereas, if you do business in different areas of the world, you may want to distribute your data more geographically. Every user has different needs.

ghost commented 6 years ago

This feature would be of great value to anyone trying to store data when there are constraints on where (which country) you can store the data. I know there are some countries, like Sweden, that forbid companies from moving citizen data outside the country. Will be keeping an eye on this feature 👍