stellar / stellar-protocol

Developer discussion about possible changes to the protocol.

Deprecate Data Entries #221

Open JeremyRubin opened 5 years ago

JeremyRubin commented 5 years ago

See https://github.com/stellar/stellar-protocol/pull/199 for some discussion.

The data entries API doesn't really make sense for most use cases; in most cases, the person is better served doing the thing off-chain.

The only documented use case I could find with a cursory search is https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/stellar-dev/mzIP9OuegyI/bnPP0LVcAAAJ.

Pinning Stellar.toml is a nice idea, but there are so many practical issues with the approach that something much simpler makes more sense -- issuers provide time-windowed certificates which guarantee a period where the toml is up to date, and if you issue a contract with a mintime/maxtime matching that period, you should expect (but not be guaranteed) that it works.
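A minimal sketch of the time-window check described above, under assumptions not stated in the thread: the `TomlCertificate` type and its field names are hypothetical (not part of any Stellar API), timestamps are Unix seconds, and the transaction has both time bounds set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TomlCertificate:
    """Hypothetical issuer statement: the toml is current in [not_before, not_after]."""
    not_before: int  # Unix seconds
    not_after: int   # Unix seconds


def covered_by(cert: TomlCertificate, min_time: int, max_time: int) -> bool:
    """True when the transaction's whole validity window lies inside the certificate's."""
    if min_time > max_time:
        raise ValueError("min_time must not exceed max_time")
    return cert.not_before <= min_time and max_time <= cert.not_after
```

A contract whose window extends past either end of the certificate gets no assurance from it, which matches the "expect, but not be guaranteed" framing.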

Compatible Removal

I propose that we soft-deprecate the data entries by removing data entries from the public facing APIs while keeping them in consensus should someone be relying on the functionality currently. We should also strongly communicate the deprecation, and collect community feedback for a period of one year.

After the one year feedback period, a policy layer (read -- the software that nominators use to select transactions, but is not enforced by validators otherwise) should block transactions with data entry modifications.

Then, after an additional period, we can explore significantly increasing the reserve for Data Entries -- this enables contracts relying on Data Entries to still be processed if the person sends additional funds to the account to cover the unplanned excess.

I'm not sure I'd ever advocate full removal of data entries, given that a user may already have a compatibility issue, but perhaps with a long enough horizon for community input we can fully remove the feature.

Repairable Removal

Alternatively, I propose that we, in the next update, fully remove data entries (transactions with that operation are invalid). If there is strong motivation for the feature in the future, I propose that we could then add back an identical feature such that those transactions would again become valid, or a new type of data entry could be added to better match the needs of users. This is semantically somewhat equivalent to dramatically increasing the minfee for a data entry operation (from the perspective of an engineer implementing a system to handle data entries).

MisterTicot commented 5 years ago

I have a lot of data about data entries and their usefulness; I'll try to get to the point.

Why would you store data on a decentralized Ledger?

The same reason that you're posting any operation: you want authenticated information to be publicly available in a resilient manner.

Decentralized databases are actually going to solve a lot of security issues that were a pain to deal with so far.

What are the possible use-cases?

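Simple ones:

Storing your public key
Storing your wallet's address
Storing dApps configuration
Storing hash of software/files/source-code

Advanced one (dApp based on a side-chain):

Blog / forum / decentralized social media
Decentralized DNS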

How does it fit in Stellar?

This feature is needed so we extend account configuration without having to modify the protocol. It's what is used in "Bootstrapping Multisig Coordination", and it could have been used for "homeDomain" and "inflationDest" as well.

In the future, we can expect external services to introduce vendor-specific configuration as well. We can also expect new use cases to emerge from this feature; for example, I'm promoting the idea of storing webapp signatures on the ledger in the web standard milieu.

Anything else?

By simplifying the process of creating custom networks, Stellar has a chance of becoming the MySQL 3.0. The consequences of such an achievement would be absolutely huge for the project.

Can this feature be abused?

Yes, but...

In terms of spam, this feature is more expensive than other operations, like sending a payment of one stroop with a URL as memo or spamming tiny trades.

In terms of message sending potential, the cost of one base fee per 64 bytes is already a significant filter on what one would be willing to send over the ledger.
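To make the "one base fee per 64 bytes" arithmetic concrete, here is a rough sketch. The base fee of 100 stroops is an assumed parameter, not a figure taken from this thread, and the function name is illustrative.

```python
VALUE_BYTES_PER_ENTRY = 64  # value size limit per data entry, as stated in the thread
BASE_FEE_STROOPS = 100      # assumed base fee per operation; adjust to the network's value


def fee_to_publish(payload_len: int, base_fee: int = BASE_FEE_STROOPS) -> int:
    """Minimum fee in stroops to write payload_len bytes, one operation per 64-byte chunk."""
    if payload_len < 0:
        raise ValueError("payload_len must be non-negative")
    # Integer ceiling division: number of 64-byte chunks needed.
    ops = (payload_len + VALUE_BYTES_PER_ENTRY - 1) // VALUE_BYTES_PER_ENTRY
    return ops * base_fee
```

This counts fees only; it leaves out any reserve that each new entry may lock up on the account.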

Can we prevent arbitrary data publishing on the ledger?

No.

There are many ways to publish arbitrary data on the ledger, some of them being cheaper than manageData. Removing the manageData operation will only prevent legit uses.

JeremyRubin commented 5 years ago

Thanks for taking the time to respond.

The same reason that you're posting any operation: you want authenticated information to be publicly available in a resilient manner.

So there are two claims here: the data is authenticated and the data is publicly available.

Authenticated data can be handled by bundling the data with signatures from the account in question. Public availability can be handled by a separate DHT, perhaps with some notion of version control.

Notably, stellar-core has no requirement (afaict) to store more than 32 bytes per data entry -- the implementation can discard the data immediately and keep a 32-byte (even salted!) hash of the key, without breaking any consensus rules.
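A small sketch of what keeping only a salted 32-byte hash of the key could look like. This is an illustration of the idea, not a description of how stellar-core stores entries; the function name, the 16-byte salt, and the example entry name are all hypothetical.

```python
import hashlib
import os


def key_commitment(entry_name: bytes, salt: bytes) -> bytes:
    """Return a 32-byte salted SHA-256 digest standing in for a data entry's key."""
    return hashlib.sha256(salt + entry_name).digest()


# A node-local random salt; the entry's value itself is never stored.
salt = os.urandom(16)
digest = key_commitment(b"example.entry", salt)
```

A node holding only `digest` can still tell whether a given key exists on the account, but it cannot serve the value back to anyone.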

What are the possible use-cases?

Simple ones:

Storing your public key

Is this a different one than mentioned?

Storing your wallet's address
Storing dApps configuration
Storing hash of software/files/source-code

Advanced one (dApp based on a side-chain):

Blog / forum / decentralized social media
Decentralized DNS

Again, there is no storage requirement for data entries (beyond hash) for a stellar-core node.

A system which tracks the long-term queryable state should be built as an entirely separate 'image' off of the ledger set. This points to including data as a type of memo or something in a transaction, but not storing account state.

How does it fit in Stellar?

This feature is needed so we extend account configuration without having to modify the protocol. It's what is used in "Bootstrapping Multisig Coordination", and it could have been used for "homeDomain" and "inflationDest" as well.

In both of these protocols, the homedomain serving a stellar.toml is sufficient for the data to be served to the world. Furthermore, the homedomain isn't strictly needed as a separate DHT could include attestations for various accounts as to what their metadata is (incl routing to a homedomain).

In the future, we can expect external services to introduce vendor-specific configuration as well. We can also expect new use cases to emerge from this feature; for example, I'm promoting the idea of storing webapp signatures on the ledger in the web standard milieu.

Storing them in the ledger, maybe, but in the account state, perhaps not.

MisterTicot commented 5 years ago

It would be nice to have at least a solid rationale for deprecating data entries before continuing to push that proposal on other threads.

The statement that data is better off-chain is actually not true: on-chain data is a common requirement for dApps; otherwise they wouldn't be called decentralized.

Several services relying on data entries already exist, and this feature is actually valuable, so unless there's a critical flaw to fix, this operation must be maintained.

JeremyRubin commented 5 years ago

I think the main points I'm interested in are:

1) Storing them on chain is expensive, and needs to/will become more expensive over time. We may even explore making reserves quadratic with respect to number of entries (I'm not in favor for various reasons, but it's on the table).
2) The protocol doesn't guarantee availability of the data (no validation state is value dependent)
3) Off-chain data is sufficient, and in many cases, better (no size restrictions, no weird formatting requirements, etc) than on-chain data.
4) Desire to store metadata about accounts which you don't have control over
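The thread does not define what "quadratic" reserves would mean. One possible reading, sketched below purely for comparison, is a total reserve proportional to the square of the entry count. The base reserve of 5,000,000 stroops is an illustrative parameter, and both function names are hypothetical.

```python
ILLUSTRATIVE_BASE_RESERVE = 5_000_000  # stroops; a placeholder, not a claimed network value


def linear_reserve(entries: int, base_reserve: int = ILLUSTRATIVE_BASE_RESERVE) -> int:
    """Total reserve when every entry costs one base reserve."""
    return entries * base_reserve


def quadratic_reserve(entries: int, base_reserve: int = ILLUSTRATIVE_BASE_RESERVE) -> int:
    """Total reserve under one hypothetical quadratic rule: entries squared times base reserve."""
    return entries * entries * base_reserve
```

Under this reading an account with 10 entries would lock up ten times more than under the linear rule, and one with 100 entries a hundred times more.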

With respect to the argument that it's decentralized, that's a miscategorization. Data entries are actually more centralized than just "information" which can be served by anyone with no central consistency. Getting stellar data requires running a validator or trust. Most of the protocols I've seen would be better off with a signed serialized XDR blob.

It's unfortunate that there are existing services using Data Entries, but I'm happy to help them figure out how to use a more pragmatic approach if they are incapable of doing so themselves.

MisterTicot commented 5 years ago

Three of those points are opinions about how one should design one's software.

Now that's clearly not a consensus opinion. You have the right to not like that feature, but that's not a good reason to prevent everybody else from using it.

The point about data value not being enforced by the protocol is indeed an issue. That won't be solved by removing this op, though.

On-chain data storage has its use cases - this is a known fact, and that's why blockchains implementing smart contracts also offer data storage capabilities. I don't think we can go forward without you acknowledging this fact.

JeremyRubin commented 5 years ago

Yes, hence I've proposed deprecating but not removing. As stewards of the protocol, there's an obligation to encourage good software engineering.

I agree there is a need to store data on chain -- I'm not sure where you ever got the idea I don't believe that. However, I think that on-chain data storage should be minimally consumed, and I've yet to see a use case which truly benefited from using data entries specifically.

MisterTicot commented 5 years ago

Well, the multisig coordination bootstrapping SEP is an excellent example. It is fairly well designed, by the way - so I'm not even saying that protocol writers shouldn't encourage good design. What I'm saying is that your personal opinion about what is a good design is not necessarily shared by everybody, and that there are good designers out there that are using account entries for good reasons.

Now we can play with words, but practically, if the feature is removed from Horizon before a better alternative is available, it will break things badly, and this is not a good practice.

JeremyRubin commented 5 years ago

I'd love to work towards an objective framework for smart contract data so that use cases and designs can be evaluated appropriately.

I'm trying in good faith to be as opinionless on this issue as possible and to objectively evaluate current data entry practices, but I'm fallible, so sorry if you haven't felt that my objectivity is transparent. I very much want people to build easy to use, inexpensive, robust, etc software and pave a forward looking path for the stellar network -- that's my only goal. Data Entries did not, as far as I know, kill my grandfather or something.

I think there's a couple things we can think about w.r.t. data functionality that help us work towards a more concrete evaluation of the relative merits of approaches:

1) Authentication
2) Revocation
3) Consistency
4) Availability
5) Data Structure
6) Cost to Write
7) Cost to Read
8) Scalability
9) Developer Ease of use
10) API Stability
11) Privacy
12) Censorship Resistance

Are there other properties you care about? Naturally, I've biasedly selected a set of issues where I think the benefits of off-chain data win (the causality is reverse though; I think off-chain data wins because of these factors), but if there are categories I haven't considered please share and then we can work towards a more rigorous evaluation.

MisterTicot commented 5 years ago

Of course! My point was not to cast doubt on your intentions but to insist that there are other beings equally motivated & competent who think that data entries are needed. In fact, the discussion about the multisig coordination bootstrap SEP proved it true.

I'd also underline the fact that blockchains are mostly unknown territory, and nobody knows beforehand which kind of mighty invention will come out of this or that functionality. But we definitely know that account data entries expose a unique set of properties that can't be found off-chain:

It is really two different solutions and there's no such thing as "on-chain is better" or "off-chain is better".

JeremyRubin commented 5 years ago

Ok, so if I can synthesize between our posts:

  1. It respects the signers setup == Authentication
  2. It is public == Availability
  3. It is immutable == Authentication + Consistency
  4. Its history is immutable == Availability
  5. It doesn't depend on external/centralized services == Censorship Resistance + Developer Ease of Use
  6. So it is trustless == Censorship Resistance

So the 12 properties cover all the benefits you think that Data Entries bear?

MisterTicot commented 5 years ago

I suppose, yes.

JeremyRubin commented 5 years ago

For each of those topics I believe off chain data has preferable properties. It's not an exhaustive analysis, happy to go into more detail if you disagree on an individual category.

Authentication

Off-chain data may be signed by the current signers on an account as well as by third party authenticators (e.g., if a field specifies that a service provider is to be used, that service provider may sign to corroborate that the account is a customer). If using the homedomain for hashed content, the data is hash authenticated which is useful for accounts without signers.

On-chain data was signed by the current or prior signers on an account. Current signers may be unaware of the data entries set. There is a race condition in unsetting data entries after changing keys.

Revocation

Off chain data can be revoked by using a version number scheme (signing a statement that it's valid up to sequence number current + 1 billion, and using a sequence bump to speed up invalidation), a TTL, or by rotating keys. It can be made irrevocable (in some sense) using hash authentication for accounts with no signers.
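The sequence-number scheme can be sketched roughly as follows. The `SignedStatement` record and both helper functions are hypothetical, and signature creation and verification are deliberately left out; only the validity rule is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignedStatement:
    """Hypothetical off-chain record; signature checking is out of scope here."""
    account_id: str
    payload: bytes
    valid_until_seq: int  # void once the account's sequence number exceeds this


def issue(account_id: str, payload: bytes, current_seq: int,
          window: int = 1_000_000_000) -> SignedStatement:
    """Create a statement valid up to current sequence number + window."""
    return SignedStatement(account_id, payload, current_seq + window)


def is_live(stmt: SignedStatement, current_seq: int) -> bool:
    """A statement is live while the account's sequence number has not passed its limit."""
    return current_seq <= stmt.valid_until_seq
```

Revoking early then amounts to bumping the account's sequence number past `valid_until_seq`, after which `is_live` returns `False` for every reader that checks the account.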

On chain data can be revoked by removing it from the account in question.

Consistency

Off chain data consistency guarantees are strong given the account owners want consistency, e.g., they properly use version numbers when modifying data.

On chain data consistency guarantees are weak -- data is guaranteed to be consistent within an SCP quorum, but is not guaranteed to be consistent with what a transaction signer anticipated it to be when they created the transaction, due to interleavings.

Availability

Off chain data has no inherent availability guarantees. However, with proper mirroring infrastructure, off chain data can be made more available than the Stellar network itself, since data can be served via static file servers/edge caching infrastructure and reading it requires reaching only one node which claims to have the data. Classic DHT literature applies...

On chain data is not guaranteed to be externally visible in the protocol. This may be amended, but is not the case now. The data is currently as available as a horizon node on the network. Without talking to a trusted set of horizon nodes, the data is fully unauthenticated (e.g. if served from caches), so we do not include this as a potential for availability.

Data Structure

Off-chain data supports arbitrary data structures.

On-chain is currently limited to (char[64], char[64]) tuples.
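To show what the (char[64], char[64]) limit means in practice, the sketch below splits an arbitrary payload into (name, value) pairs that each fit within 64 bytes. The naming scheme (`prefix.index`) is an assumption made for illustration, not a convention from the thread or from any SEP.

```python
from typing import List, Tuple

MAX_FIELD_BYTES = 64  # limit on both name and value, per the tuple shape above


def to_data_entries(prefix: str, payload: bytes) -> List[Tuple[str, bytes]]:
    """Split payload into (name, value) pairs, each field at most 64 bytes."""
    entries = []
    for offset in range(0, len(payload), MAX_FIELD_BYTES):
        name = f"{prefix}.{offset // MAX_FIELD_BYTES}"
        if len(name.encode("utf-8")) > MAX_FIELD_BYTES:
            raise ValueError("entry name exceeds 64 bytes")
        entries.append((name, payload[offset:offset + MAX_FIELD_BYTES]))
    return entries
```

Any structure richer than flat key/value pairs has to be encoded by the application on top of these chunks, which is the restriction being contrasted with off-chain data.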

Cost to Write

Off chain: the cost to write is very cheap and likely decreases over time.

On-chain: the cost to write is expensive, and likely to increase in price over time.

Cost to Read

Off-chain: the cost to read is very cheap and likely decreases over time

On-chain: the cost of reading trustlessly is maintaining full consensus with the Stellar network, which is likely to increase over time.

Scalability

Off-chain: scales relatively well, see classic DHT literature and availability of global CDNs.

On-chain: central bottleneck. Engenders a trade off between the number of accounts/users and the amount of data per account. Potential DoS vector.

Developer Ease of use

Off-chain: No need to know about fees and reserves, can have a one-click interface (once privkeys are loaded in) to deploy new metadata globally instantly. Data files could be queried by (account_id, key) pairs or by a homedomain specified server.

On-Chain: Need to know about fees and reserves and have enough for an account to add the relevant data entries. More restrictions on data formats. Supporting many different protocols makes merging accounts harder potentially. When critical updates need to be applied to many data entries (e.g., because of a hacked service) then there is a flood of transactions writing data entries which raises fee rates forcing accounts to wait with stale data.

API Stability

Off-chain: fine to support legacy features/versions forever as well as serve multiple versions for compatibility.

On-chain: protocols must be in step with SCP changes. E.g., if reserves increase protocols must increase it. If lengths change protocols must adapt. Etc. Difficult for multiple competing standards for things like namespaces to coexist.

Privacy

Off-chain: Queries can go through a data provider of one's choice (perhaps your own as well). Data files could be stored encrypted to a set of keys for those who should be able to decrypt/have 'authenticated' data servers for groups wanting to share metadata privately in-group.

On-Chain: Every query must go through horizon. Data entries in general can't be stored on the network encrypted. All writes are observed by all.

Censorship Resistance

Off-Chain: as censorship resistant as the internet at large.

On-Chain: as censorship resistant as the stellar network ( < internet).

MisterTicot commented 5 years ago

Once again, on-chain vs off-chain is a limiting view because both solutions have their own set of properties that can fit different use cases: this is why you couldn't find a proper off-chain alternative for the multisig bootstrapping SEP on the mailing list.

Your analysis is interesting, but if at this point you're still trying to prove that account data entries are useless, it means you did not take into account the peer input you asked for in the first place.

JeremyRubin commented 5 years ago

I'm not sure what you mean by that. The formal process I'm trying to emulate here is something like:

1) Statement of goal
2) Proposed requirements/relevant properties
3) Agreement on requirements and relevant properties
4) Analysis of families of solution based on the requirements and properties
5) Agreement on which famil(ies) of solutions offers the best trade offs
6) Proposal of concrete solution in the agreed on families
7) Agreement that proposed solution meets goal.

I'm not trying to prove that data entries are useless, I'm trying to show that they don't meet the requirements/properties we'd like to see out of a data solution.

I'm unclear what it insinuates that I haven't taken into account peer input in this discussion. I used peer input to form my requirements and properties proposed in step 2 and the overall goal in step 1.

If you disagree with my conclusion in step 5 that off-chain solutions offer stronger benefits, then please extend the analysis in step 4.

If you don't have more to add at step 4, then I (or someone else who cares about this issue) will begin to draft something for step 6. The reason you haven't seen someone propose a proper alternative using off-chain data is that it hasn't been described yet -- protocol development takes time and there are limited engineering resources in general.

MisterTicot commented 5 years ago

When someone decides to take the lead in solving an issue, I'd expect them to understand, summarize and include peer inputs in their analysis. I hope you're not intentionally leaving out arguments that don't go your way, but unfortunately that's my impression.

In particular, Paul & I took the time to explain and demonstrate that account data entries have a set of required properties that we are currently unable to reproduce off-chain. This fact must figure in the issue analysis.

Also, as I repeatedly said, the present analysis has the bias of opposing two complementary solutions. The angle is to tell which is better between on-chain or off-chain solution - with a clear intention of ruling on-chain out. That leads to overlook the actual complexity of the subject. Depending on the situation, an application designer could choose one option or the other - or both. The issue analysis must account for that possibility.

For example, step 4 is only right in case you want to publish data without using any other Stellar functionality. If you're building a dApp, you will use the Stellar API anyway, so an off-chain solution comes with the burden of having to use an external service on top of that. So that:

Then there are use cases where privacy/cost to read/cost to write/revocability/data size are not relevant. That's how you end up having some sort of applications that would be better handled with on-chain data.

As the analysis misses any real-case scenario, it is easy to overlook elements that would prove the need for an on-chain data solution. In fact, trustlessly publishing off-chain data in relation to an account or a transaction requires linking it somehow from the Ledger. Extending account configuration requires on-chain data. And so on...

We also find a bias toward an off-chain conclusion in the fact that what is compared is an ideal theoretical off-chain solution vs. an admittedly flawed on-chain solution. It must either compare an actual off-chain solution with an actual on-chain solution, or an ideal on-chain solution with an ideal off-chain solution.

I sincerely welcome your effort to get us toward a better understanding and handling of ledger-related data publishing. However, the general impression I'm getting from this first attempt is that it started from the conclusion (we must remove data entries), and that the analysis has been written according to that goal.

To summarize, the problematic points I see on the proposed analysis are:

I'd like those points to be addressed this way:

pselden commented 5 years ago

Seems like a popular use case of Stellar is to attach IPFS hashes to data entries: https://galactictalk.org/d/433-stellar-should-have-a-big-memo-or-data

Note: some in the thread are arguing that we should go even further with data entries.

theaeolianmachine commented 5 years ago

In recent discussions, we've been talking about how IPFS can be a good means to store data on the blockchain in a way that allows more flexibility and less data on Stellar's chain (hopefully solving problems on both sides of this issue). Any contributions to the current PRs (#217, #218) and getting them in shape with the new process for a draft to discuss would be greatly appreciated.

JeremyRubin commented 5 years ago

I think it might be higher-impact to allow serving stellar.toml and other related files through IPFS.

For memos I think typically you want to have the data as available as the transaction itself (e.g., in @p2p you use the memo to see if the transaction was intended for you).

I'm also not sure why you'd want to store data entries at all if your home domain is an IPFS root that you update.

Pedity-Luffy commented 5 years ago

Hi @JeremyRubin, @theaeolianmachine and @MisterTicot. Joining this discussion since this is a topic on which we are working a lot, and seeing it deprecated will leave no scope for our project or for projects that will use Data Entries. What I want to share is that the IPFS data properties and even the memo properties are quite useful and serve a lot of purposes. Removing Data Entries now will essentially kill the scope for adding any type of decentralized application like the ones we have on Stellar.

The application discussed here has been using this approach since October 2018; examples are below.

Decentralized blog https://www.pedity.com/blog/GBD3ECXAO4427NFYIZH6TYSZVX2I76KVUHYYKJIQZUYC3GHA73KHGNNV https://www.pedity.com/article/6f7dddbe572649dc0d7c3d954ca18a23f9b8da5056eb8a71b04954d8fe6f99a9

Decentralized metadata/profile https://www.pedity.com/profile/GBD3ECXAO4427NFYIZH6TYSZVX2I76KVUHYYKJIQZUYC3GHA73KHGNNV

Decentralized fundraiser campaigns https://docs.pedity.com/concepts/#goals

There is a whole lot of possibility of future applications using Data Entries and I hope it is not removed for the sake of potential applications using Stellar.

JeremyRubin commented 5 years ago

Why can't this go into a single IPFS hash entry (e.g., under homedomain)?

Pedity-Luffy commented 5 years ago

@JeremyRubin Putting a multihash in homedomain is not an issue if there is no restriction on the homedomain field. Currently the homedomain has a requirement that it should be in the format of a fully qualified domain name.

JeremyRubin commented 5 years ago

A "requirement" that it "should".

Then it's not a requirement, put whatever you like there!

Pedity-Luffy commented 5 years ago

@JeremyRubin Interesting, if this works out then we don't have any problem with removal of data entries. I will suggest one more thing: homedomain should be renamed to something else, like userdata maybe.