Serialization - Githubissues

adria0 commented 1 month ago

from @davidnevadoc

In halo2curves, the serialization of fields could be improved. There are methods that support different: Endianness Limbs, Bytes, bits Motgomery form, non-montgomery form Some are implemented through traits like SerdeObject others are implemented in PrimeField, others are implemented directly on the field. It would be great to remove redundancies, estabilish a clear name convention and document the end result.

Related issue: #109

First round, initial idea

Halo2curves serializations comes with two flavours:

Standard serialization
- API
- Implement trait Field.{from,to}_repr fn
- Implement trait group::GroupEncoding (for compressed), and group::UncompresseEncoding (for uncompressed), allowing "unchecked" reads.
- To match the serialization "standards", needs predefined format, configurable via macro at field/curve definition, not configurable when serialize/deserialize:
  - Fields: field endian in field_impl! macro
  - Curves: impl EndianRepr for endianess, impl Compressed for flags and to uncompress.
- Since repr does not have to match with encoding endianess, fields implements {from,to}_bytes(). Let's say that since we are using PrimeFields, we do not need checked reads here because this is only checking that fits inside modulos.
- Implements serde wrappers over group:xxxxEncodings
- Note that allows hex encodings (::hex::serde::deserialize(deserializer)) and I'm not very sure if this could break constant-time without notice
Vainilla "raw" serialization
- API
- fn {from, to}_raw_bytes_[unchecked]
- fn {read,write}_raw_[unchecked]
- Quick, just little endian, no special flags.
- Implememented via SerdeObject trait (which is not serde) at Field, Curve and tower
Others

Notes:

I really would like to get rid of EndianRepr but it seems like some kind of type-hack to allow to knowing the endiannes of CurveAffine::BASE when encoding compressing curves. Since our halo2curves takes traits from pasta, we are not able to modify it.
There are other ways to build fields:
- Field.from_raw() is not designed for serialization, and it seems using this name. A more rational name will be from_integer IMO, but we cannot change this because from_raw is used in a derive calling pasta_curves at this moment.
- from_uniform_bytes for random ones
- from_mont for montgomery

Unserialize performance:

legend:

cur / afi : Curve / Affine
chk / unchk : Checked / Unchecked
comp / uncomp : Compressed / Uncompressed

code: https://gist.github.com/adria0/c440185d765a368aaf21ca5741a63ab7

Initial proposal:

Remove SerdeObject. Why we need this? If we want to quick process data between trusted parties we have UncompresseEncoding::from_uncompressed_unchecked (that is in the order of magnitude of SerdeObject decodings), if not we can use GroupEncoding::from_bytes that fits into standards. Also having a mix of endianess if we use one or another API just creates confusion IMO. The memory footprint that would save by using {read,write}_raw will be really low, so even if we decide to enter into zerocopy stuff, r/w traits here are not a significant improvment.
IMO Arkworks is the standard of interoperability at this moment, we must encode everything as this guys do.
Implement serde for Field, is not implemented
EndianRepr::from_bytes, EndianRepr::to_bytes is implemented in Field (that aready have these fields). This is extra confusing, these are just intended to be used internally as helpers. So rename this fields for something like endianrepr_{from,to}_bytes

Comments: @davidnevadoc @duguorong009 @kilic ?

guorong009 commented 1 month ago

@adria0 First of all, thx for all research and proposal!

Here is my idea about your proposal:

Remove SerdeObject

Why we need this? If we want to quick process data between trusted parties we have UncompresseEncoding::from_uncompressed_unchecked (that is in the order of magnitude of SerdeObject decodings), if not we can use GroupEncoding::from_bytes that fits into standards.

I think this SerdeObject is for field stuff, while UncompressedEncoding is for curve one. https://github.com/duguorong009/halo2curves/blob/b753a832e92d5c86c5c997327a9cf9de86a18851/derive/src/field/mod.rs#L485 Maybe, they do not bring any meaningful difference in underlying implementation. But, I think it's odd to implement UncompressedEncoding trait for any field.

Follow Arkworks standard I think @davidnevadoc did some research on serialization in Arkworks, in previous works.
Implement serde for Field, is not implemented I think SerdeObject is for this purpose. They help to convert field to bytes and vice versa.
Rename EndianRepr::[from, to]_bytes Do we need this renaming? As you said, these are for internal use.

Overall, I think we first need documentation. The documentation, I mean, is both of one explaining all current traits(role, use, ...), and one of planning how to improve.

davidnevadoc commented 1 month ago

Thanks @adria0 for the in depth description of the issue!

I think we should tackle fields and curve serialization separately:

For curves:

GroupEncoding and UncompressedEncoding are traits meant for curves (groups), not for fields.
Curve point serialization is, with the exception of compression, little more than serializing field elements. So choices such as endianness should be made at the field level, and the curves just inherit them.

For fields, let's stick to prime fields only for now ( field extensions can be represented in multiple ways and that is a bigger problem that exceeds the scope of this issue).

I agree with @adria0 point on SerdeObject. I think it is one of the main points of redundancy and it may need to be removed. In my view, its name makes it easy to confuse with the usual serde::Serialize and serde::Deserialize traits. Then, the doc that explains its functionality states:
```
/// Trait for converting raw bytes to/from the internal representation of a type.
/// For example, field elements are represented in Montgomery form and serialized/deserialized without Montgomery reduction.
```
I can't see why this kind of access needs to be exposed publicly. Specially when Montgomery/non-Montgomery form introduces one of the main sources of confusion when it comes to field representation.
I also agree with point made about from_raw. Unfortunately this is not a clear name, raw could mean anything. Is it raw bytes? raw limbs/ u64s? Montgomery or non-Montgomery?. I think just the opposite about from_uniform_bytes. I think it is crystal-clear that the input for this function is a series of random bytes and therefore, it has to be transformed into a field element w/o validity checks.
from_mont and to_mont look to me like good names. (perhaps replacing SerdeObject?)

Lastly, I agree 100% with @guorong009 remark about documentation. Whichever design we land on, we should make an effort to have it well documented to prevent more confusion.

adria0 commented 1 month ago

Round two,

proposed changes

Remove SerdeObject

documentation

Field Encodings

from_bytes/to_bytes: They use an industry-standard format that is consistent with how curves are encoded. This format is what will be used internally by the Serde library to ensure interoperability. Provides a unified format for both field and curve serialization. Ensures a consistent, industry-standard serialization, using big or little endian depending on the curve
from_mont/to_mont: These methods convert elements to and from the Montgomery form, which is an internal representation that is commonly used for efficient field arithmetic. Use these when working specifically with Montgomery-reduced values, especially in cryptographic computations.
from_raw: Creates a field element from a raw integer (typically limbs or u64s). Use this method when directly converting an integer value into a field element.
from_uniform_bytes: Converts a uniform random byte array into a valid field element. This is particularly useful in scenarios requiring a random element in the field, such as in cryptographic protocols or when hashing to the field.

Curve Encodings

GroupEncoding trait methods: Implements the serialization and deserialization of curve points in a compressed format. Compression is slower but generates standardized encodings that are smaller in size. Suitable for storing and transmitting curve points efficiently. Serde will use this.
UncompressedEncoding trait methods: Provides faster serialization/deserialization of curve points in an uncompressed format. The output is larger than that of GroupEncoding, but it's quicker to generate. When speed is prioritized over size.

Notes:

from_bytes, to_bytes from EndianRepr trait is only intended for internal use only. Do not use.

adria0 commented 1 month ago

davidnevadoc commented 1 month ago

I tried to new version of the serialization where SerdeObject is removed in halo2, and found out that there are more serialization traits! SerdePrimeField for example, implemented in Halo2... Concretely, we should decide what to do with the parts of serialization implemented in halo2 (see https://github.com/privacy-scaling-explorations/halo2/blob/main/halo2_backend/src/helpers.rs) Should we bring this kind of functionality to halo2curves ?

adria0 commented 1 month ago

I tried to new version of the serialization where SerdeObject is removed in halo2, and found out that there are more serialization traits! SerdePrimeField for example, implemented in Halo2... Concretely, we should decide what to do with the parts of serialization implemented in halo2 (see https://github.com/privacy-scaling-explorations/halo2/blob/main/halo2_backend/src/helpers.rs) Should we bring this kind of functionality to halo2curves ?

I think that implementing this helper in halo2curves, just adds more confusion, a library only has to provide one way to do the same thing.

I also question, why does this helper exists? It's just a simple wrapper. Exists maybe because halo2curves serialization was not really documented and was a way for implementer to clarify what can be done in halo2curves? I do not see this usefull in Halo2 in terms of cryptographic protocol.

guorong009 commented 1 month ago

@adria0 @davidnevadoc I've done some research why these helpers SerdeFormat(halo2), SerdeObject(halo2curves) are there. The SerdeObject trait was introduced in this PR. https://github.com/privacy-scaling-explorations/halo2curves/commit/5bcc8914a4d1d75901b62eeac3c4347f826f09d0 The SerdeFormat was introduced in this PR. https://github.com/privacy-scaling-explorations/halo2/pull/111 All of work was done by @jonathanpwang . Before that, there was no concept of Serde* in the halo2 and halo2curves. For example, the zcash/halo2 has only one option - CurveRead(compressed). https://github.com/zcash/halo2/blob/main/halo2_proofs/src/helpers.rs

I believe we should check why @jonathanpwang added these. If those traits & helpers do not have specific purpose we don't know, we can remove them, as @adria0 mentioned.

jonathanpwang commented 1 month ago

The reason they were added is that previously the default in halo2 and halo2curves was that serde always serialized and deserialized to the human-readable canonical representation of a BigInt as some u64 digits. However the internal struct stores the BigInt primes in Montgomery form. Therefore the serialization and deserialization was doing Montgomery reduction each time, which is not desirable if the serialization is being used for raw storage purposes that don't need human readability. For halo2 this was particularly relevant for proving keys.

So Serde* was created as a way to offer these separate "raw" (de)serde methods without interfering with the existing human readable serde methods.

I haven't been up to date on the current situation, so I can't tell if they are needed anymore, but this was the original motivation.

Edit: SerdeObject can be safely removed if you are able to define from_mont and to_mont somewhere else. The Serde* traits in halo2 was purely a way to extend traits beyond ff::Field. SerdeFormat is an enum to toggle between the different serialization methods.

privacy-scaling-explorations / halo2curves

Serialization #176

proposed changes

documentation