ledgerwatch / erigon-lib

Dependencies of Erigon project, rewritten from scratch and licensed under Apache 2.0
Apache License 2.0
59 stars 94 forks source link

ETL load does not support 0 value keys #766

Open elee1766 opened 1 year ago

elee1766 commented 1 year ago

first time using etl just now- I was scratching my head over why Load was not working, it was because i was trying to load empty values, possible with direct transaction, but seems currently seems in ETL framework it is used as the delete identifier.

https://github.com/ledgerwatch/erigon-lib/blob/main/etl/collector.go#L239

One possible solution - add TransformArg "AllowEmptyValues" to skip this check, and allow therefore "zero values"? this is something supported by MDBX. Happy to make a PR if it seems like a good idea.

example use case, filling a database with a large amount of random empty keys for performance benchmark:

func (d *Database) AddKeys(ctx context.Context, bucket string, bytesize int, keysize int) error {
    src := frand.New()
    cl := etl.NewCollector("mdbxbench", os.TempDir(), etl.NewOldestEntryBuffer(etl.BufferOptimalSize*5))
    for i := 0; i < (bytesize / keysize); i++ {
        o := make([]byte, keysize)
        src.Read(o)
        err := cl.Collect(o[:], []byte{})
        if err != nil {
            return err
        }
    }
    defer cl.Close()
    err := d.DB().Update(ctx, func(tx kv.RwTx) error {
        return cl.Load(tx, bucket, func(k, v []byte, _ etl.CurrentTableReader, next etl.LoadNextFunc) error {
            return next(k, k, []byte{}) /// it would be good if this actually added the entry, instead of deleting. For now i am just doing []byte{0}
        }, etl.TransformArgs{})
    })
    if err != nil {
        return err
    }
    return nil
}
AskAlexSharov commented 1 year ago

It's maybe-big task. Next things needs to be done:

elee1766 commented 1 year ago

Hm, I will take one step at a time. and update this post as I investigate

* check that mdbx-go bindings do support empty values - maybe not (because LMDB didn't support them). Sub-tasks - can store `empty value` in: value, key, value of DupSort table.

it seems https://github.com/torquem-ch/mdbx-go/blob/master/mdbx/txn.go#L548 does not support empty values. replaces 0 length with a []byte{0}, then passes to Put. it seems in MDBX - reserve.iov_len = (data ? data->iov_len : 0) + sizeof(mdbx_attr_t); is used, implying that if NULL is passed as data, that would be how to set an empty value. Assuming key is the same

* check that kv_mdbx.go does support empty values

seems it passes through https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L1109, https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L1374 all the way down, so it does support empty values

as for empty keys, seems it is properly handling it, at least here: https://github.com/ledgerwatch/erigon-lib/blob/main/kv/mdbx/kv_mdbx.go#L535. tests will need to be written

TODO:

* check that kv_remote and remotedbserver do support empty values

* ensure that `empty value` it's not `nil`. That all API's (Seek, SeekBothRange, Get, Next, First, Last, etc...)- if you put `empty value` then return `empty value` not `nil`. likely need write tests for this special case - because it's easy to break.

* etl has 3 collectors all of them need support `empty value`

* etl now implemented next way: `nil`-value means delete this key. Need keep this feature. etl doesn't support deletes from DupSort tables.

* take a look into `rawdb` package - maybe there are checks like `if len(k) == 0`: `nil` and `empty value` will pass this check.
AskAlexSharov commented 1 year ago

FYI: you can ignore b.AutoDupSortKeysConversion == true case