linkedin / goavro

Apache License 2.0
971 stars 216 forks source link

Getting a single binary-encoded object [OCFReader.ReadBinary() function?] #108

Open floren opened 6 years ago

floren commented 6 years ago

I'd like to be able to take an OCF as input, then pull out individual binary-encoded objects one at a time. I will store the schema separately and later apply it to these individual objects using Codec.NativeFromBinary.

I can currently use OCFReader.Scan/Read and Codec.BinaryFromNative to get a single binary-encoded object, but the call to OCFReader.Read invokes Codec.NativeFromBinary. Because performance is an important consideration in this, I'd like to be able to skip those redundant decoding/encoding calls.

Ideally, something like OCFReader.ReadBinary() would return a []byte containing exactly one Avro object. I'm trying to implement it, but being unfamiliar with goavro I'm having a hard time figuring out if I can generically pull out the slice of bytes representing a single object without basically reimplementing nativeFromBinary functions for each Avro type. Any suggestions?

floren commented 6 years ago

Here's a hacky, apparently-functional implementation of ReadBinary, but it still calls NativeFromBinary because I haven't yet figured out how to figure out the length of an object without essentially reimplementing portions of the nativeFromBinary functions of each individual object:

func (ocfr *OCFReader) ReadBinary() ([]byte, error) {
    // NOTE: Test previous error before testing readReady to prevent overwriting
    // previous error.
    if ocfr.rerr != nil {
        return nil, ocfr.rerr
    }
    if !ocfr.readReady {
        ocfr.rerr = errors.New("Read called without successful Scan")
        return nil, ocfr.rerr
    }
    ocfr.readReady = false

    olen := len(ocfr.block)

    // We do this to consume the next entry...
    var nblock []byte
    _, nblock, ocfr.rerr = ocfr.header.codec.NativeFromBinary(ocfr.block)
    if ocfr.rerr != nil {
        return nil, ocfr.rerr
    }
    ret := make([]byte, olen-len(ocfr.block))
    copy(ret, ocfr.block)

    ocfr.block = nblock

    ocfr.remainingBlockItems--

    return ret, nil
}

It does provide approximately a 50% improvement for my use case, but it's still got redundancies.

karrick commented 6 years ago

This is a very sensible feature.

bruth commented 5 years ago

Any more consideration for adding support for this method?

I have a use case requiring an efficient way to scan/skip records. I will be maintaining an external index file from a field value to the record offset. So ideally, I would like a way to seek to specific record offsets in order to pull out and materialize the specific records.

This method will reduce the overhead of performing a full table scan since records can be skipped without the overhead of unmarshalling them. A sequential scan using this method along with a way to split up an Avro file to be processed in parallel (if the file is particularly large), would likely be the ideal approach for this use case (as supposed to random access across blocks).

Any thoughts or guidance on this use case for would be appreciated!

bruth commented 5 years ago

Peaking through the codec code, it seems to get the best performance, the codec type needs to be extended to support reading the raw bytes without any decoding (which the @floren's code example still does). This means all of the functions that create a codec need to be updated to support this.

karrick commented 5 years ago

Each piece of data stored in Avro does not have a header that specifies how many bytes it consumes. That would be cool if it did, and it would allow us to provide a means of addressing this issue, but also several others surrounding schema evolutions.

The closest thing that we get is the OCF format does have a block header that specifies how many bytes are used in each block, so a OCF reader can scan and skip B blocks before it starts decoding block B+1 to find item N.

kmulvey commented 5 years ago

naive question: putting performance aside for a second could you just add ocfr.block to the list of return items?

karrick commented 5 years ago

@kmulvey , I have added the 108 branch as a strawman change so we can toss back and forth ideas about a proper API.

https://github.com/linkedin/goavro/tree/108

tkonya commented 3 years ago

@karrick I think the changes in the 108 branch would resolve an issue for me. What is the process for this getting merged into master?