mtth / avsc

Avro for JavaScript :zap:
MIT License

discussion/question: createFileDecoder never gets any data #427

Closed cstmgl closed 1 year ago

cstmgl commented 1 year ago

Hi all, this is not a bug at all but I could not find a Q&A or discussion section.

Honestly, after a couple of hours of searching I could not find how to make it work, including with this other similar package: https://www.npmjs.com/package/avro-js (example: https://github.com/elevy30/bigdata-playground/blob/463bf854f4f291a51e55fcb8e1710b6944ccb8a3/spring-boot/spring-boot-avro/src/main/resources/node_avro.js#L23 ). I always end up with a similar problem.

I'm trying to do something like this: I have an Avro file in the file system (or in S3), and I thought this example from the tests would be similar: https://github.com/mtth/avsc/blob/master/test/test_index.js#L70

I get the "metadata" event as expected, and the schema of the file is exactly what I expect, but "error", "data", and "end" never get triggered.

      .on('metadata', function (type: any) {
        console.log('on metadata');
        console.log(type);
      })
      .on('error', function (err: any) {
        console.log('on error');
        console.log(err);
      })
      .on('data', function (record: any) {
        console.log('on data');
        console.log(record);
      })
      .on('end', function () {
        console.log('on end');
      });

Is there a simple example of how to read and parse an Avro file from the filesystem? I do not have the schema upfront. It can be retrieved from the Avro file itself, but most of the examples I found assume the .avsc file is available upfront.

Thanks for any feedback and sorry if this is very basic.

The file in question does have content: I can see it with an online tool, for example, or with a Python library.

I'm now sure the problem is that my file uses snappy compression: even when I use the snappy codec, no event is ever emitted (only the metadata). Is there an example available that uses a compressed file? I could not find one in the repo.

Might be similar to this issue https://github.com/mtth/avsc/issues/352 or https://github.com/mtth/avsc/issues/100 but I did not get around it by using end either

Anyway I put a sample file and what I've been trying here. https://github.com/mtth/avsc/compare/master...cstmgl:avsc:test-load-file

Next I'll probably try to do it as a stream buffer instead of reading the file, seems there are more examples of those.

jacospain commented 1 year ago

I had been running into a similar issue (no events after metadata, whether I included the snappy codec or not) on a snappy-compressed object file. I was copying the snappy example from the README (and I tried some other snappy code from GitHub issues).

I noticed that snappy's uncompress method appears to no longer take a callback. It returns a promise. I'm getting events now that I am interfacing with snappy properly.

cstmgl commented 1 year ago

Thank you for your support; it already gives me some ideas to try. I noticed that they now also have an uncompressSync.

Maybe I'll try that https://www.npmjs.com/package/snappy

I feel really bad about this: I was so focused on the Avro part that I never considered that something might have changed in snappy. That makes more sense, since I tried both avsc and avro-js and hit the same problem, likely because snappy changed in a way I didn't think to check.

cstmgl commented 1 year ago

> I had been running into a similar issue (no events after metadata, whether I included the snappy codec or not) on a snappy-compressed object file. I was copying the snappy example from the README (and I tried some other snappy code from GitHub issues).
>
> I noticed that snappy's uncompress method appears to no longer take a callback. It returns a promise. I'm getting events now that I am interfacing with snappy properly.

Out of curiosity, what does your code look like? I tried both the sync version and the regular "uncompress" and still get nothing.

  const codecs = {
      snappy: async function (buf: any, cb: Function):Promise<string | Buffer> {
          // Avro appends checksums to compressed blocks, which we skip here.
          const buffer: string | Buffer = await uncompress(buf.slice(0, buf.length - 4));
          return buffer;
      }
  };

or

  const codecs = {
      snappy: function (buf: any, cb: Function):Promise<string | Buffer> {
          // Avro appends checksums to compressed blocks, which we skip here.
         return uncompressSync(buf.slice(0, buf.length - 4));
      }
  };

if I print some output I can see that the buffer does change:

buffer original length is 114
buffer original value is <Buffer 76 b8 02 c2 9a 08 02 a6 03 02 08 96 c3 ad c2 ae 61 02 0a 36 37 32 33 33 02 02 16 01 46 66 36 37 37 65 63 62 30 64 30 37 34 31 34 35 64 2d 66 36 37 36 ... 64 more bytes>
uncompress length is 118
uncompress  value is <Buffer 02 c2 9a 08 02 a6 03 02 08 96 c3 ad c2 ae 61 02 0a 36 37 32 33 33 02 02 16 01 46 66 36 37 37 65 63 62 30 64 30 37 34 31 34 35 64 2d 66 36 37 37 65 63 ... 68 more bytes>

but still I get no events, I must be doing something really wrong

jacospain commented 1 year ago

Snappy doesn't take a callback, but I believe avsc still requires you to call its callback with the result of the decompression. So, instead of returning the buffer, be sure to call cb(buffer) or cb(uncompressSync(...)).

mtth commented 1 year ago

Thanks for reporting @cstmgl and for helping out @jacospain. I'll update the documentation to omit the outdated snappy examples.

cstmgl commented 1 year ago

@mtth thanks for the feedback. I would rather take a different approach than removing snappy from the documentation. I have not posted an update here because I'm having some issues with snappy, not so much with this library anymore. But I think it would be good to have a working example with either snappy or deflate, since using a codec is a fair assumption.

In my particular case, the files are generated by an application outside my control, so I can't just disable snappy on them. Ideally I would eventually like to provide a PR with a documentation example for snappy and deflate, in case someone else faces the same issue. I could also provide a separate PR with snappy and deflate examples and some sample files, but at the moment I'm still trying to make snappy work.

By the way, just to clarify: I'm really not much of an expert in JavaScript, so it's entirely possible, even likely, that the issue is on my side.

Anyway, this is where I'm at now. The deflate example is here: https://github.com/cstmgl/avsc/blob/test-load-file/test/test_index.js#L83 and it fails with:

       createFileDecoderDeflate:
     Uncaught Error: invalid union index: -11
      at UnwrappedUnionType._read (lib/types.js:1318:11)
      at RecordType.readPerson [as _read] (eval at RecordType._createReader (lib/types.js:2296:10), <anonymous>:6:8)
      at BlockDecoder._readValue (lib/containers.js:627:47)
      at BlockDecoder._read (lib/containers.js:305:16)
      at Deflate.cb (lib/containers.js:273:14)
      at Deflate.zlibBufferOnEnd (node:zlib:161:10)
      at Deflate.emit (node:events:513:28)
      at endReadableNT (node:internal/streams/readable:1358:12)
      at processTicksAndRejections (node:internal/process/task_queues:83:21)
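
For reference, the Avro specification defines the deflate codec as raw DEFLATE (RFC 1951), with no zlib header or checksum. A minimal sketch of a codec entry in that style follows, assuming the `(buf, cb)` codec shape used earlier in this thread and Node's built-in `zlib`. The sample block is a stand-in rather than real Avro data, and the sketch is not a diagnosis of the error above.

```javascript
const zlib = require('zlib');

// Codec map in the (buf, cb) shape discussed in this thread.
// Assumption: the decoder passes each compressed block to the codec and
// expects cb(err, decompressedBuffer) in return.
const codecs = {
  deflate: function (buf, cb) {
    // inflateRaw matches raw DEFLATE data and already reports its result
    // through an (err, result) callback, so cb can be passed straight in.
    zlib.inflateRaw(buf, cb);
  }
};

// Round trip on a stand-in block (not a real Avro container file).
const original = Buffer.from('stand-in for an encoded Avro block');
const compressed = zlib.deflateRawSync(original);
codecs.deflate(compressed, function (err, inflated) {
  if (err) throw err;
  console.log(inflated.equals(original)); // prints true
});
```

Note that a decoding codec has to inflate the block; a compressing call such as `zlib.deflateRaw` in the same slot would hand the decoder bytes it cannot parse.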

The snappy example is here: https://github.com/cstmgl/avsc/blob/test-load-file/test/test_index.js#L109 and it fails with:

       createFileDecoderSnappy:
     Uncaught Error: snappy codec decompression error
      at /workspaces/avsc/lib/containers.js:265:17
      at BlockDecoder.snappy [as _decompress] (test/test_index.js:118:16)
      at BlockDecoder._writeChunk (lib/containers.js:245:10)
      at BlockDecoder._write (lib/containers.js:229:8)
      at writeOrBuffer (node:internal/streams/writable:391:12)
      at _write (node:internal/streams/writable:332:10)
      at BlockDecoder.Writable.write (node:internal/streams/writable:336:10)
      at ReadStream.ondata (node:internal/streams/readable:754:22)
      at ReadStream.emit (node:events:513:28)
      at addChunk (node:internal/streams/readable:315:12)
      at readableAddChunk (node:internal/streams/readable:289:9)
      at ReadStream.Readable.push (node:internal/streams/readable:228:10)
      at node:internal/fs/streams:279:14
      at FSReqCallback.wrapper [as oncomplete] (node:fs:671:5)

mtth commented 1 year ago

Hi @cstmgl. There is still a (now updated) example including Snappy, linked from the README.

W.r.t. your test using Snappy, I think you need to update this line to cb(null, buffer). Callbacks take an error as first argument by convention.
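
To illustrate the error-first convention with a promise-returning decompressor, here is a minimal sketch. `makeSnappyCodec` is a hypothetical helper (not part of avsc or snappy), and a stand-in decompressor is used so that the snippet runs without the snappy package installed. The commented-out lines assume snappy's promise-based `uncompress`, as described earlier in this thread.

```javascript
// Hypothetical helper: adapts a promise-returning (or synchronous)
// decompressor to the (buf, cb) callback style discussed above.
function makeSnappyCodec(uncompress) {
  return function (buf, cb) {
    // Avro appends a 4-byte checksum to each compressed block; skip it here.
    const payload = buf.subarray(0, buf.length - 4);
    // Starting from Promise.resolve() turns a synchronous throw into a rejection.
    Promise.resolve()
      .then(() => uncompress(payload))
      .then(
        (inflated) => cb(null, inflated), // success: the error slot is null
        (err) => cb(err)                  // failure: the error goes first
      );
  };
}

// Usage sketch, assuming snappy's promise-based uncompress:
// const { uncompress } = require('snappy');
// const codecs = { snappy: makeSnappyCodec(uncompress) };

// Stand-in decompressor that just copies its input.
const fakeUncompress = async (payload) => Buffer.from(payload);
const codec = makeSnappyCodec(fakeUncompress);
codec(Buffer.from([1, 2, 3, 0, 0, 0, 0]), (err, out) => {
  console.log(err, out);
});
```

Returning the buffer from an `async` function, as in the earlier attempts, never invokes `cb`, which would be consistent with no events being emitted after the metadata.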

cstmgl commented 1 year ago

I just wanted to come back and confirm, for whoever hits a similar error, that this is now fixed.