skale-me / node-parquet

NodeJS module to access Apache Parquet format files
Apache License 2.0

run node-parquet in AWS Lambda #20

Open taureliloome opened 7 years ago

taureliloome commented 7 years ago

Hi, I wanted to use this wonderful module in AWS Lambda. The key blocker is that when I compile the node-parquet module, the whole thing is over 400 MB; unfortunately, AWS Lambda only allows uploading roughly 240 MB per Lambda function. I was wondering whether there is any possibility to slim the output down, or is this what we get? In any case, I'm looking through the makefiles to understand if I can do something on my own. Thanks for your time!

mvertes commented 7 years ago

It should be possible to be smaller than 400 MB

mvertes commented 7 years ago

Hi, can you check again and run npm run clean after npm install? It should remove most of the files that are needed for building but useless at runtime.
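
For reference, a minimal sketch of that sequence, assuming node-parquet is installed as a dependency and its clean script is run from the module's own directory (the size check is only illustrative):

npm install node-parquet
# strip build-time artifacts that are not needed at runtime
(cd node_modules/node-parquet && npm run clean)
# see how much the dependency weighs after cleaning
du -sh node_modules/node-parquet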

alaister commented 6 years ago

Any luck getting it to work? I am getting the following error when trying to run in Lambda:


{
  "errorMessage": "libboost_regex.so.1.62.0: cannot open shared object file: No such file or directory",
  "errorType": "Error",
  "stackTrace": [
    "Object.Module._extensions..node (module.js:597:18)",
    "Module.load (module.js:487:32)",
    "tryModuleLoad (module.js:446:12)",
    "Function.Module._load (module.js:438:3)",
    "Module.require (module.js:497:17)",
    "require (internal/module.js:20:19)",
    "Object.<anonymous> (/var/task/node_modules/node-parquet/index.js:5:17)",
    "Module._compile (module.js:570:32)",
    "Object.Module._extensions..js (module.js:579:10)"
  ]
}

aib-nick commented 6 years ago

@alaister make a lib folder in your Lambda function package and copy that library in there; that worked for me.
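
A rough sketch of that workaround, assuming a build machine that matches the Lambda runtime (the library path and version below are placeholders; use whatever libboost_regex the addon was actually linked against):

# ship the boost regex shared library inside the deployment package
mkdir -p lib
cp /usr/lib64/libboost_regex.so.1.53.0 lib/
# /var/task/lib is on the Lambda runtime's default LD_LIBRARY_PATH,
# so the native addon should be able to resolve it at load time
zip -r function.zip index.js lib node_modules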

fzaffarana commented 6 years ago

@aib-nick can you give me more information about your work with AWS Lambdas and the module?

I'm getting this error:

module initialization error: Error at Object.Module._extensions..node

I guess it is the same error that @alaister had.

Thanks!

aib-nick commented 6 years ago

@fzaffarana this is my Lambda application layout. As you can see, I just made a lib directory and copied the missing library in there. I use AWS Cloud9 for Lambda development, so I got the library from there, and it works when deployed.

./myprogram
./myprogram/index.js
./lib
./lib/libboost_regex.so.1.53.0
./node_modules/node-parquet/...
... other modules installed with normal npm install  ...
./template.yaml
./.application.json

and then I just include and use stuff normally

I have successfully made Parquet files on S3 with this by putting a function inside a Kinesis stream as a transformation function, and then throwing away all the transformations: the Lambda function writes to S3, and the Kinesis stream does not. It almost worked, but I got a few errors where Kinesis aborted, and I couldn't really debug what was going on; ultimately I had to abandon this method because of time constraints. But it was very close, and I was able to read the resulting files from Athena.

// setup AWS access
const setRegion = "us-east-1";
const AWS = require('aws-sdk');
AWS.config.update({ region: setRegion });

// setup S3 access
const s3 = new AWS.S3();

// helpers used below: temp files, filesystem streams and date formatting
const tmp = require('tmp');
const fs = require('fs');
const moment = require('moment');

// parquet access
const parquet = require('node-parquet');
// ...

exports.handler = (event, context, callback) => {

    // ...

    // schema for this parquet file
    const schema = { /* ... */ };

    // ... loop through input and build up out_data[] ...

            // write one batch to a temporary local parquet file
            var tmpobj = tmp.fileSync();
            var writer = new parquet.ParquetWriter(tmpobj.name, schema, 'snappy');
            writer.write(out_data[k]);
            writer.close();

    // ... write to S3 ...

            // give S3 the ability to read the local file and stream it
            var rs = fs.createReadStream(tmpobj.name);

            // Hive-style partitioned key: year=/month=/day=
            var s3_key = "parquet/stuff/year=" + moment(k).format('YYYY');
            s3_key = s3_key + "/month=" + moment(k).format('MM');
            s3_key = s3_key + "/day=" + moment(k).format('DD');
            s3_key = s3_key + "/" + invocationId + ".snappy.parquet";

    // ...

            s3.putObject(s3_put_params, function(err, data) {

                // ... throw away records so Kinesis doesn't write them after we wrote ok ...

                // this tells Kinesis to throw away all the records we saved otherwise
                output.push({
                    recordId: record.recordId,
                    result: 'Dropped'
                });

    // ...

    callback(null, { records: output });

fzaffarana commented 6 years ago

@aib-nick thank you first of all for the help.

I can see that we have similar Lambdas, which is good. (I'm going to borrow your trick of giving S3 the ability to read the local file and stream it.)

But I don't know if we have the same error.

This is mine (from the AWS console):

module initialization error: Error
at Object.Module._extensions..node (module.js:681:18)
at Module.load (module.js:565:32)
at tryModuleLoad (module.js:505:12)
at Function.Module._load (module.js:497:3)
at Module.require (module.js:596:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (/var/task/src/project/classes/node-parquet/index.js:5:17)
at Module._compile (module.js:652:30)
at Object.Module._extensions..js (module.js:663:10)
at Module.load (module.js:565:32)
at tryModuleLoad (module.js:505:12)
at Function.Module._load (module.js:497:3)
at Module.require (module.js:596:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (/var/task/src/project/classes/Tools.js:4:17)
at Module._compile (module.js:652:30)

It doesn't show any specific library missing. On the other hand, when I test this Lambda in my local environment, it works correctly.
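
One way to pin down which shared library is actually missing is to run ldd against the compiled addon inside an Amazon Linux container; the .node path below is an assumption and may differ depending on how node-parquet was built:

# list shared-library dependencies the addon cannot resolve
ldd node_modules/node-parquet/build/Release/parquet.node | grep 'not found'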

visuddha commented 6 years ago

This would be a useful feature!

palafoxernesto commented 6 years ago

Is there any fix?

dogenius01 commented 6 years ago

@aib-nick Hi, could you list the lib files? I'm running the Lambda and hitting one library error after another... :( Please help!

mikeytag commented 6 years ago

It's been a while since this question was originally asked, but I wanted to follow up and see if anyone has a tried-and-true way of doing the npm install and adding the lib files that reliably gets node-parquet working on Lambda.

I'm about to embark on this task and would love to hear the wisdom of others on any gotchas.

paflopes commented 4 years ago

I've managed to run node-parquet on AWS Lambda with the NodeJS 10.x runtime; it's worth mentioning that I couldn't build it on newer NodeJS versions. You'll also need Docker installed on your machine. The steps are the following:

Run this in the root folder of your project

$ docker run --rm -it -v "$PWD":/var/task lambci/lambda:build-nodejs10.x /bin/bash

This will give you an environment similar to the AWS Lambda runtime.

Inside the container run the following commands:

# First we update the cmake version, since this image comes with version 2
cmake_name="cmake-3.16.1-Linux-x86_64"
cmake_tar="${cmake_name}.tar.gz"
curl -L https://github.com/Kitware/CMake/releases/download/v3.16.1/${cmake_tar} -o /opt/${cmake_tar}
mkdir -p /opt/${cmake_name}
tar xf /opt/${cmake_tar} -C /opt
chmod a+x /opt/${cmake_name}/bin/cmake
mv /bin/cmake /bin/cmake.bkp
ln -s /opt/${cmake_name}/bin/cmake /bin/cmake

# Now we install the remaining dependencies and build the project
yum install -y boost-devel bison flex
npm install

# Cleanup dependencies so we can actually deploy to AWS Lambda
rm -Rf ./node_modules/node-parquet/build_deps

I hope this helps!
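
For completeness, a sketch of how the resulting build might be packaged and deployed from the host afterwards (the function name is a placeholder):

# node_modules now contains the Amazon Linux build without build_deps
zip -r function.zip index.js node_modules
# rough size check against the Lambda deployment limits
ls -lh function.zip
aws lambda update-function-code \
  --function-name my-parquet-writer \
  --zip-file fileb://function.zip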

dreadjr commented 4 years ago

I have done something similar to what @paflopes describes, putting that into a layer which the application can use.
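
A sketch of what that layer packaging could look like (layer name and paths are illustrative; Node.js layers expose modules placed under nodejs/node_modules):

# Lambda layers for Node.js expect modules under nodejs/node_modules
mkdir -p layer/nodejs
cp -r node_modules layer/nodejs/
(cd layer && zip -r ../node-parquet-layer.zip nodejs)
aws lambda publish-layer-version \
  --layer-name node-parquet \
  --zip-file fileb://node-parquet-layer.zip \
  --compatible-runtimes nodejs10.x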