ghdna / athena-express

Athena-Express can simplify executing SQL queries in Amazon Athena AND fetching cleaned-up JSON results in the same synchronous or asynchronous request - well suited for web applications.
https://www.npmjs.com/package/athena-express
MIT License
181 stars 70 forks source link

Handle NextToken for large returned data ? #20

Closed phongthienvu closed 4 years ago

phongthienvu commented 4 years ago

In a case of large returned data (100k rows), as I understand, the data may be truncated in the first call (1000 rows) + a NextToken field. I just wonder if the 'athena-express' handles this case? If not, any workaround for this case?

ghdna commented 4 years ago

Yeah it will query all rows and store it in S3. What you can do is instead of having athena-express retrieve all results, just setup skipResults: true in configuration. This will return the S3 location that you can access directly using SDK/CLI and paginate the results as you wish.

phongthienvu commented 4 years ago

Hi ghdna, I thought my question is not clear enough. Anyway, I am still learning this Athena, so I may be wrong, please correct me if any. As I read through the code, I saw you get the query result from saved S3 files(.csv). But from AWS example (https://docs.aws.amazon.com/code-samples/latest/catalog/javascript-athena-index.js.html), they are using getQueryResults to get the result. And in this method, as AWS SDK doc(https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Athena.html#getQueryResults-property), it returns a NextToken field (if the result is larger than a limit which I still don't know exactly how many). Therefore, if the code only bases on S3 files to get query results, I think it may cause missing data.

ghdna commented 4 years ago

So the AWS documentation you shared uses getQueryResults method which restricts to 1000 records from S3 and provides a next token for every 1000 records - which can be time consuming and require multiple calls to S3.

Therefore in athena-express, I'm not using this method. Instead athena-express reads a stream of data from S3 so you can fetch all the records (even >1000) if you'd like. However, if you don't want all the records, then you can skip this step by configuring skipResults: true and simply have athena-express store the results in S3 for you that you can retrieve later using your own logic in whatever method you see fit.