ghdna / athena-express

Athena-Express can simplify executing SQL queries in Amazon Athena AND fetching cleaned-up JSON results in the same synchronous or asynchronous request - well suited for web applications.
https://www.npmjs.com/package/athena-express
MIT License

The response json returns the empty columns #2

Closed deliverymanager closed 5 years ago

deliverymanager commented 5 years ago

When some columns in a row are empty / null, the response JSON returned from the query contains empty indexes.

I think you should create an array with all falsy values removed, so that only the columns that actually exist in a specific row are returned in the response.

{ resource: 'dynamo',
       name: 'customers",,,,"mpla",,,"table',
       bucket_name: 'fetchCustomer',
       directory_path: 'production' }
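What the suggestion amounts to, as a sketch (dropEmptyColumns is a hypothetical helper for illustration, not part of athena-express):

```javascript
// Hypothetical helper: keep only the columns that have a non-empty value
function dropEmptyColumns(row) {
  const cleaned = {};
  for (const [key, value] of Object.entries(row)) {
    if (value !== null && value !== undefined && value !== "") {
      cleaned[key] = value;
    }
  }
  return cleaned;
}

console.log(dropEmptyColumns({ resource: "dynamo", name: "customers", bucket_name: "" }));
// { resource: 'dynamo', name: 'customers' }
```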
deliverymanager commented 5 years ago

Actually, this causes the JSON to contain wrong data.

ghdna commented 5 years ago

Good use case. Let me look into it. Could you please share the SQL query you are invoking and the sample data you have in Athena, so I can replicate it?

deliverymanager commented 5 years ago

The query is pretty simple: "SELECT * FROM db"

This is the csv as it is saved after the query.

resource,"name","bucket_name","directory_path","file_name","comments","path_name","function_name","index_name","lambda","branch"
--
module,"util",,,,"mpla   mpla   mpla",,"login_mm",,"fetchCustomer","production"
dynamo,"customers",,,,"mpla",,,"table","fetchCustomer","production"
dynamo,"storeCustomers",,,,"mpla",,,"table","fetchCustomer","production"
athena,,,,,,,,,"format_athena_table","master"
s3,,"reporesources","","resources.json","Uploads   the resources.gz for the specific repo that comes from the   webhook",,,,"resources_github_webhook","master"
api,"api.github.com",,,,"We   use the api of github to get the resources.json file of each repo that comes   from the   webhook","getContents",,,"resources_github_webhook","master"

and this is what the JSON should look like, based on a CSV-to-JSON conversion with an online tool:

[
  {
    "resource": "module",
    "name": "util",
    "bucket_name": "",
    "directory_path": "",
    "file_name": "",
    "comments": "mpla mpla mpla",
    "path_name": "",
    "function_name": "login_mm",
    "index_name": "",
    "lambda": "fetchCustomer",
    "branch": "production"
  },
  {
    "resource": "dynamo",
    "name": "customers",
    "bucket_name": "",
    "directory_path": "",
    "file_name": "",
    "comments": "mpla",
    "path_name": "",
    "function_name": "",
    "index_name": "table",
    "lambda": "fetchCustomer",
    "branch": "production"
  },
  {
    "resource": "dynamo",
    "name": "storeCustomers",
    "bucket_name": "",
    "directory_path": "",
    "file_name": "",
    "comments": "mpla",
    "path_name": "",
    "function_name": "",
    "index_name": "table",
    "lambda": "fetchCustomer",
    "branch": "production"
  },
  {
    "resource": "athena",
    "name": "",
    "bucket_name": "",
    "directory_path": "",
    "file_name": "",
    "comments": "",
    "path_name": "",
    "function_name": "",
    "index_name": "",
    "lambda": "format_athena_table",
    "branch": "master"
  },
  {
    "resource": "s3",
    "name": "",
    "bucket_name": "reporesources",
    "directory_path": "",
    "file_name": "resources.json",
    "comments": "Uploads the resources.gz for the specific repo that comes from the webhook",
    "path_name": "",
    "function_name": "",
    "index_name": "",
    "lambda": "resources_github_webhook",
    "branch": "master"
  },
  {
    "resource": "api",
    "name": "api.github.com",
    "bucket_name": "",
    "directory_path": "",
    "file_name": "",
    "comments": "We use the api of github to get the resources.json file of each repo that comes from the webhook",
    "path_name": "getContents",
    "function_name": "",
    "index_name": "",
    "lambda": "resources_github_webhook",
    "branch": "master"
  }
]

and below is what the athena-express module exports:

[
    {
      "resource": "module",
      "name": "util\",,,,\"mpla mpla mpla\",,\"login_mm\",,\"fetchCustomer",
      "bucket_name": "production"
    },
    {
      "resource": "dynamo",
      "name": "customers\",,,,\"mpla\",,,\"table",
      "bucket_name": "fetchCustomer",
      "directory_path": "production"
    },
    {
      "resource": "dynamo",
      "name": "storeCustomers\",,,,\"mpla\",,,\"table",
      "bucket_name": "fetchCustomer",
      "directory_path": "production"
    },
    {
      "resource": "s3\",,\"reporesources",
      "name": "",
      "bucket_name": "resources.json",
      "directory_path": "Uploads the resources.gz for the specific repo that comes from the webhook\",,,,\"resources_github_webhook",
      "file_name": "master"
    },
    {
      "resource": "api",
      "name": "api.github.com\",,,,\"We use the api of github to get the resources.json file of each repo that comes from the webhook",
      "bucket_name": "getContents\",,,\"resources_github_webhook",
      "directory_path": "master"
    },
    {
      "resource": "athena\",,,,,,,,,\"format_athena_table",
      "name": "master"
    }
  ]
ghdna commented 5 years ago

Which version of athena-express are you using? I just loaded your csv into Athena, queried select *, and got this:

{
    Items: [{
        resource: 'module',
        name: 'util',
        bucket_name: '',
        directory_path: '',
        file_name: '',
        comments: 'mpla mpla mpla',
        path_name: '',
        function_name: 'login_mm',
        index_name: '',
        lambda: 'fetchCustomer',
        branch: 'production'
    }, {
        resource: 'dynamo',
        name: 'customers',
        bucket_name: '',
        directory_path: '',
        file_name: '',
        comments: 'mpla',
        path_name: '',
        function_name: '',
        index_name: 'table',
        lambda: 'fetchCustomer',
        branch: 'production'
    }, {
        resource: 'dynamo',
        name: 'storeCustomers',
        bucket_name: '',
        directory_path: '',
        file_name: '',
        comments: 'mpla',
        path_name: '',
        function_name: '',
        index_name: 'table',
        lambda: 'fetchCustomer',
        branch: 'production'
    }, {
        resource: 'athena',
        name: '',
        bucket_name: '',
        directory_path: '',
        file_name: '',
        comments: '',
        path_name: '',
        function_name: '',
        index_name: '',
        lambda: 'format_athena_table',
        branch: 'master'
    }, {
        resource: 's3',
        name: '',
        bucket_name: 'reporesources',
        directory_path: '',
        file_name: 'resources.json',
        comments: 'Uploads the resources.gz for the specific repo that comes from the webhook',
        path_name: '',
        function_name: '',
        index_name: '',
        lambda: 'resources_github_webhook',
        branch: 'master'
    }, {
        resource: 'api',
        name: 'api.github.com',
        bucket_name: '',
        directory_path: '',
        file_name: '',
        comments: 'We use the api of github to get the resources.json file of each repo that comes from the webhook',
        path_name: 'getContents',
        function_name: '',
        index_name: '',
        lambda: 'resources_github_webhook',
        branch: 'master'
    }]
}
ghdna commented 5 years ago

And I just pushed 3.1.0, which removes empty column values from the final JSON. So the above JSON response now looks like this:

{
    Items: [{
        resource: 'module',
        name: 'util',
        comments: 'mpla mpla mpla',
        function_name: 'login_mm',
        lambda: 'fetchCustomer',
        branch: 'production'
    }, {
        resource: 'dynamo',
        name: 'customers',
        comments: 'mpla',
        index_name: 'table',
        lambda: 'fetchCustomer',
        branch: 'production'
    }, {
        resource: 'dynamo',
        name: 'storeCustomers',
        comments: 'mpla',
        index_name: 'table',
        lambda: 'fetchCustomer',
        branch: 'production'
    }, {
        resource: 'athena',
        lambda: 'format_athena_table',
        branch: 'master'
    }, {
        resource: 's3',
        bucket_name: 'reporesources',
        file_name: 'resources.json',
        comments: 'Uploads the resources.gz for the specific repo that comes from the webhook',
        lambda: 'resources_github_webhook',
        branch: 'master'
    }, {
        resource: 'api',
        name: 'api.github.com',
        comments: 'We use the api of github to get the resources.json file of each repo that comes from the webhook',
        path_name: 'getContents',
        lambda: 'resources_github_webhook',
        branch: 'master'
    }]
}

Try it out and let me know.

deliverymanager commented 5 years ago

Hello. It is a little strange, but I don't see any difference. I have installed version 3.1.5.

This is what I get from the response of the query.

[
    {
      "resource": "s3\",,\"reporesources",
      "bucket_name": "resources.json",
      "directory_path": "Uploads the resources.gz for the specific repo that comes from the webhook\",,,,\"resources_github_webhook",
      "file_name": "master"
    },
    {
      "resource": "api",
      "name": "api.github.com\",,,,\"We use the api of github to get the resources.json file of each repo that comes from the webhook",
      "bucket_name": "getContents\",,,\"resources_github_webhook",
      "directory_path": "master"
    },
    {
      "resource": "module",
      "name": "util\",,,,\"mpla mpla mpla\",,\"login_mm\",,\"fetchCustomer",
      "bucket_name": "production"
    },
    {
      "resource": "dynamo",
      "name": "customers\",,,,\"mpla\",,,\"table",
      "bucket_name": "fetchCustomer",
      "directory_path": "production"
    },
    {
      "resource": "dynamo",
      "name": "storeCustomers\",,,,\"mpla\",,,\"table",
      "bucket_name": "fetchCustomer",
      "directory_path": "production"
    },
    {
      "resource": "athena\",,,,,,,,,\"format_athena_table",
      "name": "master"
    }
  ]

and below is the csv

"resource","name","bucket_name","directory_path","file_name","comments","path_name","function_name","index_name","lambda","branch"
"athena",,,,,,,,,"format_athena_table","master"
"s3",,"reporesources","","resources.json","Uploads the resources.gz for the specific repo that comes from the webhook",,,,"resources_github_webhook","master"
"api","api.github.com",,,,"We use the api of github to get the resources.json file of each repo that comes from the webhook","getContents",,,"resources_github_webhook","master"
"module","util",,,,"mpla mpla mpla",,"login_mm",,"fetchCustomer","production"
"dynamo","customers",,,,"mpla",,,"table","fetchCustomer","production"
"dynamo","storeCustomers",,,,"mpla",,,"table","fetchCustomer","production"
deliverymanager commented 5 years ago

Another example that returned weird data is:

bucket_name,"directory_path","file_name","comments","lambda","branch"
reporesources,"","resources.json","Uploads the resources.gz for the specific repo that comes from the webhook","resources_github_webhook","master"
reporesources,"","validation.json","Uploads the validation.json so that the other lambdas can read it","resources_github_webhook","master"

For some reason, it returned a row attribute:

[
    {
      "row": "\"bucket_name\",\"directory_path\",\"file_name\",\"comments\",\"lambda\",\"branch\""
    },
    {
      "row": "\"reporesources\",\"\",\"resources.json\",\"Uploads the resources.gz for the specific repo that comes from the webhook\",\"resources_github_webhook\",\"master\""
    },
    {
      "row": "\"reporesources\",\"\",\"validation.json\",\"Uploads the validation.json so that the other lambdas can read it\",\"resources_github_webhook\",\"master\""
    }
  ]

Trying the same csv online gives me the following: https://www.csvjson.com/csv2json/9e28b64040aba3e73406ef38cfcde994

ghdna commented 5 years ago

Quick question: in your config object, do you have formatJson: false?

wasserholz commented 5 years ago

I have the exact same issue with version 3.2.5 installed and jsonFormat: true. If jsonFormat is false, I get no response at all.
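For reference, the option goes on the config object passed to the athena-express constructor. A minimal sketch (the bucket name is a placeholder); note the README spells the option formatJson, not jsonFormat:

```javascript
const aws = require("aws-sdk");
const AthenaExpress = require("athena-express");

const athenaExpress = new AthenaExpress({
  aws,                                 // pre-configured AWS SDK instance
  s3: "s3://my-athena-results-bucket", // placeholder results bucket
  formatJson: true,                    // false returns the raw results instead
});
```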

ghdna commented 5 years ago

This is what I get from athena-express with select * query on your other sample data

{
    Items: [{
        bucket_name: 'bucket_name',
        directory_path: 'directory_path',
        file_name: 'file_name',
        comments: 'comments',
        lambda: 'lambda',
        branch: 'branch'
    }, {
        bucket_name: 'reporesources',
        file_name: 'resources.json',
        comments: 'Uploads the resources.gz for the specific repo that comes from the webhook',
        lambda: 'resources_github_webhook',
        branch: 'master'
    }, {
        bucket_name: 'reporesources',
        file_name: 'validation.json',
        comments: 'Uploads the validation.json so that the other lambdas can read it',
        lambda: 'resources_github_webhook',
        branch: 'master'
    }]
}

How are you importing your csv into Athena?

Following is how I added it. Do your SERDEPROPERTIES match mine?

CREATE EXTERNAL TABLE IF NOT EXISTS results4 (
bucket_name string,
directory_path string, 
file_name string,
comments string,
lambda string,
branch string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'quoteChar' = '"',
  'field.delim' = ','
) LOCATION 's3://mybucketname/myfolder/'
TBLPROPERTIES ('has_encrypted_data'='false');
wasserholz commented 5 years ago

I use AWS Firehose. When I query the data in Athena through the AWS web interface, the table looks fine. By the way, I use structs as column types in some cases.

ghdna commented 5 years ago

I have the exact same issue with version 3.2.5 installed and jsonFormat: true. If jsonFormat is false, I get no response at all.

Just pushed 3.2.5, which fixes the jsonFormat: false issue. With jsonFormat: false you get the raw data from S3, so it will look messed up.

ghdna commented 5 years ago

I use AWS Firehose. When I query the data in Athena through the AWS web interface, the table looks fine. By the way, I use structs as column types in some cases.

I want to try to replicate this. Can you send me your create table query and the csv exported from Athena?

deliverymanager commented 5 years ago

Have you pushed the new version to npm?

wasserholz commented 5 years ago

I followed this tutorial. https://aws.amazon.com/blogs/machine-learning/build-a-social-media-dashboard-using-machine-learning-and-bi-services/

It looks like the csv parser is not working properly. Edit: empty cells seem to be ignored somehow.

CREATE EXTERNAL TABLE socialanalyticsblog.tweets (
  coordinates STRUCT<type: STRING, coordinates: ARRAY<DOUBLE>>,
  retweeted BOOLEAN, source STRING,
  entities STRUCT<hashtags: ARRAY<STRUCT<text: STRING, indices: ARRAY<BIGINT>>>, urls: ARRAY<STRUCT<url: STRING, expanded_url: STRING, display_url: STRING, indices: ARRAY<BIGINT>>>>,
  reply_count BIGINT, favorite_count BIGINT,
  geo STRUCT<type: STRING, coordinates: ARRAY<DOUBLE>>,
  id_str STRING, timestamp_ms BIGINT, truncated BOOLEAN, text STRING, retweet_count BIGINT, id BIGINT, possibly_sensitive BOOLEAN, filter_level STRING, created_at STRING,
  place STRUCT<id: STRING, url: STRING, place_type: STRING, name: STRING, full_name: STRING, country_code: STRING, country: STRING, bounding_box: STRUCT<type: STRING, coordinates: ARRAY<ARRAY<ARRAY<FLOAT>>>>>,
  favorited BOOLEAN, lang STRING, in_reply_to_screen_name STRING, is_quote_status BOOLEAN, in_reply_to_user_id_str STRING,
  user STRUCT<id: BIGINT, id_str: STRING, name: STRING, screen_name: STRING, location: STRING, url: STRING, description: STRING, translator_type: STRING, protected: BOOLEAN, verified: BOOLEAN, followers_count: BIGINT, friends_count: BIGINT, listed_count: BIGINT, favourites_count: BIGINT, statuses_count: BIGINT, created_at: STRING, utc_offset: BIGINT, time_zone: STRING, geo_enabled: BOOLEAN, lang: STRING, contributors_enabled: BOOLEAN, is_translator: BOOLEAN, profile_background_color: STRING, profile_background_image_url: STRING, profile_background_image_url_https: STRING, profile_background_tile: BOOLEAN, profile_link_color: STRING, profile_sidebar_border_color: STRING, profile_sidebar_fill_color: STRING, profile_text_color: STRING, profile_use_background_image: BOOLEAN, profile_image_url: STRING, profile_image_url_https: STRING, profile_banner_url: STRING, default_profile: BOOLEAN, default_profile_image: BOOLEAN>,
  quote_count BIGINT
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'

01fb8c5f-054d-4dc8-96ae-7262d1a88384.csv.zip

ghdna commented 5 years ago

Have you pushed the new version to npm?

Yeah, 35 minutes ago @3.2.5

deliverymanager commented 5 years ago

I have the exact same issue with version 3.2.5 installed and jsonFormat: true. If jsonFormat is false, I get no response at all.

Just pushed 3.3.5, which fixes the jsonFormat: false issue. With jsonFormat: false you get the raw data from S3, so it will look messed up.

I am testing version 3.2.5. Nothing has changed yet. Did you create a 3.3.5 version?

deliverymanager commented 5 years ago

The problem is the statementType, which in my case is undefined, not DML. You should probably default the statementType param to DML?

ghdna commented 5 years ago

I have the exact same issue with version 3.2.5 installed and jsonFormat: true. If jsonFormat is false, I get no response at all.

Just pushed 3.3.5, which fixes the jsonFormat: false issue. With jsonFormat: false you get the raw data from S3, so it will look messed up.

I am testing version 3.2.5. Nothing has changed yet. Did you create a 3.3.5 version?

3.2.5 is currently the latest; the only change is the fix for formatJson: false. Try querying the sample data, e.g. select * from elb_logs LIMIT 5 in sampledb.

ghdna commented 5 years ago

The problem is the statementType, which in my case is undefined, not DML. You should probably default the statementType param to DML?

Oh, interesting. What is the specific query you are sending to Athena?

deliverymanager commented 5 years ago

SELECT bucket_name, directory_path, file_name, comments, lambda, branch FROM repos.resources WHERE resource='s3' AND bucket_name='reporesources';

deliverymanager commented 5 years ago

OK, so the problem is that the GetQueryExecution call in AWS Athena does not always respond with a value for StatementType. If it does not return the DML value, the JSON transformation fails.

So you should consider the following:

            const s3Output = queryStatus.QueryExecution.ResultConfiguration.OutputLocation,
                statementType = queryStatus.QueryExecution.StatementType || "DML";

which solves the whole open issue with the data etc.
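Expanded into a runnable sketch (the queryStatus shape mirrors the AWS SDK's GetQueryExecution response; the wrapper function and values are placeholders for illustration):

```javascript
function extractOutputInfo(queryStatus) {
  const s3Output = queryStatus.QueryExecution.ResultConfiguration.OutputLocation;
  // Fall back to "DML" when Athena omits StatementType from the response
  const statementType = queryStatus.QueryExecution.StatementType || "DML";
  return { s3Output, statementType };
}

const info = extractOutputInfo({
  QueryExecution: {
    ResultConfiguration: { OutputLocation: "s3://bucket/results/abc123.csv" },
    // StatementType intentionally absent, as in the failing responses
  },
});
console.log(info.statementType); // "DML"
```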

wasserholz commented 5 years ago

@deliverymanager I tried it out, but it did not change anything for me. The problem still exists: the csv is not properly turned into JSON.

This is the output I get. Everything is messed up from the first field on. The raw csv output looks fine.

[{"coordinates":"\"false","retweeted":"<a href=\"\"https://www.sprinklr.com\"\" rel=\"\"nofollow\"\">Sprinklr</a>","source":"{hashtags=[], urls=[{url=https://t.co/MGEy1JOFfe, expanded_url=https://twitter.com/i/web/status/1085862649034915841, display_url=twitter.com/i/web/status/1…, indices=[116, 139]}]}","entities":"0","reply_count":"0\",,\"1085862649034915841","favorite_count":"1547724804539","geo":"true","id_str":"@Rlhoohah Thanks for your interest in this, Rich! I've made our RDS team aware of your request. Please keep an eye… https://t.co/MGEy1JOFfe","timestamp_ms":"0","truncated":"1085862649034915800","text":"false","retweet_count":"low","id":"Thu Jan 17 11:33:24 +0000 2019\",,\"false","possibly_sensitive":"en","filter_level":"Rlhoohah","created_at":"false","place":"733014462551216130","favorited":"{id=120967386, id_str=120967386, name=AWS Support, screen_name=AWSSupport, location=null, url=https://aws.amazon.com/premiumsupport/, description=We welcome AWS support questions and will guide you to the help you need., translator_type=none, protected=false, verified=true, followers_count=27029, friends_count=8, listed_count=350, favourites_count=1439, statuses_count=19059, created_at=Mon Mar 08 04:06:05 +0000 2010, utc_offset=null, time_zone=null, geo_enabled=false, lang=en, contributors_enabled=false, is_translator=false, profile_background_color=C0DEED, profile_background_image_url=http://abs.twimg.com/images/themes/theme1/bg.png, profile_background_image_url_https=https://abs.twimg.com/images/themes/theme1/bg.png, profile_background_tile=false, profile_link_color=1DA1F2, profile_sidebar_border_color=C0DEED, profile_sidebar_fill_color=DDEEF6, profile_text_color=333333, profile_use_background_image=true, profile_image_url=http://pbs.twimg.com/profile_images/1052324554171764736/LQoMp4Xr_normal.jpg, profile_image_url_https=https://pbs.twimg.com/profile_images/1052324554171764736/LQoMp4Xr_normal.jpg, 
profile_banner_url=https://pbs.twimg.com/profile_banners/120967386/1509397772, default_profile=true, default_profile_image=false}","lang":"0"}]

deliverymanager commented 5 years ago

Yes, after defaulting StatementType to DML when the response did not contain it, it only worked perfectly when every key had a value filled with data.

After removing some values, the same problem reappeared.

Would you consider using the csvtojson module?

const csv = require('csvtojson')
const csvFilePath = '<path to csv file>'

csv()
  .fromFile(csvFilePath)
  .then((jsonObj) => {
    console.log(jsonObj);
    /**
     * [
     *  {a:"1", b:"2", c:"3"},
     *  {a:"4", b:"5", c:"6"}
     * ]
     */
  })

// Async / await usage (inside an async function)
const jsonArray = await csv().fromFile(csvFilePath);
wasserholz commented 5 years ago

@deliverymanager that is exactly what I am doing now: just deactivate jsonFormat and parse the csv with csvtojson.

@ghdna It would be great if you used csvtojson instead of a self-written parser. Your rule for cutting up the strings is not robust enough.

function cleanUpDML(lineReader) {
    let headerList = [],
        isFirstRecord = true,
        cleanJson = [],
        noOfColumns = 0,
        singleJsonRow = {};

    return new Promise(function(resolve, reject) {
        lineReader
            .on("line", line => {
                // Fragile: assumes every value is quoted; Athena writes
                // empty columns unquoted (,,), so this split misfires
                line = line.substring(1, line.length - 1).split('","');

                if (isFirstRecord) {
                    headerList = line;
                    isFirstRecord = false;
                } else {
                    singleJsonRow = {};
                    noOfColumns = line.length;
                    for (let i = 0; i < noOfColumns; i++) {
                        if (line[i].length) {
                            singleJsonRow[[headerList[i]]] = line[i];
                        }
                    }
                    cleanJson.push(singleJsonRow);
                }
            })
            .on("close", function() {
                resolve(cleanJson);
            });
    });
}
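A minimal repro of why that split rule misfires, using one of the rows from the CSV posted earlier (plain Node, no dependencies):

```javascript
// Athena quotes non-empty values but writes empty columns as bare commas,
// so the '","' delimiter never appears next to an empty field and several
// columns collapse into one.
const line = '"dynamo","customers",,,,"mpla",,,"table","fetchCustomer","production"';
const fields = line.substring(1, line.length - 1).split('","');
console.log(fields.length); // 4 fields instead of the 11 columns declared
console.log(fields[1]);     // 'customers",,,,"mpla",,,"table'
```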
ghdna commented 5 years ago

I'm trying out these scenarios and will post an update.

ghdna commented 5 years ago

Let me tackle this one by one. @deliverymanager: are you able to get on a join.me session with me? I'd like to screen-share and troubleshoot this with you. I imported the data you provided and ran SELECT bucket_name, directory_path, file_name, comments, lambda, branch FROM test2 WHERE bucket_name='s3' and got

{ Items:
   [ { bucket_name: 's3',
       file_name: 'reporesources',
       lambda: 'resources.json',
       branch:
        'Uploads the resources.gz for the specific repo that comes from the webhook' } ] }

So I'd like to understand what's different between the csv you shared above and the one I imported into Athena.

deliverymanager commented 5 years ago

The reason this works is that every key has a value. Try it without giving a value to some keys; just give them an empty string.

PS: it is not possible to make a session right now...

ghdna commented 5 years ago

Running the same query on Athena, I can see that the keys directory_path & comments are empty. Is this what you mean by empty key values?

Athena: [screenshot]

Athena-express: [screenshot]
deliverymanager commented 5 years ago

Yes exactly. Those are the empty ones

ghdna commented 5 years ago

Yes exactly. Those are the empty ones

Yeah, but as you can see in the screenshot, it's parsing it out and ignoring anything that is empty. That was the change done in 3.1.0.

ghdna commented 5 years ago

Yes exactly. Those are the empty ones

That's why I'd like to see what's going on on your side. Let me know when you have 30 mins later today. There is something else going on here.

ghdna commented 5 years ago

@wasserholz I'm trying to replicate your use case. With the create table command, when I import your .csv file, it comes up as empty in Athena - see this.

Can you confirm if this is the correct file that you are importing with your create table command?

ghdna commented 5 years ago

The problem is the statementType, which in my case is undefined, not DML. You should probably default the statementType param to DML?

Published v3.4.0, which defaults statementType to DML.

wasserholz commented 5 years ago

@ghdna like I said, I don't use csv imports. I gave you the csv output of the table, as I have a stream input setup. athena-express returns the correct raw csv data, which means the data received from Athena is correct. The problem is the code inside athena-express that converts it to JSON.

Take the lineReader and read the csv file I sent you, then run it through your cleanUp functions and look at the output.

As a workaround, I am now reading the raw csv through athena-express and just using csvtojson to convert it to JSON.

wasserholz commented 5 years ago

@ghdna so I debugged the code now, and as I said, the csv parser is not robust enough.

line = line.substring(1, line.length - 1).split('","');

This line messes things up. Replace the whole process with csvtojson and it will work.

ghdna commented 5 years ago

Just pushed v4.0.0, which parses using csvtojson.