qubole / s3-sqs-connector

A library for reading data from Amazon S3 with optimised listing via Amazon SQS, using Spark SQL Streaming (Structured Streaming).
http://www.qubole.com
Apache License 2.0

S3-SQS source does not populate partition columns in the dataframe #2

Open DipeshV opened 4 years ago

DipeshV commented 4 years ago

Hi, I am using this "s3-sqs" connector with Spark Structured Streaming and Delta Lake to process incoming data in partitioned S3 buckets. The problem I am facing with the "s3-sqs" source is that files are read directly and returned as a dataframe/dataset without the partition columns. Hence, when we merge the source and target dataframes, all the partition columns come back as HIVE_DEFAULT_PARTITION.

Do you have any solution/workaround to add the partition columns as part of the dataframe?
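For reference, a minimal sketch of the kind of setup involved. The schema, queue URL, and option names such as sqsUrl and fileFormat are illustrative assumptions, not taken from this thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("s3-sqs-demo").getOrCreate()

// Hypothetical schema for the incoming files. Partition columns such as
// `dt` live only in the S3 path (e.g. s3://bucket/table/dt=2020-01-01/...)
// and are not part of the file contents, so they are missing from the frame
// the source returns -- which is the problem described above.
val schema = new StructType()
  .add("id", LongType)
  .add("value", StringType)

val inputDf = spark
  .readStream
  .format("s3-sqs")
  .schema(schema)
  .option("sqsUrl", "https://sqs.us-east-1.amazonaws.com/1234/my-queue") // assumed option name
  .option("fileFormat", "json")                                          // assumed option name
  .load()
```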

Thanks and regards, Dipesh Vora

abhishekd0907 commented 4 years ago

@DipeshV seems like a bug. Thanks for pointing this out. I will work on the fix.

DipeshV commented 4 years ago

Hi Abhishek,

I am currently adding the partitions manually, which makes my code a bit messy and means it cannot be reused as is when adding new integrations. Do we have any fix for this?

Thanks, Dipesh

abhishekd0907 commented 4 years ago

@DipeshV yeah, I'll raise a PR for the fix today.

abhishekd0907 commented 4 years ago

@DipeshV I've created a pull request. Can you build a jar from the new branch and try it out?

abhishekd0907 commented 4 years ago

@DipeshV Did you get a chance to try out the new code? Does it solve your use case?

DipeshV commented 4 years ago

@abhishekd0907 - I haven't checked the new code yet, since I had already added the partitions manually using input_file_name(). I will test it with the new code though.
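For anyone hitting the same issue before the fix is merged, a sketch of the workaround mentioned here, assuming a Hive-style path layout such as s3://bucket/table/dt=2020-01-01/... (the column name `dt` and the regex are illustrative; `inputDf` is the streaming dataframe from the earlier sketch):

```scala
import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

// input_file_name() returns the path of the file each row was read from,
// e.g. s3://bucket/table/dt=2020-01-01/part-0000.json, so the partition
// value can be recovered from the path itself.
val withPartitions = inputDf
  .withColumn("dt", regexp_extract(input_file_name(), "dt=([^/]+)", 1))
```

This recovers the partition columns at the cost of hard-coding the path pattern per table, which is the maintenance burden described above.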