saurfang / spark-sas7bdat

Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL
http://spark-packages.org/package/saurfang/spark-sas7bdat
Apache License 2.0

Issue: A trailing tab after a string is removed when the file is read by the jar #62

Open jatinbajaj777 opened 3 years ago

jatinbajaj777 commented 3 years ago

A trailing tab character after any string value in a column is removed when the file is read by the jar. As a result, the string lengths in the data frame below differ from those in the source data.

+---------+------------+-----------------+------------+
|Text_Only|Text_Tab_Beg|     Text_Tab_Mid|Text_Tab_End|
+---------+------------+-----------------+------------+
| ABCDEFGH|    ABCDEFGH|ABCDEFGH IJKLMNOP|    ABCDEFGH|
+---------+------------+-----------------+------------+

Function.length gives the following results:

Text_Only=8
Text_Tab_Beg=9
Text_Tab_End=8

The input data is therefore altered: the trailing tab is removed, while the leading tab is kept.
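The reported lengths are consistent with a reader that strips every trailing whitespace character, rather than only the space padding that fixed-width character columns carry. That is an assumption, not something confirmed in this thread. A minimal Python sketch of the difference follows; the column width of 20 and both helper functions are hypothetical and are not part of spark-sas7bdat.

```python
# Sketch only: models a fixed-width character field padded with spaces.
# WIDTH and the helper functions are hypothetical illustrations.
WIDTH = 20


def pad(value: str) -> str:
    """Pad a value with trailing spaces up to the fixed column width."""
    return value.ljust(WIDTH)


def strip_all_trailing_whitespace(raw: str) -> str:
    """Drop trailing spaces and tabs (matches the lengths reported above)."""
    return raw.rstrip(" \t")


def strip_padding_only(raw: str) -> str:
    """Drop only the trailing space padding, keeping a trailing tab."""
    return raw.rstrip(" ")


values = {
    "Text_Only": "ABCDEFGH",
    "Text_Tab_Beg": "\tABCDEFGH",
    "Text_Tab_End": "ABCDEFGH\t",
}

lengths_all = {k: len(strip_all_trailing_whitespace(pad(v))) for k, v in values.items()}
lengths_pad = {k: len(strip_padding_only(pad(v))) for k, v in values.items()}

print(lengths_all)
print(lengths_pad)
```

With the first helper, Text_Tab_End comes out one character shorter than with the second, while Text_Tab_Beg is unaffected because its tab is at the start of the value.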

thesuperzapper commented 3 years ago

@jatinbajaj777 Can you please provide a basic SAS7BDAT file which causes this error?

jatinbajaj777 commented 3 years ago

Attaching the sample SAS7BDAT file which covers all above scenarios.

atest.sas7bdat.txt

Also, here is the code used to generate the SAS7BDAT file:

Text_Only     = "ABCDEFGH";
Text_Tab_Beg  = cats( '09'x, "ABCDEFGH" );
Text_Tab_End  = cats( "ABCDEFGH", '09'x );
Text_Tab_Mid  = cats( "ABCDEFGH", '09'x, "IJKLMNOP" );

thesuperzapper commented 3 years ago

@jatinbajaj777 Can you also provide that same data in a common format (like CSV) so we can see what it should look like?
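One way such a reference CSV could be produced is sketched below in Python. The row values are taken from the SAS assignments earlier in the thread; quoting every field is an illustrative choice so that leading and trailing tabs stay visible inside the quotes, and it is not a format anyone in the thread requested.

```python
import csv
import io

# Expected values, taken from the SAS assignments in this thread.
rows = [
    {
        "Text_Only": "ABCDEFGH",
        "Text_Tab_Beg": "\tABCDEFGH",
        "Text_Tab_End": "ABCDEFGH\t",
        "Text_Tab_Mid": "ABCDEFGH\tIJKLMNOP",
    }
]

# Write to an in-memory buffer; QUOTE_ALL wraps every field in quotes,
# which makes leading and trailing tabs unambiguous.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# repr() shows tab characters as \t instead of invisible whitespace.
print(repr(csv_text))

# Read the text back to confirm that the tabs survive the round trip.
parsed = next(csv.DictReader(io.StringIO(csv_text)))
print({name: len(value) for name, value in parsed.items()})
```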