mrchristine / db-migration

Databricks Migration Tools
Other
43 stars 27 forks source link

DDL extraction issue with control characters #44

Closed arjun-hareendran closed 3 years ago

arjun-hareendran commented 3 years ago

Hello

Just found some inconsitencey while extracting a DDL that make use of control characters.

Below was the DDL that as used to create the table

    CREATE EXTERNAL TABLE db.tab
    (
    h_ls_hash array<string>,
    ls_id STRING,
    bin array<STRING>,
    Class1  array<int>,
    Class1_valueString  array<string>,
    Class1_valueFrom  array<float>,
    Class1_valueTo  array<float>,
    Class2  array<int>,
    Class2_valueString  array<string>,
    Class2_valueFrom  array<float>,
    Class2_valueTo  array<float>,
    Class3  array<int>,
    Class3_valueString  array<string>,
    Class3_valueFrom  array<float>,
    Class3_valueTo  array<float>,
    load_ts timestamp COMMENT 'EN: load timestamp | DE: Zeitstempel für das Laden des Datensatzes',
    record_source string COMMENT 'EN: Source Name | DE: Quellenname'
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\u0001'
    STORED AS TEXTFILE
    location 'dbfs:/loc';

but when the DDL's were extracted using the migration tool it resulted in the below.

<div class="ansiout">CREATE EXTERNAL TABLE `db`.`tab`(`h_ls_hash` ARRAY&lt;STRING&gt;, `ls_id` STRING, `bin` ARRAY&lt;STRING&gt;, `Class1` ARRAY&lt;INT&gt;, `Class1_valueString` ARRAY&lt;STRING&gt;, `Class1_valueFrom` ARRAY&lt;FLOAT&gt;, `Class1_valueTo` ARRAY&lt;FLOAT&gt;, `Class2` ARRAY&lt;INT&gt;, `Class2_valueString` ARRAY&lt;STRING&gt;, `Class2_valueFrom` ARRAY&lt;FLOAT&gt;, `Class2_valueTo` ARRAY&lt;FLOAT&gt;, `Class3` ARRAY&lt;INT&gt;, `Class3_valueString` ARRAY&lt;STRING&gt;, `Class3_valueFrom` ARRAY&lt;FLOAT&gt;, `Class3_valueTo` ARRAY&lt;FLOAT&gt;, `load_ts` TIMESTAMP COMMENT &#39;EN: load timestamp | DE: Zeitstempel für das Laden des Datensatzes&#39;, `record_source` STRING COMMENT &#39;EN: Source Name | DE: Quellenname&#39;)
ROW FORMAT SERDE &#39;org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe&#39;
WITH SERDEPROPERTIES (
  &#39;field.delim&#39; = &#39;&#39;,
  &#39;serialization.format&#39; = &#39;&#39;
)
STORED AS
  INPUTFORMAT &#39;org.apache.hadoop.mapred.TextInputFormat&#39;
  OUTPUTFORMAT &#39;org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat&#39;
LOCATION &#39;dbfs:/loc&#39;
TBLPROPERTIES (
  &#39;transient_lastDdlTime&#39; = &#39;1606221712&#39;
)

</div>

Can someone help me on this ?

mrchristine commented 3 years ago

@arjun-hareendran can you try with the new option here, --metastore-unicode to verify your issue is resolved?

mrchristine commented 3 years ago

@arjun-hareendran were you able to get the new flag a try?