srvarey / gbif-occurrencestore

Automatically exported from code.google.com/p/gbif-occurrencestore
0 stars 0 forks source link

Sanitize all fields on sqoop in tables in Portal #28

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
\n \r and \t really affect Hive after a sqoop in

Develop a .sql script that can sanitize the input database before the whole 
process

For example:
mysql> select data_resource_id,author from raw_occurrence_record where 
id=57325451; 
+------------------+-------------+
| data_resource_id | author      |
+------------------+-------------+
|             1472 | J. Agardh  ` | 
+------------------+-------------+

Has a \t character in the Author which blows all processing

Original issue reported on code.google.com by timrobertson100 on 27 Apr 2011 at 6:48

GoogleCodeExporter commented 9 years ago

Original comment by timrobertson100 on 27 Apr 2011 at 6:48

GoogleCodeExporter commented 9 years ago
DELiMITER | 
CREATE FUNCTION clean(v VARCHAR(255)) 
RETURNS VARCHAR(255)
BEGIN
  RETURN TRIM(REPLACE(REPLACE(REPLACE(v, '\t', ''), '\r', ''), '\n', ''));
END
|
DELiMITER ;

Will help, as you can do:

update raw_occurrence_record set author=clean(author) ... etc

Original comment by timrobertson100 on 27 Apr 2011 at 7:03

GoogleCodeExporter commented 9 years ago
See attached 

Original comment by timrobertson100 on 27 Apr 2011 at 7:31

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by oliver.m...@gmail.com on 28 Apr 2011 at 7:24

GoogleCodeExporter commented 9 years ago
Suggest changing this to a VIEW and not an cleaning script

Original comment by timrobertson100 on 28 Apr 2011 at 9:26

GoogleCodeExporter commented 9 years ago
An example of what a script would then look like is:

DELiMITER | 
CREATE FUNCTION clean(v VARCHAR(255)) 
RETURNS VARCHAR(255)
BEGIN
  RETURN TRIM(REPLACE(REPLACE(REPLACE(v, '\t', ''), '\r', ''), '\n', ''));
END
|
DELiMITER ;

CREATE OR REPLACE VIEW view_sqoop_raw_occurrence_record AS
SELECT
  id,data_resource_id...,
  cleanVarchar(scientific_name)...
FROM raw_occurrence_record;

Original comment by timrobertson100 on 28 Apr 2011 at 9:31

GoogleCodeExporter commented 9 years ago
Committed the changes for all portal and agent tables.

Original comment by lars.fra...@gmail.com on 1 May 2011 at 3:30