Closed OpenCoderX closed 8 years ago
Currently I profile data as I go (incrementally) developing the ETL itself. I've not yet built reusable components to profile in a generic fashion, so it's more an ad-hoc each time I write a job.
Here are various techniques I use currently.
I pretty print rows with awesome_print
like this:
transform do |row|
ap row
ap row.slice(:field, :other_field)
row
end
I limit the data to a subset with something like this:
limit = Integer(ENV['LIMIT'] || 0)
transform do |row|
@counter ||= 0
@counter += 1
abort("Limit reached") if @counter > limit
row
end
To check uniqueness I can use things like:
unique_values = []
transform do |row|
value = row[:field]
unique_values << value unless unique_values.include?(value)
row
end
post_process do
ap unique_values
end
Or for something with a lot more unique values, I could do:
destination CSVDestination, file: 'profile.csv', fields: :my_field
post_process do
system! "cat profile.csv | uniq"
end
I can also inject tools like csvfix
into the mix etc.
Another nice technique is to run textql
on top of the outputted CSV (even if my ultimate goal is a database destination), to quickly get interesting counts.
That's just a start and it would require a series of articles, but in short I use Ruby (possibly extracting small reusable helpers) to do the data profiling if the scale works nicely, and I turn to faster third-party tools which I can from Ruby if I need something to work on larger volumes.
Let me know if it properly answers your question!
Will document in #29. Closing!
What is the best way to generate a data profile using Kiba? Before I start to build an etl job i like to determine various facts about the dataset like uniqueness, completeness, data types, null counts, formats, min, max, central tendency, possible keys etc. If there is not an existing method in Kiba, are there any suggestions on creating a profile in a ruby way? There are many commercial tools that can profile a dataset but I want to move the profiling step into my ruby codebase. Any thoughts? How do you all handle this now?