thbar / kiba

Data processing & ETL framework for Ruby
https://www.kiba-etl.org

Profiling of data #22

Closed OpenCoderX closed 8 years ago

OpenCoderX commented 9 years ago

What is the best way to generate a data profile using Kiba? Before I start to build an ETL job, I like to determine various facts about the dataset: uniqueness, completeness, data types, null counts, formats, min, max, central tendency, possible keys, etc. If there is not an existing method in Kiba, are there any suggestions on creating a profile in a Ruby way? There are many commercial tools that can profile a dataset, but I want to move the profiling step into my Ruby codebase. Any thoughts? How do you all handle this now?

thbar commented 9 years ago

Currently I profile data incrementally, as I develop the ETL itself. I've not yet built reusable components to profile in a generic fashion, so it's more of an ad-hoc exercise each time I write a job.

Here are various techniques I use currently.

I pretty-print rows with awesome_print like this:

# requires the awesome_print gem; Hash#slice needs Ruby >= 2.5 or ActiveSupport
transform do |row|
  ap row
  ap row.slice(:field, :other_field)
  row
end

I limit the data to a subset with something like this:

limit = Integer(ENV['LIMIT'] || 0)

transform do |row|
  @counter ||= 0
  @counter += 1
  # only enforce the limit when LIMIT is actually set (non-zero)
  abort("Limit reached") if limit > 0 && @counter > limit
  row
end

To check uniqueness, I can use something like:

unique_values = []

transform do |row|
  value = row[:field]
  # collect each distinct value only once
  unique_values << value unless unique_values.include?(value)
  row
end

post_process do
  ap unique_values
end
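
If the field has a lot of distinct values, scanning the Array with include? on every row gets slow. A possible variation (not in the original snippet) is to rely on a Set from Ruby's standard library instead:

require 'set'

unique_values = Set.new

transform do |row|
  # Set ignores duplicates and offers near constant-time lookups
  unique_values << row[:field]
  row
end

post_process do
  ap unique_values.size
  ap unique_values.to_a
end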

Or for something with a lot more unique values, I could do:

destination CSVDestination, file: 'profile.csv', fields: :my_field

post_process do
  # uniq only drops adjacent duplicates, so sort the file first
  system! "cat profile.csv | sort | uniq"
end
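
Note that system! above is not a Ruby built-in, but a thin wrapper around Kernel#system that fails loudly; a minimal version could look like this:

def system!(command)
  # raise instead of silently returning false when the shell command fails
  system(command) or raise "Command failed: #{command}"
end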

I can also inject tools like csvfix, etc., into the mix.

Another nice technique is to run textql on top of the output CSV (even if my ultimate goal is a database destination), to quickly get interesting counts.
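
For instance, something along these lines (flags and table naming depend on the textql version, so treat it as a sketch):

post_process do
  # textql names the table after the CSV file, here "profile"
  system! %{textql -header -sql "select count(distinct my_field) from profile" profile.csv}
end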

That's just a start and it would deserve a series of articles, but in short: I use Ruby (possibly extracting small reusable helpers) for data profiling when the scale allows it, and I turn to faster third-party tools, which I can drive from Ruby, when I need to work on larger volumes.
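
To illustrate what such a small reusable helper could look like (this is only a sketch, not something shipped with Kiba, and it assumes a Kiba version where transform classes can define close), a class-based transform could accumulate basic stats for one field and dump them at the end of the run:

class FieldProfiler
  # field: the row key to profile, e.g. :my_field
  def initialize(field)
    @field = field
    @count = 0
    @nils = 0
    @min = nil
    @max = nil
  end

  def process(row)
    @count += 1
    value = row[@field]
    if value.nil?
      @nils += 1
    else
      # assumes the non-nil values are comparable with each other
      @min = value if @min.nil? || value < @min
      @max = value if @max.nil? || value > @max
    end
    row
  end

  # called once at the end of the run (transform close requires Kiba >= 2)
  def close
    ap field: @field, rows: @count, nils: @nils, min: @min, max: @max
    nil
  end
end

# in the job definition:
transform FieldProfiler, :my_field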

Let me know if this properly answers your question!

thbar commented 8 years ago

Will document in #29. Closing!