Good questions. The way cl-ana.table works is optimized for larger data sets of varying formats, so row access is sequential. If you wanted to grab some subset of data, you would iterate over the data and store that subset in another table.
If you want random access capability for a data set, then you can still do it, but it will be dependent on the specific table type. E.g., I think I could quickly write a random access function for hdf-table, but I'm not sure about csv-table since the data is read from the source file.
What might be appropriate would be to provide a table type that loads the data into memory and allows random access and the other kinds of transformations you mentioned more conveniently.
For adding columns, this can be done in two different ways:
1. Creating a new table. This is most appropriate when you are also selecting a subset; in practice it is common to both define new columns and select a subset, since the values of some columns don't make sense for all rows.
2. Using logical fields, or "lfields", in makeres-table. Logical fields are automatically computed during the iteration over the data set; this makes sense for large data sets where the extra hard drive space for all the new columns is more expensive than the small amount of extra computation time for each row of the data. (A rough sketch follows below.)
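To make the lfield option concrete, it looks roughly like this (off the top of my head, so double-check the exact deflfields syntax against the makeres-table documentation; I'm also assuming a source table target named data with a field x):

(deflfields data
    ;; a derived column, computed per row during iteration and never
    ;; written to disk; assumes the data table has a field x:
    ((x-squared (expt (field x) 2))))

Any dotab reduction over the data table can then refer to (field x-squared) just as if it were a physical column.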
On pivoting/transposing data sets: Because cl-ana.table supports source data formats that don't lend themselves to efficient pivoting/transposing, I didn't build that capability into the cl-ana.table API. So it would be similar to the case of random access for a single row: you can do it, but it needs to be done on a per-table basis.
On computing a single value for a column: This is conceptualized as a reduction of the table, which can involve any number of columns and even the logical fields that I mentioned before. This kind of calculation is best expressed in makeres-table using the "dotab" operator.
Cool! I still need to study the DOP stuff more. Specifically, after looking at the wiki, it is still not clear to me how to use the dotab operator.
According to Hands-on Machine Learning..., it is very common to apply many transformations to your data sets:
A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply.
In addition, it seems like the first step of building machine learning models is to explore the rough shape of the data. Most of the operations I've been exposed to so far have been column-oriented operations, e.g.:
CL-USER> *dat*
#<CL-ANA.REUSABLE-TABLE:REUSABLE-TABLE {100415B273}>
CL-USER> (summarize *dat*)
"RangeIndex: 20640 entries, -1 to -1
Data columns (total 10 columns):
longitude 20640 populated (SINGLE-FLOAT)
latitude 20640 populated (SINGLE-FLOAT)
housing_median_age 20640 populated (SINGLE-FLOAT)
total_rooms 20640 populated (SINGLE-FLOAT)
total_bedrooms 20433 populated (SINGLE-FLOAT)
population 20640 populated (SINGLE-FLOAT)
households 20640 populated (SINGLE-FLOAT)
median_income 20640 populated (SINGLE-FLOAT)
median_house_value 20640 populated (SINGLE-FLOAT)
ocean_proximity 20640 populated (SYMBOL STRING)
types: (STRING SYMBOL SINGLE-FLOAT)"
CL-USER> (analyze *dat*)
attr | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
And then there's dropping columns that would interfere with producing a useful model, and various operations done on single columns.
I think all of this is probably possible with table reductions and logical fields, but I think it would take multiple passes over the table, each O(N) where N is the number of rows. If we were to first transpose the table into a columnar table and then do these kinds of summarizing tasks, we could do one pass over the table and then O(N) sequential reads, where N is the number of columns. In the example data in the book, rows = 20640 and columns = 10. Multiply these by the number of column-based operations you'd like to do.
I have a prototype columnar-table type which is created by calling pivot. It takes in create-table and open-table functions much like reusable-table, and writes columns to rows of the underlying table. At worst, for underlying tables which are not in memory, it will only store N column values in memory, where N is the number of rows. I think this should allow you to pivot pretty large data sets.
This has allowed me to write a prototype table-get-field method which, for columnar-table, is very fast. I can imagine this being a generic method for all tables; on non-columnar tables, it would just be a convenience function to table-reduce a column into a vector.
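For non-columnar tables, the fallback would basically be a single sequential pass that collects the column. A rough sketch (untested; collect-column is just an illustrative name, and I'm assuming the table-load-next-row / table-get-field row interface with keyword field symbols):

(defun collect-column (table field)
  "Collect every value of FIELD (e.g. :|longitude|) into a vector with
one sequential pass over TABLE."
  (let ((column (make-array 0 :adjustable t :fill-pointer 0)))
    (loop while (cl-ana.table:table-load-next-row table)
          do (vector-push-extend
              (cl-ana.table:table-get-field table field)
              column))
    column))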
Anyway, that's a lot of information. What do you think of this approach? I think it would be beneficial to the summarizing-type functionality I am currently focused on adding to cl-ana.
Here's some anecdotal benchmarking on the two approaches:
CL-USER> *dat*
#<CL-ANA.REUSABLE-TABLE:REUSABLE-TABLE {10040FB273}>
CL-USER> (cl-ana.reusable-table:internal-table *dat*)
#<CL-ANA.CSV-TABLE:CSV-TABLE {100449B1F3}>
CL-USER> (time (cl-ana.statistics:mean (field-values *dat* :|longitude|)))
Evaluation took:
0.643 seconds of real time
0.643913 seconds of total run time (0.643884 user, 0.000029 system)
[ Run times consist of 0.009 seconds GC time, and 0.635 seconds non-GC time. ]
100.16% CPU
1,358,608,296 processor cycles
164,402,416 bytes consed
CL-USER> *pivot*
#<CL-ANA.REUSABLE-TABLE:REUSABLE-TABLE {1002A9FDE3}>
CL-USER> (cl-ana.reusable-table:internal-table *pivot*)
#<COLUMNAR-TABLE {10033A5483}>
CL-USER> (backing-table (cl-ana.reusable-table:internal-table *pivot*))
#<CL-ANA.REUSABLE-TABLE:REUSABLE-TABLE {1002DBB8D3}>
CL-USER> (cl-ana.reusable-table:internal-table
(backing-table (cl-ana.reusable-table:internal-table *pivot*)))
#<CL-ANA.CSV-TABLE:CSV-TABLE {100396CA23}>
CL-USER> (time (cl-ana.statistics:mean
(field-values (cl-ana.reusable-table:internal-table *pivot*) 'longitude)))
Evaluation took:
0.083 seconds of real time
0.082860 seconds of total run time (0.082860 user, 0.000000 system)
100.00% CPU
174,490,190 processor cycles
5,893,264 bytes consed
Now imagine many column-based operations being performed, perhaps on a larger data set, and I think this approach becomes compelling.
Ah, you're actually walking down the same path that led me to make the DOP makeres-table library. Without makeres and its transformations of the target table, it is extremely inefficient to analyze data in the column-wise way that is usually done and is very natural to do. However, what makeres-table does is figure out the minimum number of passes required to calculate whatever reductions you want, and then compute them in parallel in as few passes over the data set as the dependencies require.
For your example, all of the reductions ("mean", "count", "std", etc.) would only require one pass mathematically, and so if you define them with "dotab" then they would be computed in a single pass over the table.
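For instance, the standard deviation only needs a running count, sum, and sum of squares, since s = sqrt((Σx^2 - (Σx)^2/n) / (n - 1)); that is exactly the kind of state a "dotab" reduction accumulates as it iterates.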
I'm completely fine with including pivoting/transposing utilities though; it's a generally useful operation. The problem I had with using pivoting/transposing for my analysis is that I was working with multiple TB of data stored in a tape library, so I came up with the DOP makeres-table approach.
I really need to sit down and learn the makeres stuff better (I'm quasi crunched for time at the moment). Will makeres compute the minimum number of passes across multiple analyses? E.g., if I want to calculate both what's in analyze above and what's in summarize above, will makeres still find the minimum number of passes?
I'm happy to hear that this functionality is still welcome. I think it would still be useful when working with data where you're unsure of the exact operations you want to perform and are just throwing commands at the data set in real time (probably not your use case!).
Not to overload this ticket too much, but another thing I'm running into as I've been hacking is the fact that fields is not evaluated in the do-table macro. With a lot of these summarizing methods, part of what you want summarized (apparently) is the list of fields, which you don't know at macro-expansion time. I'll just leave it at that and see if you have any input :)
(require 'cl-ana)
(in-package :cl-ana)
;; Define the project:
(defproject katherine-example-project
    "/path/to/result/log/"
  ;; The basic transformations to load for this project:
  (list #'macrotrans #'tabletrans #'progresstrans)
  ;; Cache size is up to you and your system memory
  (fixed-cache 5))
;; Define source data
(defres (data path)
  "/path/to/data.csv")

(defres data
  (srctab (csv-opener (res (data path))
                      :read-from-string t)))
;; Some macros that should be included with makeres-table that I'll
;; add:
(define-res-macro dotab-nrows (src)
  "Counts number of rows in table."
  `(dotab ,src
       ((count 0))
       count
     (incf count)))
(define-res-macro dotab-mean (src expr)
  "Computes the mean of some expression from the table."
  `(dotab ,src
       ((count 0)
        (sum 0))
       (/ sum count)
     (incf count)
     (incf sum ,expr)))
(define-res-macro dotab-standard-deviation (src expr)
  "Computes single-pass standard deviation for some expression in the
table."
  `(dotab ,src
       ((count 0)
        (sum 0)
        (ssum 0))
       (sqrt (/ (- ssum (/ (expt sum 2) count))
                (1- count)))
     (let ((expr ,expr))
       (incf count)
       (incf sum expr)
       (incf ssum (expt expr 2)))))
;;; Define these for our data set:
(defres (data nrows)
  (dotab-nrows (res data)))

(defres (data x mean)
  (dotab-mean (res data)
              ;; assuming there is a field x in the data
              (field x)))

(defres (data x std)
  (dotab-standard-deviation (res data)
                            ;; same assumption
                            (field x)))
;;;; From the code, these are defined but might not have been computed
;;;; yet. To actually compute them, use the #'makeres function. If
;;;; they have already been computed, you can just run #'load-project
;;;; and then #'makeres to ready the target-table for accessing the
;;;; computed values using the "res" macro.
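;; Concretely, once everything above has been evaluated, the REPL
;; workflow (going by the comments above) would be something like:
(makeres)            ; compute whatever targets are still pending
(res (data x mean))  ; access the computed mean of field x
(res (data x std))   ; access the computed standard deviation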
The above example should give an idea of how "makeres" lets you compute various reductions.
And yes: So long as you have defined all your targets with "dotab", makeres-table will figure out the minimum number of passes required for all the targets that need to be computed.
If you want to analyze data without knowing the list of fields at compile time, then you can still do it but you are forced to fall back to a non-macro approach using the functions #'table-field-names, #'table-get-field, etc.
And you are correct: The "fields" argument to do-table is not evaluated because it needs to be supplied at compile time to improve performance. You can actually use NIL as the fields argument and do-table will bind variables in a less efficient way after figuring out the available fields at run time. It might actually be possible to hack the code I used in do-table to provide the kinds of summary information you're after; it uses the function approach I listed above along with setting variable values in the loop.
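As a sketch of that function-based approach (untested; it assumes whatever #'table-field-names returns can be handed straight back to #'table-get-field, and count-populated is just an illustrative name):

(defun count-populated (table)
  "Count non-nil values for every field of TABLE in a single pass,
without knowing the field list at compile time."
  (let ((fields (cl-ana.table:table-field-names table))
        (counts (make-hash-table :test 'equal)))
    (loop while (cl-ana.table:table-load-next-row table)
          do (dolist (f fields)
               (when (cl-ana.table:table-get-field table f)
                 (incf (gethash f counts 0)))))
    counts))

That is essentially the shape of the "populated" counts in the summary you posted above, computed without macro-expansion-time knowledge of the fields.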
Say I have the following table:
There are several operations we may want to perform on this table in a columnar fashion:
(cl-ana.statistics:standard-deviation (cl-ana.table:table-column "A"))
I'm still learning cl-ana, so I'm unsure if there's already an established way to perform these types of operations. Is there?
If not, what might we want this type of access pattern to look like? Currently, I'm thinking a
(defun pivot (table create-table-fn) ...)
function. This would transform the table into a columnar table (in-memory or not, depending on what type of table the closure is creating), which would make it easy to get all the values in a column by getting a single row (which, while we're here, is there already a way to get a specific row of data?). So this would make the above table look something like this:
What does this mean for cl-ana's underlying table representation?