stsievert / swix

Swift Matrix Library
http://docs.stsievert.com/swix/
MIT License

Enhance read csv function #21

Closed arondes closed 8 years ago

arondes commented 8 years ago

The function name is read_csv2. It takes two arguments: the filename and whether the file contains a header (i.e. the first row holds variable names rather than data). I made several changes:

  1. As mentioned above, CSV files often use the first row for variable names, so setting header=true skips the first row. The name "header" follows R's convention.
  2. CSV files are often created on Windows, where the line break is "\r\n" instead of "\n". I check for "\r" and delete every occurrence.
  3. The data in a CSV file may not be purely numeric. It may be categorical, e.g. "M" and "F", or "Y" and "N". The original read_csv ignores these categorical variables. I use a different approach: I create a dictionary [String:Int] whose key is the categorical value (e.g. "M") and whose value is an Int (e.g. 0). All the keys are also stored in a set, so identical values count as one element. When we encounter a categorical value, we first check whether it is in the set. If it is, we look it up in the dictionary and add its value to the data; otherwise we create a key-value pair and insert the key into the set. So if the column is M,F,F,M, it becomes 0,1,1,0 in the final data.

stsievert commented 8 years ago

Could we use optional args for most of these features? At first glance it seems like these could be integrated into the current read_csv with "newline=\n, header=True, categorical=True".

arondes commented 8 years ago

Yes. I think adding these options is very easy. I will make some changes today. I also think that if we had something like a DataFrame in addition to matrix, things would be more "natural". A DataFrame could hold different types of data in different columns, and each column could have a name.

stsievert commented 8 years ago

Hm, swix is designed to be a sharp tool that makes working with numerical values easy, and I think a pandas-like module is worthy of a separate repo. I'd definitely be inclined to provide options for categorical data, but I'd prefer to work within the existing framework (e.g., converting each category to a unique integer and returning which categories are which integers).

arondes commented 8 years ago

Yes, I agree with that. Actually, creating a whole new DataFrame infrastructure seems daunting to me right now. I've only been learning Swift for one week :P I will continue to work on this project to see if I can add any value.

stsievert commented 8 years ago

This seems like a useful pull request -- I'd like to see optional arguments added.

arondes commented 8 years ago

Update:

Now it takes three arguments: the file name, whether the first row should be included in the data, and which row is used to detect the data type.

I did not add a prefix argument because I leave the responsibility of supplying the complete filename to the user.

Line breaks and categorical values are now handled automatically, so the user does not need to understand the technical details.

I have encountered three types of line breaks: \n, \r and \r\n. I unify them all as \n.
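
For reference, a minimal sketch of this normalization in current Swift (the thread predates Swift 3). The helper name normalizeLineBreaks is hypothetical and is not part of swix; it walks Unicode scalars so that "\r\n" is handled as two code points rather than one Character.

```swift
// Hypothetical helper, not swix API: map "\r\n" and a lone "\r" to "\n".
func normalizeLineBreaks(_ text: String) -> String {
    var out = String.UnicodeScalarView()
    var previousWasCR = false
    for scalar in text.unicodeScalars {
        if scalar == "\r" {
            // Emit the newline now; a directly following "\n" is skipped below.
            out.append("\n")
            previousWasCR = true
        } else if scalar == "\n" {
            if !previousWasCR { out.append("\n") }
            previousWasCR = false
        } else {
            out.append(scalar)
            previousWasCR = false
        }
    }
    return String(out)
}

// Mixed Windows, old Mac and Unix line breaks in one input.
let rows = normalizeLineBreaks("a,b\r\n1,2\r3,4\n")
    .split(separator: "\n")
    .map(String.init)
// rows == ["a,b", "1,2", "3,4"]
```

Replacing "\r\n" before a lone "\r" (or, as here, skipping the "\n" that follows a "\r") avoids turning one Windows line break into two empty-line-producing newlines.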

By default, I use the first row of the data to automatically detect whether a column contains numeric or categorical data.

The "first row" is determined automatically whether or not the file includes a header.

I leave this option to the user because sometimes the first row contains missing values, so people may want to use a different one.

All categorical variables are translated into numeric values.

To keep things rigorous, every column has its own coding. For example, column one may have "Red, Yellow, Blue" while column two may have "Yellow, Blue". They are then translated as 0,1,2 and 0,1. The Yellow in the first column is treated as different from the Yellow in the second column.
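
The per-column coding can be sketched as follows. This is an illustration under stated assumptions, not the pull request's code: encodeColumns is a hypothetical name, the input is assumed to be rectangular, and one dictionary per column is used here in place of the single dictionary with column-prefixed keys described later in the thread.

```swift
// Hypothetical helper, not swix API: give every column its own value-to-integer coding.
// Assumes every row has the same number of fields as the first row.
func encodeColumns(_ rows: [[String]]) -> (codes: [[Int]], levels: [[String: Int]]) {
    let columnCount = rows.first?.count ?? 0
    var levels = [[String: Int]](repeating: [:], count: columnCount)
    var codes: [[Int]] = []
    for row in rows {
        var encoded: [Int] = []
        for (column, value) in row.enumerated() {
            if let code = levels[column][value] {
                // Value already seen in this column: reuse its code.
                encoded.append(code)
            } else {
                // First occurrence: the next unused integer for this column.
                let code = levels[column].count
                levels[column][value] = code
                encoded.append(code)
            }
        }
        codes.append(encoded)
    }
    return (codes, levels)
}

let result = encodeColumns([["Red", "Yellow"], ["Yellow", "Blue"], ["Blue", "Yellow"]])
// result.codes == [[0, 0], [1, 1], [2, 0]]
// "Yellow" is 1 in column 0 but 0 in column 1.
```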

There are some CSV files at https://support.spatialkey.com/spatialkey-sample-csv-data/ . I tested "Sample insurance portfolio", "Real estate transactions" and "Sales transactions". All work as expected.

I commented out the original function in case it is needed for debugging.

I also added a read_csv_header function. It can be used to get the header information (usually the variable names) of a csv file.

stsievert commented 8 years ago

Also, I'd like to have read_csv return (optionally) two values: the header information and the array (as opposed to making another function, read_csv_header). I'm pretty sure you can define two functions of the same name that return different things. I would have

func read_csv(..., skip_header: Bool = false) -> (matrix, [String]) {
    let x: matrix = read_csv(..., skip_header: true)
    let names = // read header out...
    return (x, names)
}
func read_csv(..., skip_header: Bool = true) -> matrix {
    // read comma separated values into x
    return x
}

arondes commented 8 years ago

read_csv can automatically detect whether a column in the data should be treated as numeric or categorical. To do this it needs to inspect a particular row. By default this is the first row of data. For example, if it sees "F, 1.0, Green, 100" then it assumes columns 0 and 2 are categorical while the others are numeric. However, sometimes the user may find that the first row contains a missing value, in which case it is better to inspect another row. This is why I provide this option.

Suppose we choose detectRow=1.

First we pick the row we are going to inspect and split it by ",":

let test = y[detectRow + startrow - 1].componentsSeparatedByString(",")

If startrow=1 (i.e. we have a header) then we pick y[1], which is actually the second row in the file. Since the original file uses the first row to store the header, y[1] is the first row of data.

If startrow=0 (i.e. we have no header) then we pick y[0], which is also the first row of the data.

Next, we create an array with one entry per column, with every entry initialized to -1.

categorical_col = Array(count:test.count, repeatedValue:-1)

We go through the row; if a column cannot be converted to a Double, we set the corresponding entry of categorical_col to 0.

columns = 0
for testtext in test {
    if Double(testtext) == nil {
        categorical_col[columns] = 0
    }
    columns = columns + 1
}

This finishes the initial detection.

Example csv:

1,F,G,10
2,F,B,100

After initial detection, categorical_col is [-1,0,0,-1]

When we read the first row, we check all the columns that are categorical. Since this is the first time we see "F" in column 1 and "G" in column 2, we assign "1F"=0 and "2G"=0. We also update categorical_col to [-1,1,1,-1].

When we read the second row, since we already saw "F" in column 1, we still translate it as 0. But this is the first time we see "B" in column 2, so we check categorical_col, find that the next value to be used is 1, and assign "2B"=1. The code is

if levels.contains(name) {
    array.append(Double(factor[name]!))
} else {
    factor[name] = categorical_col[columns]
    levels.insert(name)
    categorical_col[columns] = categorical_col[columns] + 1
    array.append(Double(factor[name]!))
}

In the end, categorical_col will be something like [-1,-1,1,3,5,2,1,-1,-1]. "-1" means numeric columns. Positive values indicate how many different levels we have seen in this particular column.
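
The detection step described above can be sketched in current Swift as follows. The function name and parameter labels are hypothetical and not swix API; the index arithmetic mirrors detectRow + startrow - 1 from the comment, and a Bool array stands in for the -1/0 markers of categorical_col.

```swift
// Hypothetical helper, not swix API: probe one data row to find categorical columns.
// detectRow is 1-based within the data rows; a header shifts the index by one line.
func detectCategoricalColumns(lines: [String], hasHeader: Bool, detectRow: Int = 1) -> [Bool] {
    let startRow = hasHeader ? 1 : 0
    let fields = lines[detectRow + startRow - 1]
        .split(separator: ",", omittingEmptySubsequences: false)
    // A field that does not parse as a Double marks its column as categorical.
    return fields.map { Double($0) == nil }
}

let lines = ["id,sex,color,score", "1,F,G,10", "2,F,B,100"]
let flags = detectCategoricalColumns(lines: lines, hasHeader: true)
// flags == [false, true, true, false]
```

An empty field also fails to parse as a Double, so probing a row with a missing value would mislabel a numeric column as categorical, which is the case the configurable row is meant to cover.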

arondes commented 8 years ago

As a statistician, most of the CSV files I encounter use row 0 to store variable names, so I pay special attention to it. read_csv currently drops that information, but the user might still want it.

I have thought about several possible ways:

  1. We can return multiple values, like a tuple (String array, matrix).
  2. We can construct a unified type and return it, like a DataFrame.
  3. We can ask the user to pass a String array by reference into the function, and we modify that String array.

Your suggestion is to overload read_csv using the first approach. I agree that this is the easiest one. There is a potential risk, though. See this example:

func a()->String{ return "a"}

func a()->Int{ return 0}

print(a())

This will not compile; we have to specify which a() we are using. Even "let b=a()" has to be changed to "let b: Int = a()". Do you think we should add this complexity?
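
For completeness, a small sketch of how the call site has to disambiguate overloads that differ only in return type. It reuses the a() functions from the example above and shows the two standard forms, a type annotation and an `as` coercion.

```swift
func a() -> String { return "a" }
func a() -> Int { return 0 }

// print(a()) is ambiguous; the caller must state the expected type.
let n: Int = a()        // a type annotation selects the Int overload
let s = a() as String   // an `as` coercion selects the String overload
print(n, s)
```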

arondes commented 8 years ago

BTW, I think it would be very easy to implement a more general read_table function that can read a text data file separated by any symbol, such as "\t", a space or ",". read_csv could then be defined as a special case: read_table(sep=',').
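
A minimal sketch of that idea, assuming the text has already been loaded and its line breaks normalized to "\n". The names readTable and readCSV are hypothetical, the functions take the file contents rather than a filename, and quoted fields are not handled.

```swift
// Hypothetical sketch, not swix API: split text into rows of fields on any separator.
func readTable(_ text: String, sep: Character) -> [[String]] {
    return text
        .split(separator: "\n")   // blank lines are dropped
        .map { line in
            // Keep empty fields so that missing values preserve column positions.
            line.split(separator: sep, omittingEmptySubsequences: false).map(String.init)
        }
}

// CSV as the special case with a comma separator.
func readCSV(_ text: String) -> [[String]] {
    return readTable(text, sep: ",")
}

let tabbed = readTable("1\t2\n3\t4", sep: "\t")
// tabbed == [["1", "2"], ["3", "4"]]
```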

stsievert commented 8 years ago

> However, sometimes user may find that first row contain missing value

Ah, got it. detectRow is the row used to determine whether each column is categorical. In that case, maybe rename it to completeDataRow or noMissingEnteriesInRow.

> We can construct an unified type and return, like DataFrame

Maybe that's what we should do. Define a CSV class and have the elements be csv.header and csv.data. I like that solution much more. We can have the function return a CSV class and add csv.header depending on the flag. That simplifies the interface to the options below:

// option 1
var x = read_csv(...).data

// option 2
var csv = read_csv(...)
var (x, header) = (csv.data, csv.header)

And then we can define write_csv to take a CSV class.
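
A rough sketch of what such a container could look like, under loud assumptions: [[Double]] stands in for swix's matrix type, parseCSV is a hypothetical function that takes the file contents rather than a filename, and non-numeric fields become NaN instead of being coded as categories.

```swift
// Hypothetical sketch, not the merged swix code: a container pairing header and data.
final class CSV {
    let header: [String]?   // nil when the file has no header row
    let data: [[Double]]    // stand-in for swix's matrix type

    init(header: [String]?, data: [[Double]]) {
        self.header = header
        self.data = data
    }
}

func parseCSV(_ text: String, hasHeader: Bool) -> CSV {
    var lines = text.split(separator: "\n").map(String.init)
    var header: [String]? = nil
    if hasHeader && !lines.isEmpty {
        header = lines.removeFirst().split(separator: ",").map(String.init)
    }
    // Fields that do not parse as numbers become NaN in this sketch.
    let data: [[Double]] = lines.map { line in
        line.split(separator: ",").map { Double($0) ?? Double.nan }
    }
    return CSV(header: header, data: data)
}

let csv = parseCSV("x,y\n1,2\n3,4", hasHeader: true)
let x = csv.data                               // option 1: data only
let (values, names) = (csv.data, csv.header)   // option 2: data and header
```

Returning one object keeps a single read function and avoids the return-type ambiguity discussed earlier in the thread.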

arondes commented 8 years ago

Added a csvFile class. read_csv now returns a csvFile object. write_csv has several versions. For compatibility I kept the original version of write_csv. If you think it is necessary to remove the original version, then the "savfig" function might need modification.

stsievert commented 8 years ago

LGTM. Tomorrow night I'll test it, merge and document the interface.

And good job on the interface; you can pass either a matrix or csvFile to write_csv.