Could we use optional args for most of these features? At first glance it seems like these could be integrated into the current read_csv with "newline=\n, header=True, categorical=True".
Yes, I think adding these options is very easy. Today I will make some changes. I am also considering that if we had something like a DataFrame in addition to matrix, things would feel more "natural". A DataFrame would be able to hold different types of data in different columns, and each column could have a name.
Hm, swix is designed to be a sharp tool that makes working with numerical values easy, and I think a pandas-like module is worthy of a separate repo. I'd definitely be inclined to provide options for categorical data, but I'd prefer to work within the existing framework (e.g., converting each category to a unique integer and returning which categories map to which integers).
Yes, I agree with that. Actually, creating a whole new DataFrame infrastructure seems daunting to me right now; I've only been learning Swift for a week :P I will continue to work on this project to see if I can add any value.
This seems like a useful pull request -- I'd like to see optional arguments added.
Update:
Now it takes three arguments: the file name, whether the first row should be included in the data, and which row is used to detect the data types.
I did not add a prefix argument because I leave the responsibility of providing the complete filename to the user.
Line breaks and categorical values are now handled automatically, so the user does not need to understand the technical details.
There are three types of line breaks I have encountered: \n, \r and \r\n. I unify them all to \n (see the sketch after this update).
By default, I use the first row of the data to automatically detect whether a column contains numeric or categorical data.
The "first row" can be automatically set no matter if a header is included in the file.
I leave this option to the user because sometimes the first row contains missing values, so people may want to use a different one.
All categorical variables will be translated into numeric codes.
To keep things rigorous, every column has its own coding. For example, column one may have "Red, Yellow, Blue" while column two may have "Yellow, Blue"; they will be coded as 0, 1, 2 and 0, 1 respectively, so the Yellow in the first column is treated as different from the Yellow in the second column.
There are some CSV files at https://support.spatialkey.com/spatialkey-sample-csv-data/ . I tested "Sample insurance portfolio", "Real estate transactions" and "Sales transactions". All work as expected.
I commented out the original version in case it is needed for debugging.
I also added a read_csv_header function. It can be used to get the header information (usually the variable names) of a CSV file.
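As an aside, here is a minimal sketch of the line-break unification mentioned above, assuming the whole file has already been read into a string (the variable names are mine, not the ones used in this pull request):

import Foundation

// unify \r\n and \r to \n before splitting into rows
let raw = "a,b\r\nc,d\re,f\n"
let unified = raw
    .stringByReplacingOccurrencesOfString("\r\n", withString: "\n")
    .stringByReplacingOccurrencesOfString("\r", withString: "\n")
let rows = unified.componentsSeparatedByString("\n")
print(rows)   // ["a,b", "c,d", "e,f", ""]

Replacing "\r\n" before "\r" matters; doing it in the other order would turn every "\r\n" into two line breaks and create empty rows.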
Also, I'd like to have read_csv return (optionally) two values: the header information and the array (as opposed to making another function, read_csv_header). I'm pretty sure you can define two functions of the same name that return different things. I would have
// version that also returns the header names
func read_csv(..., skip_header: Bool = false) -> (matrix, [String]) {
    var x: matrix = read_csv(..., skip_header: true)
    let names = ...   // read header out
    return (x, names)
}

// version that skips the header
func read_csv(..., skip_header: Bool = true) -> matrix {
    // read comma separated values
    return x
}
The read_csv can automatically detect whether a column in the data should be treated as numeric or categorical. To do this it inspects one particular row, which by default is the first row of data. For example, if it sees "F, 1.0, Green, 100" it will assume columns 0 and 2 are categorical while the others are numeric. However, sometimes the first row contains missing values, and then it is better to inspect a different row. That is the reason I provide this option.
Suppose we choose detectRow=1.
First we pick the row we are going to inspect and split it by ",":
let test = y[detectRow + startrow - 1].componentsSeparatedByString(",")
If startrow=1 (i.e. we have a header), then we pick y[1], which is actually the second row in the file. Since the file uses its first row to store the header, y[1] is the first row of data.
If startrow=0 (i.e. we have no header), then we pick y[0], which is again the first row of data.
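A quick check of that index arithmetic with detectRow = 1 (just the two cases above):

let detectRow = 1
// with a header: startrow = 1, so we test y[1], the first data row
print(detectRow + 1 - 1)   // prints 1
// without a header: startrow = 0, so we test y[0], again the first data row
print(detectRow + 0 - 1)   // prints 0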
Next, we create an array whose size equals the number of columns, with all entries initialized to -1.
categorical_col = Array(count: test.count, repeatedValue: -1)
We go through the row; if we find a column that cannot be converted to a Double, we mark the corresponding position of categorical_col as 0.
var columns = 0
for testtext in test {
    // a field that cannot be parsed as a Double marks its column as categorical
    if Double(testtext) == nil {
        categorical_col[columns] = 0
    }
    columns = columns + 1
}
This finishes the initial detection.
Example csv:
1,F,G,10
2,F,B,100
After initial detection, categorical_col is [-1,0,0,-1]
When we read the first row, we check all the columns that are categorical. Since it is our first time seeing "F" in column 1 and "G" in column 2, we assign "1F"=0 and "2G"=0, and we update categorical_col to [-1,1,1,-1].
When we read the second row, we have already seen F in column 1, so we still translate it as 0. But it is our first time seeing "B" in column 2, so we check categorical_col, find that the next code to use is 1, and assign "2B"=1. The code is:
if levels.contains(name) {
    // this level was already seen in this column: reuse its code
    array.append(Double(factor[name]!))
} else {
    // first time we see this level: assign the next available code for this column
    factor[name] = categorical_col[columns]
    levels.insert(name)
    categorical_col[columns] = categorical_col[columns] + 1
    array.append(Double(factor[name]!))
}
In the end, categorical_col will be something like [-1,-1,1,3,5,2,1,-1,-1]: -1 marks a numeric column, and a positive value indicates how many distinct levels we have seen in that column.
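To tie the two snippets above together, here is a minimal standalone sketch of the whole scheme, written against plain Swift arrays instead of swix's matrix; the function name encodeRows and its exact signature are mine, not part of this pull request:

// Sketch only: encodes rows that have already been split into string fields,
// using the initial detection plus per-column coding described above.
func encodeRows(rows: [[String]]) -> (data: [[Double]], categorical_col: [Int]) {
    let ncols = rows[0].count
    // -1 = numeric column; values >= 0 count the levels seen so far
    var categorical_col = Array(count: ncols, repeatedValue: -1)

    // initial detection on the first row
    for (i, field) in rows[0].enumerate() {
        if Double(field) == nil {
            categorical_col[i] = 0
        }
    }

    var factor = [String: Double]()   // e.g. "1F" -> 0.0
    var levels = Set<String>()
    var data = [[Double]]()

    for row in rows {
        var encoded = [Double]()
        for (i, field) in row.enumerate() {
            if categorical_col[i] == -1 {
                encoded.append(Double(field) ?? 0)   // numeric column
            } else {
                let name = "\(i)\(field)"            // column-specific key
                if !levels.contains(name) {
                    factor[name] = Double(categorical_col[i])
                    levels.insert(name)
                    categorical_col[i] = categorical_col[i] + 1
                }
                encoded.append(factor[name]!)
            }
        }
        data.append(encoded)
    }
    return (data, categorical_col)
}

On the example above, encodeRows([["1", "F", "G", "10"], ["2", "F", "B", "100"]]) returns the data [[1, 0, 0, 10], [2, 0, 1, 100]] and categorical_col = [-1, 1, 2, -1].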
As a statistician, most of the CSV files I encounter use row 0 to store the variable names, so I pay special attention to it. The current read_csv actually drops that information, but the user might still want it.
I have thought about several possible ways:
~ We can return multiple values, like a tuple (String array, matrix).
~ We can construct a unified type and return it, like a DataFrame.
~ We can ask the user to pass a String array into the function by reference, and we modify that String array.
Your suggestion is to overload read_csv, i.e. the first approach. Actually, I agree that this is the easiest one. There is a potential risk, though. See this example:
func a() -> String { return "a" }
func a() -> Int { return 0 }
print(a())   // error: ambiguous use of 'a()'
This will not compile; we have to specify which a() we are using. Even "let b = a()" has to be changed to "let b: Int = a()". Do you think we should add this complexity?
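For completeness, a small sketch of how the ambiguity is resolved; only the annotated bindings compile:

func a() -> String { return "a" }
func a() -> Int { return 0 }

let s: String = a()   // picks the String overload
let n: Int = a()      // picks the Int overload
// let b = a()        // error: ambiguous use of 'a()'
print(s, n)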
BTW, I think it is very easy to implement a more general read_table function, which could read a text data file separated by any symbol, such as "\t", a space, or ",". read_csv could then be defined as the special case read_table(sep: ",").
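A rough sketch of that idea, with a hypothetical signature; the parameter names, the helper name read_csv_sketch, and the [[String]] return type are mine, not swix's actual API, and a real implementation would go on to convert the fields to numbers as described above:

import Foundation

// Hypothetical generic reader: split a delimited text file into rows of fields.
func read_table(filename: String, sep: String, header: Bool = true) -> [[String]] {
    let raw = try! String(contentsOfFile: filename, encoding: NSUTF8StringEncoding)
    // unify the three kinds of line break to \n
    let text = raw
        .stringByReplacingOccurrencesOfString("\r\n", withString: "\n")
        .stringByReplacingOccurrencesOfString("\r", withString: "\n")
    var rows = text.componentsSeparatedByString("\n").filter { !$0.isEmpty }
    if header && !rows.isEmpty {
        rows.removeFirst()   // drop the variable-name row
    }
    return rows.map { $0.componentsSeparatedByString(sep) }
}

// the comma-separated special case
func read_csv_sketch(filename: String, header: Bool = true) -> [[String]] {
    return read_table(filename, sep: ",", header: header)
}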
"However, sometimes the first row contains missing values"

Ah, got it: detectRow is which row is used to determine whether each column is categorical or not. In that case, maybe rename it to completeDataRow or noMissingEnteriesInRow.
"We can construct a unified type and return it, like a DataFrame"

Maybe that's what we should do. Define a CSV class and have the elements be csv.header and csv.data. I like that solution much more. We can have the function return a CSV class and add csv.header depending on the flag. That simplifies the interface to the options below:
// option 1
var x = read_csv(...).data
// option 2
var csv = read_csv(...)
var (x, header) = (csv.data, csv.header)
And then we can define write_csv to take a CSV class.
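A minimal sketch of what that could look like; the field names header and data come from this discussion, while the class layout, the initializer, and the [[Double]] stand-in for swix's matrix type are my own assumptions:

import Foundation

// Sketch: a CSV container plus write_csv overloads that accept either form.
class csvFile {
    var header: [String]
    var data: [[Double]]   // stand-in for swix's matrix type

    init(header: [String], data: [[Double]]) {
        self.header = header
        self.data = data
    }
}

// plain version: write only the numeric values
func write_csv(x: [[Double]], filename: String) {
    let body = x.map { row in row.map { String($0) }.joinWithSeparator(",") }
                .joinWithSeparator("\n")
    _ = try? body.writeToFile(filename, atomically: true, encoding: NSUTF8StringEncoding)
}

// CSV-class version: write the header row first, then the values
func write_csv(csv: csvFile, filename: String) {
    let head = csv.header.joinWithSeparator(",")
    let body = csv.data.map { row in row.map { String($0) }.joinWithSeparator(",") }
                       .joinWithSeparator("\n")
    _ = try? (head + "\n" + body).writeToFile(filename, atomically: true, encoding: NSUTF8StringEncoding)
}

With two overloads like these, either a plain value matrix or a csvFile can be passed to write_csv, which matches the interface described below.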
Added a csvFile class. read_csv now returns a csvFile object. write_csv has several versions; for compatibility I still keep the original version of write_csv. If you think it is necessary to remove the original version, then the "savfig" function might need modification.
LGTM. Tomorrow night I'll test it, merge and document the interface.
And good job on the interface; you can pass either a matrix or a csvFile to write_csv.
The function name is read_csv2. It takes two arguments: the filename and whether it contains a header (i.e. whether the first row contains variable names rather than data). I made several changes: