hosseinmoein / DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
https://hosseinmoein.github.io/DataFrame/
BSD 3-Clause "New" or "Revised" License
2.46k stars 310 forks

[Feature Request] Automatic columns detection for read(). #180

Closed kilasuelika closed 2 years ago

kilasuelika commented 2 years ago

I suggest providing a new normal_csv format in which column types are detected automatically, using a two-pass procedure: the first pass reads some initial rows to decide each column's type, and the second pass reads the values.

The current csv and csv2 formats are rarely used and inconvenient.
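
The two-pass idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not DataFrame's API: the names `ColType`, `Column`, `classify`, `split_line`, `infer_types`, and `read_csv` are all hypothetical, the input is assumed to be a seekable stream with a header row, and fields are split on plain commas with no quoting support.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// All names here are hypothetical; this is not DataFrame's API.
enum class ColType { Int, Double, String };  // ordered from narrowest to widest

using Column = std::variant<std::vector<long long>,
                            std::vector<double>,
                            std::vector<std::string>>;

// Narrowest type that parses the whole cell; empty cells count as String.
inline ColType classify(const std::string &cell) {
    char *end = nullptr;
    errno = 0;
    std::strtoll(cell.c_str(), &end, 10);
    if (errno == 0 && end != cell.c_str() && *end == '\0') return ColType::Int;
    errno = 0;
    std::strtod(cell.c_str(), &end);
    if (errno == 0 && end != cell.c_str() && *end == '\0') return ColType::Double;
    return ColType::String;
}

// Naive split on commas: no quoting or escaping is handled.
inline std::vector<std::string> split_line(const std::string &line) {
    std::vector<std::string> cells;
    std::istringstream ss(line);
    for (std::string cell; std::getline(ss, cell, ','); ) cells.push_back(cell);
    return cells;
}

// Pass 1: widen each column's type over the first sample_rows data rows.
inline std::vector<ColType> infer_types(std::istream &in, std::size_t sample_rows) {
    std::string line;
    if (!std::getline(in, line)) return {};  // header row gives the column count
    std::vector<ColType> types(split_line(line).size(), ColType::Int);
    for (std::size_t r = 0; r < sample_rows && std::getline(in, line); ++r) {
        const auto cells = split_line(line);
        for (std::size_t c = 0; c < types.size() && c < cells.size(); ++c)
            types[c] = std::max(types[c], classify(cells[c]));
    }
    return types;
}

// Pass 2: rewind and parse every row with the inferred types.
// Throws if a cell (including a missing one) does not fit the sampled type.
inline std::vector<Column> read_csv(std::istream &in, std::size_t sample_rows) {
    const auto types = infer_types(in, sample_rows);
    in.clear();   // reset eof/fail bits left by pass 1
    in.seekg(0);  // requires a seekable stream
    std::vector<Column> cols;
    for (ColType t : types) {
        if (t == ColType::Int) cols.emplace_back(std::vector<long long>{});
        else if (t == ColType::Double) cols.emplace_back(std::vector<double>{});
        else cols.emplace_back(std::vector<std::string>{});
    }
    assert(cols.size() == types.size());
    std::string line;
    std::getline(in, line);  // skip the header
    while (std::getline(in, line)) {
        const auto cells = split_line(line);
        for (std::size_t c = 0; c < cols.size(); ++c) {
            const std::string cell = c < cells.size() ? cells[c] : std::string();
            if (classify(cell) > types[c])
                throw std::runtime_error("cell '" + cell + "' does not fit inferred type");
            if (types[c] == ColType::Int)
                std::get<0>(cols[c]).push_back(std::stoll(cell));
            else if (types[c] == ColType::Double)
                std::get<1>(cols[c]).push_back(std::stod(cell));
            else
                std::get<2>(cols[c]).push_back(cell);
        }
    }
    return cols;
}
```

With a sample that covers the rows `1,2,a` and `2,3.5,b`, the second column is widened from Int to Double and the third stays String. The sketch chooses to fail loudly when a row beyond the sample does not fit the inferred type, rather than silently truncating or re-reading; other policies are possible.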

kilasuelika commented 2 years ago

Deciding column types for string, int, and double is easy, but for datetime it will be very difficult.

hosseinmoein commented 2 years ago

I hear what you are saying. There are definitely advantages to it. The only thing is that the implementation could be very bug-prone. I have to think about it.

BTW, I know of people who use both the csv and csv2 formats. Personally, I like csv better. It is more compact, it is columnar, and reads and writes are more efficient.

hosseinmoein commented 2 years ago

Even auto-detecting between int and double is not that simple. Imagine that the first few items of a double column happen to be integers. Or the data is integer by nature, but the user wants a double column for calculation purposes. For example, the number of shares traded is an integer by nature, but it is most often stored as a floating-point value because the column is involved in calculations and comparisons with other floating-point figures.
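
Both problems raised above can be shown with a small sketch. It is illustrative only: `NumType`, `is_integer`, `guess_from_sample`, and `resolve` are hypothetical names, not part of DataFrame, and the override mechanism is just one possible way to let the user's intent win over the guess.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>
#include <vector>

// Hypothetical names; not part of DataFrame.
enum class NumType { Int, Double };

// True if the whole cell parses as a base-10 integer (overflow is not checked).
inline bool is_integer(const std::string &cell) {
    char *end = nullptr;
    std::strtoll(cell.c_str(), &end, 10);
    return end != cell.c_str() && *end == '\0';
}

// Guess a numeric column's type from its first `sample` cells only.
inline NumType guess_from_sample(const std::vector<std::string> &col,
                                 std::size_t sample) {
    assert(sample > 0);  // an empty sample would vacuously report Int
    for (std::size_t i = 0; i < sample && i < col.size(); ++i)
        if (!is_integer(col[i])) return NumType::Double;
    return NumType::Int;
}

// A user-supplied hint, when present, always wins over the guess.
inline NumType resolve(NumType guessed, std::optional<NumType> hint) {
    return hint.value_or(guessed);
}
```

For a column holding `10, 11, 12, 12.5`, sampling the first three cells reports Int, while scanning all four reports Double. For a shares-traded column that really is all integers, only an explicit hint such as `resolve(guessed, NumType::Double)` can express that the user wants doubles; no amount of sampling recovers that intent.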

hosseinmoein commented 2 years ago

When you write the data in a DataFrame using the write() function, it stores all the info it needs to read the data back. The same is true of to_string() and from_string().

kilasuelika commented 2 years ago

After some research, I decided to write my own DataFrame library. You can check it out at DataFrameCpp.