h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

fread incorrect index column (row names) [new feature] #3453

Closed mcjmigdal closed 1 year ago

mcjmigdal commented 1 year ago
   | AAACCCAAGCGGGTAT-1_1  AAACCCAAGGTGATAT-1_1  AAACCCAAGTTTGAGA-1_1  …  TTTGTTGTCTACGCGG-1_4     C0
   | str32                                bool8                 bool8                    bool8  bool8
-- + --------------------  --------------------  --------------------     --------------------  -----
 0 | gene1                                    0                     0  …                     0      0
 1 | gene2                                    0                     0  …                     0      0
 2 | gene3                                    0                     0  …                     0      0
 3 | gene4                                    0                     0  …                     0      0
[4 rows x 34677 columns]

oleksiyskononenko commented 1 year ago

When we autodetect a header row, all the values from that row become column names. Your file has no “index” value in the header row, so datatable can’t get it from anywhere else.

If you want “index” to become the first column name, the header row of your file needs to start with an “index” value.
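
As a minimal sketch of that workaround, the header can be patched before the text is handed to fread. The helper name `add_index_header` and its `name` parameter are hypothetical (they are not part of datatable), and the sketch assumes a single-character separator and a header that sits on the first line.

```python
import csv

def add_index_header(text, sep, name="index"):
    """Prepend `name` to the header when it is one field shorter than the first data row.

    Hypothetical helper for illustration; not part of datatable.
    """
    lines = text.split("\n")
    if len(lines) < 2:
        return text  # no data row to compare against
    header = next(csv.reader([lines[0]], delimiter=sep))
    first_row = next(csv.reader([lines[1]], delimiter=sep))
    if len(header) + 1 == len(first_row):
        # The header is missing exactly one name: assume it is the row-name column.
        lines[0] = name + sep + lines[0]
    return "\n".join(lines)

data = "A B C\ngene1 1 2 3\ngene2 3 2 1"
fixed = add_index_header(data.replace(" ", "\t"), sep="\t")
print(fixed.splitlines()[0].split("\t"))  # ['index', 'A', 'B', 'C']
```

The patched text could then be read with `dt.fread(fixed, sep="\t")`, so that the first column name comes from the header itself rather than from row-name detection.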

oleksiyskononenko commented 1 year ago

I don’t think dt ever claimed to support row names, so this issue should be marked as a feature request rather than a bug. It is certainly possible to do what R’s data.table does.

mcjmigdal commented 1 year ago

Thank you for the quick response. I've changed "bug" in the title to "feature request". I thought it was a bug because of what I observed while playing with a smaller dataset. I tried to recreate my experiments, and it seems the behavior differs depending on the column separator.

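import datatable as dt

# The header row has three names, but each data row has four fields.
data = "A B C\ngene1 1 2 3\ngene2 3 2 1"
# Space-separated: the extra leading field is treated as row names ("index").
dt.fread(data, sep=" ")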
#    | index      A      B      C
#    | str32  int32  int32  int32
# -- + -----  -----  -----  -----
#  0 | gene1      1      2      3
#  1 | gene2      3      2      1
# [2 rows x 4 columns]
# Tab-separated: the same data yields a different header (A, B, C, C0).
dt.fread(data.replace(" ", "\t"), sep="\t")
#    | A          B      C     C0
#    | str32  int32  int32  int32
# -- + -----  -----  -----  -----
#  0 | gene1      1      2      3
#  1 | gene2      3      2      1
# [2 rows x 4 columns]

oleksiyskononenko commented 1 year ago

Hm, that’s interesting. I will look into it. We should indeed either document this behavior or not support it.

oleksiyskononenko commented 1 year ago

So the issue was not the number of columns, but the separator. For some reason, we only supported the space separator when row names were detected. Once #3455 is merged, you will be able to fread lotsofcolumns.csv with no issues.
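
For readers who want to confirm that the two small inputs from the example differ only in their separator, a standard-library check of the field counts per line is enough. This is an independent sanity check under the assumption of a single-character separator; it is not how datatable detects row names, and `field_counts` is a hypothetical helper name.

```python
import csv

def field_counts(text, sep):
    """Return the number of fields on each line of `text` when split on `sep`."""
    return [len(row) for row in csv.reader(text.splitlines(), delimiter=sep)]

data = "A B C\ngene1 1 2 3\ngene2 3 2 1"
space_counts = field_counts(data, " ")
tab_counts = field_counts(data.replace(" ", "\t"), "\t")

# Both variants have a 3-field header followed by 4-field data rows.
print(space_counts, tab_counts)  # [3, 4, 4] [3, 4, 4]
```

Since the shapes are identical, the different column names in the two fread results come from how the separator is handled, not from the data.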

mcjmigdal commented 1 year ago

Cool, thanks!