fread incorrect index column (row names) [new feature]

mcjmigdal commented 1 year ago

Did you find a bug in datatable, or maybe the bug found you? When reading a file with a large number of columns (~35,000) that also stores row names, the index column is misinterpreted as the first column and the last column gets a generic name (C0).

How to reproduce the bug? lotsofcolumns.csv

df = dt.fread("lotsofcolumns.csv", sep="\t")  # GitHub won't allow a .tsv file

   | AAACCCAAGCGGGTAT-1_1  AAACCCAAGGTGATAT-1_1  AAACCCAAGTTTGAGA-1_1  …  TTTGTTGTCTACGCGG-1_4     C0
   | str32                                bool8                 bool8                    bool8  bool8
-- + --------------------  --------------------  --------------------     --------------------  -----
 0 | gene1                                    0                     0  …                     0      0
 1 | gene2                                    0                     0  …                     0      0
 2 | gene3                                    0                     0  …                     0      0
 3 | gene4                                    0                     0  …                     0      0
[4 rows x 34677 columns]

What was the expected behavior? The first column in the data frame should be called Index, and the last column should be named using the last value from the header.

At least, this is what happens in R. I'm not sure whether this should happen automatically, but there should at least be an argument to force this behavior.

Warning message:
In data.table::fread("integrated_all_sct_counts.tsv") :
Detected 34676 column names but the data has 34677 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Your environment? Debian GNU/Linux 10 (buster), Python 3.8.2, datatable 1.0.0
Thank you for contributing, and sorry for the inconvenience. Thanks for providing this package! Sorry if I missed something obvious :)

oleksiyskononenko commented 1 year ago

When we autodetect a header row, all the values from that row become column names. Your file has no “index” value in the header row, so datatable has nowhere else to get it from.

If you want “index” to become the first column name, the header row of your file needs to start with that value.
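As a caller-side workaround, the header can be patched before the text reaches fread. This is only a minimal sketch: it assumes the whole file fits in memory and that no quoted field contains the separator, and `add_index_header` is a hypothetical helper, not part of datatable.

```python
def add_index_header(text, sep="\t", name="index"):
    """Prepend `name` to the header when it is one field shorter than the data.

    Hypothetical helper, not a datatable API. It splits naively on `sep`,
    so it assumes no quoted fields that contain the separator.
    """
    lines = text.split("\n")
    if len(lines) < 2:
        return text  # no data row to compare the header against
    n_header = len(lines[0].split(sep))
    n_data = len(lines[1].split(sep))
    if n_data == n_header + 1:
        # The header is missing exactly one name: assume it is the row-name column.
        lines[0] = name + sep + lines[0]
    return "\n".join(lines)


text = "A\tB\tC\ngene1\t1\t2\t3\ngene2\t3\t2\t1"
fixed = add_index_header(text)
```

The patched text can then be passed on, for example as `dt.fread(fixed, sep="\t")`. Text whose header already matches the data width is returned unchanged.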

oleksiyskononenko commented 1 year ago

I don’t think dt ever claimed to support row names, so this issue should be marked as a feature request and not a bug. It is certainly possible to do what R data.table is doing.

mcjmigdal commented 1 year ago

Thank you for the quick response. I've changed "bug" in the title to "feature request". I thought it was a bug because of what I observed while playing with a smaller dataset. I tried to recreate my experiments, and it seems the behavior differs depending on the column separator.

import datatable as dt

data = "A B C\ngene1 1 2 3\ngene2 3 2 1"
dt.fread(data, sep=" ")
#    | index      A      B      C
#    | str32  int32  int32  int32
# -- + -----  -----  -----  -----
#  0 | gene1      1      2      3
#  1 | gene2      3      2      1
# [2 rows x 4 columns]
dt.fread(data.replace(" ", "\t"), sep="\t")
#    | A          B      C     C0
#    | str32  int32  int32  int32
# -- + -----  -----  -----  -----
#  0 | gene1      1      2      3
#  1 | gene2      3      2      1
# [2 rows x 4 columns]
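
The misaligned names in the tab-separated result can also be repaired after reading. This is a sketch that assumes the trailing auto-generated name (here `C0`) is the only placeholder and that the first column really holds row names; `shift_names` is a hypothetical helper, not a datatable function.

```python
def shift_names(names, index_name="index"):
    """Shift every name one column to the right and label the first column.

    Hypothetical helper: drops the last (auto-generated) name and
    prepends `index_name`.
    """
    names = tuple(names)
    return (index_name,) + names[:-1]


repaired = shift_names(("A", "B", "C", "C0"))
```

With a Frame `df`, the result can be assigned back via `df.names = shift_names(df.names)`.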

oleksiyskononenko commented 1 year ago

Hm, that’s interesting. I will look into it. We should indeed either document this behavior or stop supporting it.

oleksiyskononenko commented 1 year ago

So the issue was not the number of columns, but the separator. For some reason, we only supported the space separator when detecting row names. Once #3455 is merged, you can fread lotsofcolumns.csv with no issues.

mcjmigdal commented 1 year ago

Cool, thanks!

h2oai / datatable

fread incorrect index column (row names) [new feature] #3453