Closed: mcjmigdal closed this issue 1 year ago
When we autodetect a header row, all the values from that row become column names. Your file has no “index” value in the header row, so datatable can’t get it from anywhere else.
Your file needs to start with an “index” value if you want it to become the first column name.
I don’t think dt ever claimed to support row names, so this issue should be marked as a feature request and not a bug. It is certainly possible to do what R data.table is doing.
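As an illustration of the point about the header row, here is a minimal sketch that adds the missing “index” label before the text reaches fread. The helper name `add_index_header` and its parameters are hypothetical and not part of datatable. It assumes a plain separator split with no quoted fields, and it treats a header that is exactly one field shorter than the first data row as a sign of row names.

```python
def add_index_header(text, sep="\t", index_name="index"):
    """Prepend `index_name` to the header when it is one field short.

    Hypothetical helper for illustration only; not a datatable function.
    Assumes fields never contain the separator (no quoting support).
    """
    lines = text.split("\n")
    if len(lines) < 2:
        return text  # no data row to compare the header against
    header_fields = lines[0].split(sep)
    first_row_fields = lines[1].split(sep)
    # Row names are assumed when the header has exactly one field fewer
    # than the first data row.
    if len(header_fields) + 1 == len(first_row_fields):
        lines[0] = sep.join([index_name] + header_fields)
    return "\n".join(lines)


data = "A\tB\tC\ngene1\t1\t2\t3\ngene2\t3\t2\t1"
fixed = add_index_header(data)
print(fixed.split("\n")[0])
```

The corrected text could then be passed to `dt.fread(fixed, sep="\t")`, so that the header row and the data rows have the same number of fields.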
Thank you for the quick response. I've changed "bug" in the title to "feature request". I thought it was a bug because of what I observed while playing with a smaller dataset. I tried to recreate my experiments, and it seems the behavior differs depending on the column separator.
import datatable as dt
data = "A B C\ngene1 1 2 3\ngene2 3 2 1"
dt.fread(data, sep=" ")
# | index A B C
# | str32 int32 int32 int32
# -- + ----- ----- ----- -----
# 0 | gene1 1 2 3
# 1 | gene2 3 2 1
# [2 rows x 4 columns]
dt.fread(data.replace(" ", "\t"), sep="\t")
# | A B C C0
# | str32 int32 int32 int32
# -- + ----- ----- ----- -----
# 0 | gene1 1 2 3
# 1 | gene2 3 2 1
# [2 rows x 4 columns]
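For the tab-separated result above, one possible workaround is to shift the column names back by one position after reading. This is only a sketch: `shift_names` is a hypothetical helper, not a datatable function, and it assumes the auto-generated name of the last column is exactly `C0`, as in the output shown in this thread.

```python
def shift_names(names, index_name="index", placeholder="C0"):
    """Shift column names right by one when the last name is a placeholder.

    Hypothetical helper for illustration only. `placeholder` is the
    auto-generated name observed in this thread ("C0"); if the last name
    is anything else, the names are returned unchanged.
    """
    names = list(names)
    if not names or names[-1] != placeholder:
        return names
    # Drop the placeholder and put the row-name label in front.
    return [index_name] + names[:-1]


observed = ["A", "B", "C", "C0"]  # names from the tab-separated read
print(shift_names(observed))
```

With a frame `DT` read as above, something like `DT.names = shift_names(DT.names)` should apply the corrected names, assuming that assigning to `names` is supported in your datatable version.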
Hm, that’s interesting. I will look into it. We should indeed either document this behavior or not support it.
So the issue was not the number of columns, but the separator. For some reason, we only supported the space separator when detecting row names. Once #3455 is merged, you can fread lotsofcolumns.csv with no issues.
Cool thanks!
Did you find a bug in datatable, or maybe the bug found you?
When reading a file with a large number of columns (~35,000) that also stores row names, the index column is misinterpreted as the first column and the last column gets a generic name (C0).
How to reproduce the bug?
lotsofcolumns.csv
df = dt.fread("lotsofcolumns.csv", sep="\t")
# GitHub won't allow a .tsv file
What was the expected behavior?
The first column in the data frame should be called Index, and the last column should be named using the last value from the header.
At least, this is what happens in R. I'm not sure whether this should happen automatically, but there should at least be an argument to force this behavior.
Your environment?
Debian GNU/Linux 10 (buster)
Python 3.8.2
datatable 1.0.0
Thank you for contributing, and sorry for the inconvenience. Thanks for providing this package! Sorry if I missed something obvious :)