seasonedfish closed this issue 2 years ago
pl.Utf8 is the same as str.
You have a typo in your argument. It should be dtypes instead of dytes.
# dtypes are applied before the column selection, so pl.Date is applied to the third column of the file ("number") instead of the one you want.
In [8]: patent_df = pl.read_csv(
...: file="test_override_dtypes.tsv",
...: sep="\t",
...: columns=[0, 2, 4],
...: dtypes=[
...: pl.Utf8,
...: pl.Int32,
...: pl.Date
...: ]
...: )
...:
...: print(patent_df)
shape: (9, 3)
┌──────────┬────────┬────────────┐
│ id ┆ number ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ str │
╞══════════╪════════╪════════════╡
│ 10000000 ┆ null ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000001 ┆ null ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000002 ┆ null ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000003 ┆ null ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000005 ┆ null ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000006 ┆ null ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000007 ┆ null ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ D242583 ┆ null ┆ 2018-06-19 │
└──────────┴────────┴────────────┘
# Use a dictionary.
In [9]: patent_df = pl.read_csv(
...: file="test_override_dtypes.tsv",
...: sep="\t",
...: columns=[0, 2, 4],
...: dtypes={
...: "id": pl.Utf8,
...: "number": pl.Int32,
...: "date": pl.Date
...: }
...: )
...:
...: print(patent_df)
shape: (9, 3)
┌──────────┬──────────┬────────────┐
│ id ┆ number ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ date │
╞══════════╪══════════╪════════════╡
│ 10000000 ┆ 10000000 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000001 ┆ 10000001 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000002 ┆ 10000002 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000003 ┆ 10000003 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000005 ┆ 10000005 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000006 ┆ 10000006 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000007 ┆ 10000007 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ D242583 ┆ 10000008 ┆ 2018-06-19 │
└──────────┴──────────┴────────────┘
@ritchie46 pl.Date doesn't seem to work when provided in a dtypes list, but it works when using a dtypes dict.
In [25]: patent_df = pl.read_csv(
...: file="test_override_dtypes.tsv",
...: sep="\t",
...: columns=[0, 2, 4],
...: dtypes=[
...: pl.Utf8,
...: pl.Utf8,
...: pl.Int32,
...: pl.Utf8,
...: pl.Date,
...: ]
...: )
...:
...: print(patent_df)
shape: (9, 3)
┌──────────┬──────────┬──────┐
│ id ┆ number ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ date │
╞══════════╪══════════╪══════╡
│ 10000000 ┆ 10000000 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000001 ┆ 10000001 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000002 ┆ 10000002 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000003 ┆ 10000003 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000005 ┆ 10000005 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000006 ┆ 10000006 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000007 ┆ 10000007 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ D242583 ┆ 10000008 ┆ null │
└──────────┴──────────┴──────┘
Could it be that we first assign column types and then do the projection?
Yes, that is what currently happens.
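To make the ordering concrete, here is a toy model in plain Python. It is not how polars is implemented, just an illustration of the two orders. Strings stand in for the polars dtypes, and "col_1" and "col_3" are placeholder names for the two columns the thread does not show.

```python
# Toy model only: plain strings stand in for polars dtypes.
# "col_1" and "col_3" are placeholder names; the thread does not show them.
HEADER = ["id", "col_1", "number", "col_3", "date"]


def assign_then_project(header, columns, dtypes):
    """Pair dtypes with the file's columns by position, then select columns."""
    by_position = dict(zip(header, dtypes))  # columns past the list get no override
    return {header[i]: by_position.get(header[i]) for i in columns}


def project_then_assign(header, columns, dtypes):
    """Select columns first, then pair dtypes with the selection by position."""
    selected = [header[i] for i in columns]
    return dict(zip(selected, dtypes))


columns = [0, 2, 4]
dtypes = ["Utf8", "Int32", "Date"]

before = assign_then_project(HEADER, columns, dtypes)
after = project_then_assign(HEADER, columns, dtypes)
print(before)  # {'id': 'Utf8', 'number': 'Date', 'date': None}
print(after)   # {'id': 'Utf8', 'number': 'Int32', 'date': 'Date'}
```

The first result matches the original report: the Date override lands on "number", and "date" gets no override at all, so it falls back to whatever is inferred.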
It is now partially fixed (except for pl.Date with column indices):
In [3]: patent_df = pl.read_csv(
...: file="test_override_dtypes.tsv",
...: sep="\t",
...: columns=[0, 2, 4],
...: dtypes=[
...: pl.Utf8,
...: pl.Int32,
...: pl.Date
...: ]
...: )
...:
...: print(patent_df)
shape: (9, 3)
┌──────────┬──────────┬──────┐
│ id ┆ number ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ date │
╞══════════╪══════════╪══════╡
│ 10000000 ┆ 10000000 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000001 ┆ 10000001 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000002 ┆ 10000002 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000003 ┆ 10000003 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000005 ┆ 10000005 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000006 ┆ 10000006 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000007 ┆ 10000007 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ D242583 ┆ 10000008 ┆ null │
└──────────┴──────────┴──────┘
In [4]: patent_df = pl.read_csv(
...: file="test_override_dtypes.tsv",
...: sep="\t",
...: columns=["id", "number", "date"],
...: dtypes=[
...: pl.Utf8,
...: pl.Int32,
...: pl.Date
...: ]
...: )
...:
...: print(patent_df)
shape: (9, 3)
┌──────────┬──────────┬────────────┐
│ id ┆ number ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ date │
╞══════════╪══════════╪════════════╡
│ 10000000 ┆ 10000000 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000001 ┆ 10000001 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000002 ┆ 10000002 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000003 ┆ 10000003 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000005 ┆ 10000005 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000006 ┆ 10000006 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000007 ┆ 10000007 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ D242583 ┆ 10000008 ┆ 2018-06-19 │
└──────────┴──────────┴────────────┘
You have a typo in your argument. It should be dtypes instead of dytes.
pl.Date doesn't seem to work, when provided as a dtypes list, but works when using a dtypes dict.
Ah, I see. I originally used a dict, but I guess that didn't work because I had the typo 🤦♂️
I'm glad we were able to spot and fix the list issue from this though. Thank you for your trouble!
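For context on why the typo produced no error: in Python, a misspelled keyword only passes silently when the function accepts arbitrary keyword arguments. Whether read_csv did so in 0.13.51 is an assumption here; the thread does not say. The two functions below are hypothetical stand-ins, not polars code.

```python
def load_loose(path, dtypes=None, **kwargs):
    """Hypothetical reader: unknown keywords land in kwargs and are ignored."""
    return {"path": path, "dtypes": dtypes}


def load_strict(path, dtypes=None):
    """Hypothetical reader: unknown keywords raise TypeError."""
    return {"path": path, "dtypes": dtypes}


# The misspelled keyword is swallowed, so dtypes keeps its default.
loose = load_loose("data.tsv", dytes=["Utf8"])
print(loose["dtypes"])  # None

# Without **kwargs, the same typo fails immediately.
try:
    load_strict("data.tsv", dytes=["Utf8"])
    strict_error = ""
except TypeError as err:
    strict_error = str(err)
print(strict_error)
```

A signature without a catch-all surfaces this kind of mistake at the call site instead of returning a frame with inferred dtypes.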
What language are you using?
Python
Have you tried the latest version of polars?
What version of polars are you using?
0.13.51
What operating system are you using polars on?
macOS 12.4 (M1, 2020)
What language version are you using?
Python 3.10
Describe your bug.
Sorry for the newbie problem, but I can't seem to get the dtypes parameter to work for read_csv. The output DataFrame doesn't have the dtypes I specified.
I saw https://github.com/pola-rs/polars/issues/1492, so I passed a list to dtypes, to no avail.
What are the steps to reproduce the behavior?
short.tsv
__main__.py
What is the actual behavior?
What is the expected behavior?
id should be of type pl.Utf8, and date of type pl.Date.