pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

read_csv dtypes has no effect #3891

Closed seasonedfish closed 2 years ago

seasonedfish commented 2 years ago

What language are you using?

Python

Have you tried latest version of polars?

What version of polars are you using?

0.13.51

What operating system are you using polars on?

macOS 12.4 (M1, 2020)

What language version are you using?

Python 3.10

Describe your bug.

Sorry for the newbie problem, but I can't seem to get the dtypes parameter to work for read_csv. The output DataFrame doesn't have the dtypes I specified.

I saw https://github.com/pola-rs/polars/issues/1492, so I passed a list to dtypes, but to no avail.

What are the steps to reproduce the behavior?

short.tsv

"id"    "type"  "number"    "country"   "date"  "abstract"  "title" "kind"  "num_claims"    "filename"  "withdrawn"
"10000000"  "utility"   "10000000"  "US"    "2018-06-19"    "A frequency modulated (coherent) laser detection and ranging system includes a read-out integrated circuit formed with a two-dimensional array of detector elements each including a photosensitive region receiving both return light reflected from a target and light from a local oscillator, and local processing circuitry sampling the output of the photosensitive region four times during each sample period clock cycle to obtain quadrature components. A data bus coupled to one or more outputs of each of the detector elements receives the quadrature components from each of the detector elements for each sample period and serializes the received quadrature components. A processor coupled to the data bus receives the serialized quadrature components and determines an amplitude and a phase for at least one interfering frequency corresponding to interference between the return light and the local oscillator light using the quadrature components."    "Coherent LADAR using intra-pixel quadrature detection" "B2"    20  "ipg180619.xml" 0
"10000001"  "utility"   "10000001"  "US"    "2018-06-19"    "The injection molding machine includes a fixed platen, a moveable platen moving forward and backward by a toggle link, a base plate supporting the toggle link, a driving part for mold clamping to operate the toggle link, a driving part for mold thickness adjustment to adjust a mold thickness, and a control unit to calculate a movement distance gap before a clamping process by controlling the driving part for mold thickness adjustment to move the base plate backward and then move the base plate forward to a target movement position based on a fold amount of the toggle link, and control the driving part for mold thickness adjustment using a value obtained by deducting the movement distance gap from the fold amount of the toggle link when producing a clamp force."    "Injection molding machine and mold thickness control method"   "B2"    12  "ipg180619.xml" 0
"10000002"  "utility"   "10000002"  "US"    "2018-06-19"    "The present invention relates to: a method for manufacturing a polymer film, the method including a base film forming step for co-extruding a first resin containing a polyamide-based resin and a second resin containing a copolymer including polyamide-based segments and polyether-based segments; a co-extruded film including a base film including a first resin layer containing a polyamide-based resin, and a second resin layer containing a copolymer having polyamide-based segments and polyether-based segments; to a co-extruded film including a base film including a first resin layer and a second resin layer, which have different melting points; and to a method for manufacturing a polymer film, the method including a base film forming step including a step of co-extruding a first resin and a second resin, which have different melting points." "Method for manufacturing polymer film and co-extruded film"    "B2"    9   "ipg180619.xml" 0
"10000003"  "utility"   "10000003"  "US"    "2018-06-19"    "The invention relates to a method for producing a container (2) from a thermoplastic, having at least one surround (4), provided in the container wall (1), for a container opening. The surround (4) comprises a structure behind which parts of the container wall (1) extend and/or which is penetrated by said parts. The method is carried out using a multi-part blow mold that has at least two mold parts, each having at least one cavity, wherein the surround is placed as an insert in the cavity (10) of the blow mold (7). The method comprises pressing the preform that has been forced into the cavity (10) into the structure of the surround (4) by means of a tool which is brought to bear on the preform (12) on the side of the preform facing away from the cavity (10)."  "Method for producing a container from a thermoplastic" "B2"    18  "ipg180619.xml" 0
"10000004"  "utility"   "10000004"  "US"    "2018-06-19"    "The present invention relates to provides a double-oriented film, co-extrude, and of low thickness, with a layered composition that gives the property of being of high barrier to gases and manufactured by the process of co-extrusion of 3 bubbles, which gives the property of when being thermoformed, ensure the distribution of uniform thickness in the walls, base, folds, and corners of the formed tray saving a minimum of 50% of plastic without diminishing its gas barrier and its resistance to puncture." "Process of obtaining a double-oriented film, co-extruded, and of low thickness made by a three bubble process that at the time of being thermoformed provides a uniform thickness in the produced tray"    "B2"    6   "ipg180619.xml" 0
"10000005"  "utility"   "10000005"  "US"    "2018-06-19"    "A vacuum forming apparatus is provided that forms an article having a covering bonded to the surface of a substrate in a molding space using a first mold and a second mold. The vacuum forming apparatus is provided with clamps for grasping the covering between the first and second molds arranged at the open positions. The clamps are movable between an interfering position, at which the clamps are located in the movement ranges of the first and second molds, and standby positions, at which the clamps are outside the movement ranges. After the covering is heated, the clamps grasping the covering move to the standby positions and stretch the covering. The first and second molds move to the closed positions and the article is molded between the first and second molds so that the stretched covering and the substrate are bonded to each other."   "Article vacuum formation method and vacuum forming apparatus"  "B2"    4   "ipg180619.xml" 0
"10000006"  "utility"   "10000006"  "US"    "2018-06-19"    "A thermoforming mold device (1) providing a piece with a thin wall starting with a sheet of thermoplastic material is provided. At least one (3) of two parts of the mold (3, 3′) comprises at least one means (4) of local deformation of a sheet (2′) in the mold (3, 3′) in its closed state, the at least one means (4) comprises a piece of hollow molding with a peripheral edge, which can be connected selectively to a source of suction and can be displaced between a folded position, in which the molding piece is situated in close proximity with the wall of the thermoformed piece, and a deployed position, in which the molding piece is applied under pressure with its peripheral edge against the wall of the thermoformed piece upholding the other part of the mold."  "Thermoforming mold device and a process for its manufacture and use"   "B2"    8   "ipg180619.xml" 0
"10000007"  "utility"   "10000007"  "US"    "2018-06-19"    "An expanding tool comprising: an actuator comprising a cylindrical housing that defines an actuator housing cavity; a primary ram disposed within the actuator housing cavity, the primary ram defining an internal primary ram cavity; a secondary ram disposed within the internal primary ram cavity; a cam roller carrier coupled to a distal end of the secondary ram; a drive collar positioned within a distal end of the actuator housing cavity; a roller clutch disposed within an internal cavity defined by an inner surface of the drive collar; a shuttle cam positioned between the roller clutch and a distal end of the primary ram; an expander cone coupled to the primary ram; and an expander head operably coupled to the drive collar." "PEX expanding tool"    "B2"    24  "ipg180619.xml" 0
"D242583"   "utility"   "10000008"  "US"    "2018-06-19"    "A decorated strip of coated, heat-shrinkable, plastic sheet material is placed in a spiral slot formed in a silicone rubber mold. The spiral slot is defined by a spiral wall having a uniform wall thickness. Upon heating in an oven, the material shrinks, forming a resiliently expansible arc-shaped band that can be worn as a bracelet or wristband."   "Bracelet mold and method of use"   "B2"    11  "ipg180619.xml" 0

__main__.py

import polars as pl

patent_df = pl.read_csv(
    file="short.tsv",
    sep="\t",
    columns=[0, 2, 4],
    dypes=[
        pl.Utf8,
        pl.Int64,
        pl.Date
    ]
)

print(patent_df)

What is the actual behavior?

┌──────────┬──────────┬────────────┐
│ id       ┆ number   ┆ date       │
│ ---      ┆ ---      ┆ ---        │
│ str      ┆ i64      ┆ str        │
╞══════════╪══════════╪════════════╡
...

What is the expected behavior?

id should be of type pl.Utf8, and date of type pl.Date.

ghuls commented 2 years ago

pl.Utf8 is the same as str.

You have a typo in your argument. It should be dtypes instead of dypes.

# dtypes are applied before the column selection, so pl.Date lands on the third column of the file ("number") instead of the one you want.
In [8]: patent_df = pl.read_csv(
   ...:     file="test_override_dtypes.tsv",
   ...:     sep="\t",
   ...:     columns=[0, 2, 4],
   ...:     dtypes=[
   ...:         pl.Utf8,
   ...:         pl.Int32,
   ...:         pl.Date
   ...:     ]
   ...: )
   ...: 
   ...: print(patent_df)
shape: (9, 3)
┌──────────┬────────┬────────────┐
│ id       ┆ number ┆ date       │
│ ---      ┆ ---    ┆ ---        │
│ str      ┆ date   ┆ str        │
╞══════════╪════════╪════════════╡
│ 10000000 ┆ null   ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000001 ┆ null   ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000002 ┆ null   ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000003 ┆ null   ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...      ┆ ...    ┆ ...        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000005 ┆ null   ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000006 ┆ null   ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000007 ┆ null   ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ D242583  ┆ null   ┆ 2018-06-19 │
└──────────┴────────┴────────────┘

# Use a dictionary.
In [9]: patent_df = pl.read_csv(
   ...:     file="test_override_dtypes.tsv",
   ...:     sep="\t",
   ...:     columns=[0, 2, 4],
   ...:     dtypes={
   ...:         "id": pl.Utf8,
   ...:         "number": pl.Int32,
   ...:         "date": pl.Date
   ...:     }
   ...: )
   ...: 
   ...: print(patent_df)
shape: (9, 3)
┌──────────┬──────────┬────────────┐
│ id       ┆ number   ┆ date       │
│ ---      ┆ ---      ┆ ---        │
│ str      ┆ i32      ┆ date       │
╞══════════╪══════════╪════════════╡
│ 10000000 ┆ 10000000 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000001 ┆ 10000001 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000002 ┆ 10000002 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000003 ┆ 10000003 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...      ┆ ...      ┆ ...        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000005 ┆ 10000005 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000006 ┆ 10000006 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000007 ┆ 10000007 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ D242583  ┆ 10000008 ┆ 2018-06-19 │
└──────────┴──────────┴────────────┘
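
The name-keyed approach can also be illustrated with the standard library alone. The following is a minimal sketch, not polars code: `SAMPLE_TSV` is two rows abridged from short.tsv, and `CONVERTERS` and `read_selected` are hypothetical names chosen for this example. Because each converter is looked up by column name, the result does not depend on where the column sits in the file.

```python
import csv
import io
from datetime import date

# Two rows abridged from short.tsv (only five of the eleven columns are kept).
SAMPLE_TSV = (
    '"id"\t"type"\t"number"\t"country"\t"date"\n'
    '"10000000"\t"utility"\t"10000000"\t"US"\t"2018-06-19"\n'
    '"D242583"\t"utility"\t"10000008"\t"US"\t"2018-06-19"\n'
)

# Converters keyed by column name, mirroring the dtypes dict in the session above.
CONVERTERS = {"id": str, "number": int, "date": date.fromisoformat}


def read_selected(text, converters):
    """Keep only the named columns and convert each one by name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {name: convert(row[name]) for name, convert in converters.items()}
        for row in reader
    ]


rows = read_selected(SAMPLE_TSV, CONVERTERS)
print(rows)
```

Note that the last id, D242583, is not numeric, which is why the id column has to stay a string.
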

ghuls commented 2 years ago

@ritchie46 pl.Date doesn't seem to work when provided in a dtypes list, but it works when using a dtypes dict.

In [25]: patent_df = pl.read_csv(
    ...:     file="test_override_dtypes.tsv",
    ...:     sep="\t",
    ...:     columns=[0, 2, 4],
    ...:     dtypes=[
    ...:         pl.Utf8,
    ...:         pl.Utf8,
    ...:         pl.Int32,
    ...:         pl.Utf8,
    ...:         pl.Date,
    ...:     ]
    ...: )
    ...: 
    ...: print(patent_df)
shape: (9, 3)
┌──────────┬──────────┬──────┐
│ id       ┆ number   ┆ date │
│ ---      ┆ ---      ┆ ---  │
│ str      ┆ i32      ┆ date │
╞══════════╪══════════╪══════╡
│ 10000000 ┆ 10000000 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000001 ┆ 10000001 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000002 ┆ 10000002 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000003 ┆ 10000003 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...      ┆ ...      ┆ ...  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000005 ┆ 10000005 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000006 ┆ 10000006 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000007 ┆ 10000007 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ D242583  ┆ 10000008 ┆ null │
└──────────┴──────────┴──────┘

ritchie46 commented 2 years ago

Could it be that we first assign column types and then do the projection?
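
The ordering in question can be modelled without polars. The sketch below is a plain-Python illustration, not the library's implementation: dtype names are ordinary strings, and `assign_before_projection` and `assign_after_projection` are hypothetical helpers. It pairs a positional dtype list with the columns either before or after the `columns=[0, 2, 4]` selection.

```python
from __future__ import annotations

from typing import Sequence

# Column names taken from the header row of short.tsv in this issue.
FILE_COLUMNS = [
    "id", "type", "number", "country", "date", "abstract",
    "title", "kind", "num_claims", "filename", "withdrawn",
]


def assign_before_projection(dtypes: Sequence[str], projection: Sequence[int]) -> dict[str, str]:
    """Pair the dtype list with the file's leading columns, then select columns."""
    by_position = dict(zip(FILE_COLUMNS, dtypes))
    selected = [FILE_COLUMNS[i] for i in projection]
    # Columns without an assigned dtype fall back to inference.
    return {name: by_position.get(name, "inferred") for name in selected}


def assign_after_projection(dtypes: Sequence[str], projection: Sequence[int]) -> dict[str, str]:
    """Select columns first, then pair the dtype list with the selected columns."""
    selected = [FILE_COLUMNS[i] for i in projection]
    return dict(zip(selected, dtypes))


requested = ["Utf8", "Int32", "Date"]
before = assign_before_projection(requested, [0, 2, 4])
after = assign_after_projection(requested, [0, 2, 4])
print(before)
print(after)
```

With the list from the report, the first variant puts Date on "number" and leaves "date" to inference, which matches the str / date / str header in the output above. The second variant gives the mapping the reporter expected.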

ghuls commented 2 years ago

Could it be that we first assign column types and then do the projection?

Yes, that is what was happening until now.

It is now partially fixed (except for pl.Date with column indices):

In [3]: patent_df = pl.read_csv(
   ...:     file="test_override_dtypes.tsv",
   ...:     sep="\t",
   ...:     columns=[0, 2, 4],
   ...:     dtypes=[
   ...:         pl.Utf8,
   ...:         pl.Int32,
   ...:         pl.Date
   ...:     ]
   ...: )
   ...: 
   ...: print(patent_df)
shape: (9, 3)
┌──────────┬──────────┬──────┐
│ id       ┆ number   ┆ date │
│ ---      ┆ ---      ┆ ---  │
│ str      ┆ i32      ┆ date │
╞══════════╪══════════╪══════╡
│ 10000000 ┆ 10000000 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000001 ┆ 10000001 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000002 ┆ 10000002 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000003 ┆ 10000003 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...      ┆ ...      ┆ ...  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000005 ┆ 10000005 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000006 ┆ 10000006 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10000007 ┆ 10000007 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ D242583  ┆ 10000008 ┆ null │
└──────────┴──────────┴──────┘

In [4]: patent_df = pl.read_csv(
   ...:     file="test_override_dtypes.tsv",
   ...:     sep="\t",
   ...:     columns=["id", "number", "date"],
   ...:     dtypes=[
   ...:         pl.Utf8,
   ...:         pl.Int32,
   ...:         pl.Date
   ...:     ]
   ...: )
   ...: 
   ...: print(patent_df)
shape: (9, 3)
┌──────────┬──────────┬────────────┐
│ id       ┆ number   ┆ date       │
│ ---      ┆ ---      ┆ ---        │
│ str      ┆ i32      ┆ date       │
╞══════════╪══════════╪════════════╡
│ 10000000 ┆ 10000000 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000001 ┆ 10000001 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000002 ┆ 10000002 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000003 ┆ 10000003 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...      ┆ ...      ┆ ...        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000005 ┆ 10000005 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000006 ┆ 10000006 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10000007 ┆ 10000007 ┆ 2018-06-19 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ D242583  ┆ 10000008 ┆ 2018-06-19 │
└──────────┴──────────┴────────────┘

seasonedfish commented 2 years ago

You have a typo in your argument. It should be dtypes instead of dypes.

pl.Date doesn't seem to work, when provided as a dtypes list, but works when using a dtypes dict.

Ah, I see. I originally used a dict, but I guess that didn't work because I had the typo 🤦‍♂️

I'm glad we were able to spot and fix the list issue from this though. Thank you for your trouble!
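
On the typo itself: the report shows no error for the misspelled dypes= keyword, which is consistent with a function that accepts arbitrary extra keyword arguments. Whether read_csv in 0.13.51 actually did so is an assumption here. The two reader functions below are hypothetical stand-ins, not polars code; they only show how a misspelled keyword can be silently dropped in one signature and rejected in another.

```python
def lenient_reader(file, sep=",", dtypes=None, **kwargs):
    """Hypothetical reader that accepts, and ignores, unknown keywords."""
    return {"file": file, "sep": sep, "dtypes": dtypes}


def strict_reader(file, sep=",", dtypes=None):
    """Hypothetical reader with a closed signature."""
    return {"file": file, "sep": sep, "dtypes": dtypes}


# The misspelled keyword is swallowed by **kwargs, so dtypes keeps its default.
lenient_result = lenient_reader("short.tsv", sep="\t", dypes=["Utf8", "Int64", "Date"])
print(lenient_result["dtypes"])

# Without **kwargs, Python rejects the unknown keyword at call time.
try:
    strict_reader("short.tsv", sep="\t", dypes=["Utf8", "Int64", "Date"])
except TypeError as exc:
    error = str(exc)
    print(error)
```

A closed signature turns this kind of typo into an immediate TypeError instead of a silent fallback to inferred dtypes.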