pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Missing documentation for how various formats are turned into polars dataframes #14636

Open michaeleisel opened 7 months ago

michaeleisel commented 7 months ago

Description

Polars has great support for lots of different formats, and it seems to have picked reasonable ways of turning those formats into dataframes. It would be great to document these choices and the principles behind them. One principle I've heard from others is that polars always losslessly converts various data types into its internal formats. This is a great principle that can answer many questions, but it still leaves some murky areas that would be good to document. I think it would also be good, for the sake of explicitness, to document even trivial conversions, just so the user is clear (e.g., a JSON string being turned into a polars string). But here are some examples of questions that maybe have less obvious answers:

What I would love to see, personally, is a table listing each data type of each supported format and how it gets mapped to a polars data type. This is not at all a criticism of how polars converts from input data types to polars data types; I just think it would be great to add more docs explaining it to newcomers like myself.

Link

No response

michaeleisel commented 7 months ago

Interestingly, polars seems to handle timestamps that aren't losslessly convertible into a timestamp with microseconds. Here we have a dataframe that I made in parquet with a timestamp value of 1 nanosecond:

>>> pl.read_parquet('a.parquet')
shape: (1, 1)
┌───────────────────────────────┐
│ datetime_ns                   │
│ ---                           │
│ datetime[ns]                  │
╞═══════════════════════════════╡
│ 1970-01-01 00:00:00.000000001 │
└───────────────────────────────┘

So, I wonder if there's an inaccuracy in https://docs.pola.rs/user-guide/concepts/data-types/overview/ when it describes Datetime as "internally represented as microseconds".