jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.07k stars 223 forks source link

Reduce the overhead of `DataType`s #1469

Open teh-cmc opened 1 year ago

teh-cmc commented 1 year ago

Fixes #439

The entire PR pretty much comes down to this diff:

@@ -70,7 +111,7 @@ pub enum DataType {
     /// * An absolute time zone offset of the form +XX:XX or -XX:XX, such as +07:30
     /// When the timezone is not specified, the timestamp is considered to have no timezone
     /// and is represented _as is_
-    Timestamp(TimeUnit, Option<String>),
+    Timestamp(TimeUnit, Option<Arc<String>>),
     /// An [`i32`] representing the elapsed time since UNIX epoch (1970-01-01)
     /// in days.
     Date32,
@@ -100,16 +141,16 @@ pub enum DataType {
     /// A variable-length UTF-8 encoded string whose offsets are represented as [`i64`].
     LargeUtf8,
     /// A list of some logical data type whose offsets are represented as [`i32`].
-    List(Box<Field>),
+    List(Arc<Field>),
     /// A list of some logical data type with a fixed number of elements.
-    FixedSizeList(Box<Field>, usize),
+    FixedSizeList(Arc<Field>, usize),
     /// A list of some logical data type whose offsets are represented as [`i64`].
-    LargeList(Box<Field>),
+    LargeList(Arc<Field>),
     /// A nested [`DataType`] with a given number of [`Field`]s.
-    Struct(Vec<Field>),
+    Struct(Arc<Vec<Field>>),
     /// A nested datatype that can represent slots of differing types.
     /// Third argument represents mode
-    Union(Vec<Field>, Option<Vec<i32>>, UnionMode),
+    Union(Arc<Vec<Field>>, Option<Arc<Vec<i32>>>, UnionMode),
     /// A nested type that is represented as
     ///
     /// List<entries: Struct<key: K, value: V>>
@@ -135,7 +176,7 @@ pub enum DataType {
     /// The metadata is structured so that Arrow systems without special handling
     /// for Map can make Map an alias for List. The "layout" attribute for the Map
     /// field must have the same contents as a List.
-    Map(Box<Field>, bool),
+    Map(Arc<Field>, bool),
     /// A dictionary encoded array (`key_type`, `value_type`), where
     /// each array element is an index of `key_type` into an
     /// associated dictionary of `value_type`.
@@ -148,7 +189,7 @@ pub enum DataType {
     /// arrays or a limited set of primitive types as integers.
     ///
     /// The `bool` value indicates the `Dictionary` is sorted if set to `true`.
-    Dictionary(IntegerType, Box<DataType>, bool),
+    Dictionary(IntegerType, Arc<DataType>, bool),
     /// Decimal value with precision and scale
     /// precision is the number of digits in the number and
     /// scale is the number of decimal places.
@@ -157,12 +198,15 @@ pub enum DataType {
     /// Decimal backed by 256 bits
     Decimal256(usize, usize),
     /// Extension type.
-    Extension(String, Box<DataType>, Option<String>),
+    Extension(String, Arc<DataType>, Option<Arc<String>>),
 }

everything else is just a lot of grunt work and pain to accommodate for these new types.

As mentioned in https://github.com/jorgecarleitao/arrow2/issues/439#issuecomment-1506984699: I went for the path of least resistance, so this isn't optimal, but it is already quite the improvement.

I have branches ready for arrow2_convert, polars and rerun. In Rerun, we've seen up to 50% reduced memory requirements in some use cases with this PR.

ritchie46 commented 1 year ago

Thanks for the PR. This hits a lot of code so I want to tune in @jorgecarleitao as well.

In Rerun, we've seen up to 50% reduced memory requirements in some use cases with this PR.

Could you elaborate a bit on this? What did you benchmark? How was the data type such a huge bottleneck?

What is the new data type size and what was the old one?

Somewhat related, In polars we use smallstring which might also be an interesting route.

codecov[bot] commented 1 year ago

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (3ddc6a1) 83.39% compared to head (3c3c6ed) 83.96%.

:exclamation: Current head 3c3c6ed differs from pull request most recent head 40541b4. Consider uploading reports for the commit 40541b4 to get more accurate results

Files Patch % Lines
src/datatypes/mod.rs 89.74% 4 Missing :warning:
src/io/avro/read/schema.rs 50.00% 3 Missing :warning:
src/array/struct_/mod.rs 0.00% 2 Missing :warning:
src/compute/cast/dictionary_to.rs 0.00% 2 Missing :warning:
src/io/parquet/read/schema/convert.rs 98.07% 2 Missing :warning:
src/temporal_conversions.rs 50.00% 2 Missing :warning:
src/io/parquet/write/mod.rs 50.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1469 +/- ## ========================================== + Coverage 83.39% 83.96% +0.57% ========================================== Files 391 387 -4 Lines 43008 41739 -1269 ========================================== - Hits 35867 35048 -819 + Misses 7141 6691 -450 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

emilk commented 8 months ago

Any chance to get this merged? We do a lot of cloning of datatypes, and the memory use adds up extremely quickly. Using Arc reduces the memory footprint tremendously.