pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/

Export ArrowArrayStream from polars data frame #5

Closed paleolimbot closed 1 year ago

paleolimbot commented 1 year ago

Building on the excellent experiments of @sorhawell in https://github.com/rpolars/rpolars/issues/4 and https://github.com/rpolars/rpolars/compare/main...nanoarrow , this is an attempt to export data frames to the Arrow C Stream interface.

This doesn't compile yet, of course, but hopefully somebody who actually does Rust can help here! The error really does hit at the crux of the matter, which is that the DataFrame has to outlive the stream. In C++ one could do something like this using shared pointers and a virtual deleter...I'm lost as to how this should be done here but am totally willing to learn!

wjones127 commented 1 year ago

So the problem is the iterator borrows the DataFrame; instead you'll want to create an iterator that owns the DataFrame. (The Rust equivalent of shared_ptr is Arc, but it looks like Polars DataFrames are mutable so instead of using Arcs they are cheaply copy-able.)

So you might need to define a struct like:

struct OwnedDataFrameIterator {
  df: polars::frame::DataFrame,
  iter: polars::frame::RecordBatchIter
}

impl OwnedDataFrameIterator {
  fn new(df: polars::frame::DataFrame) -> Self {
    Self { df, iter: df.iter_chunks() }
  }
}

impl Iterator for OwnedDataFrameIterator {
  type Item = Result<Box<dyn Array>, arrow::error::Error>;

  fn next(&mut self) -> Option<Self::Item> {
    self.iter.next()
  }
}

sorhawell commented 1 year ago

I must admit I'm a complete rookie when it comes to the Arrow interface.

I think, similar to @wjones127's suggestion, you could collect the Arrow arrays into an owned vector and then pass that on. I have added a collect(), a rechunk() and a clone(), and removed a 'move'. The method below will compile.

What would happen to the swapped stream pointer if the DataFrame memory is dropped on the Rust side? Let's find out :) Otherwise we can also export a clone of the DataFrame to keep the memory alive.

pub fn export_stream(&mut self, stream_ptr: &str) {
        let schema = self.0.schema().to_arrow();
        let data_type = DataType::Struct(schema.fields);
        let field = ArrowField::new("", data_type.clone(), false);

        self.0.rechunk(); //avoids panic if series' are chunked, see iter_chunks() doc
        let df = &self.0;
        let chunk_vec: Vec<_> = df
            .iter_chunks()
            .map(
                |item| -> Result<Box<dyn arrow::array::Array>, arrow::error::Error> {
                    let array = arrow::array::StructArray::new(
                        data_type.clone(),
                        item.into_arrays(),
                        std::option::Option::None,
                    );
                    Ok(Box::new(array))
                },
            )
            .collect();

        let chunk_vec_boxed = Box::new(chunk_vec.into_iter());

        let mut stream = arrow::ffi::export_iterator(chunk_vec_boxed, field);
        let stream_out_ptr_addr: usize = stream_ptr.parse().unwrap();
        let stream_out_ptr = stream_out_ptr_addr as *mut arrow::ffi::ArrowArrayStream;
        unsafe {
            std::ptr::swap_nonoverlapping(
                stream_out_ptr,
                &mut stream as *mut arrow::ffi::ArrowArrayStream,
                1,
            );
        }
    }
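
The pointer hand-off in export_stream above can be tried in isolation. The following is a minimal std-only sketch, not the rpolars implementation: FakeStream is a hypothetical stand-in for arrow::ffi::ArrowArrayStream, and addr_string and export_into are illustrative names. It shows the same three steps: the consumer passes the address of an empty struct as a decimal string, the producer parses it back into a raw pointer, and std::ptr::swap_nonoverlapping moves the produced value into the consumer's slot.

```rust
use std::num::ParseIntError;
use std::ptr;

// Hypothetical stand-in for arrow::ffi::ArrowArrayStream. `None` plays the
// role of an empty (released) stream, `Some` the role of a live one.
#[derive(Debug, PartialEq)]
struct FakeStream {
    payload: Option<Vec<i32>>,
}

// Consumer side: format the address of a pre-allocated struct as a decimal
// string so that it can cross a boundary that only understands strings.
fn addr_string(stream: &mut FakeStream) -> String {
    (stream as *mut FakeStream as usize).to_string()
}

// Producer side: parse the address and swap the produced value into the
// consumer's slot. Returns an error if the string is not a valid integer.
fn export_into(addr: &str, mut produced: FakeStream) -> Result<(), ParseIntError> {
    let out = addr.parse::<usize>()? as *mut FakeStream;
    // SAFETY: the caller must guarantee that `addr` is the address of a live,
    // properly aligned FakeStream that nothing else accesses during the swap.
    unsafe {
        ptr::swap_nonoverlapping(out, &mut produced as *mut FakeStream, 1);
    }
    // `produced` now holds the consumer's empty placeholder and is dropped here.
    Ok(())
}
```

Calling export_into with the string returned by addr_string for a boxed, empty FakeStream leaves the payload in the consumer's box. The approach is only sound while the address refers to a live, exclusively accessed struct; nothing in the string itself can check that.
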
paleolimbot commented 1 year ago

Thanks to you both!

I'm giving Will's a shot first because I'd like to know for my own benefit how to do this kind of thing (and because I imagine it will translate more directly to exporting a reader from a lazy frame which is what I'm really excited about). I have

struct OwnedDataFrameIterator<'a> {
    df: polars::frame::DataFrame,
    iter: polars::frame::RecordBatchIter<'a>,
    data_type: arrow::datatypes::DataType
}

impl OwnedDataFrameIterator<'_> {
    fn new(df: polars::frame::DataFrame ) -> Self {
        let schema = df.schema().to_arrow();
        let data_type = DataType::Struct(schema.fields);
        let iter = polars::frame::RecordBatchIter {
            columns: df.get_columns(),
            idx: 0,
            n_chunks: df.n_chunks().unwrap(),
        };

        Self { df, iter, data_type }
    }
}

impl Iterator for OwnedDataFrameIterator<'_> {
    type Item = Result<Box<dyn arrow::array::Array>, arrow::error::Error>;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.iter.next();
        match item {
            std::option::Option::Some(i) => {
                let array = arrow::array::StructArray::new(self.data_type.clone(), i.into_arrays(), std::option::Option::None);
                Some(std::result::Result::Ok(Box::new(array)))
            }
            _ => None
        }
    }
}

...which almost compiles except for:

error[E0515]: cannot return value referencing function parameter `df`
  --> src/rdataframe/mod.rs:40:9
   |
35 |             columns: df.get_columns(),
   |                      ---------------- `df` is borrowed here
...
40 |         Self { df, iter, data_type }
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ returns a value referencing data owned by the current function
   |
   = help: use `.collect()` to allocate the iterator

error[E0505]: cannot move out of `df` because it is borrowed
  --> src/rdataframe/mod.rs:40:16
   |
31 |     fn new(df: polars::frame::DataFrame ) -> Self {
   |                                              ---- return type is OwnedDataFrameIterator<'1>
...
35 |             columns: df.get_columns(),
   |                      ---------------- borrow of `df` occurs here
...
40 |         Self { df, iter, data_type }
   |         -------^^-------------------
   |         |      |
   |         |      move out of `df` occurs here
   |         returning this value requires that `df` is borrowed for `'1`

Some errors have detailed explanations: E0505, E0515.
For more information about an error, try `rustc --explain E0505`.
error: could not compile `rpolars` due to 2 previous errors

I have a feeling I'm missing a lifetime specifier somewhere but I don't know where to put it!

wjones127 commented 1 year ago

Okay I think I led you astray just a little. I forgot that you can't have self-referential structs in Rust. Basically references are just pointers, and Rust doesn't guarantee it won't move around your struct, invalidating the pointer.

So this means that you can't use polars::frame::RecordBatchIter, but instead need to create a modified version of its implementation (haven't tested, but roughly correct):

pub struct OwnedDataFrameIterator {
    columns: Vec<Series>,
    data_type: arrow::datatypes::DataType,
    idx: usize,
    n_chunks: usize,
}

impl OwnedDataFrameIterator {
    fn new(df: polars::frame::DataFrame ) -> Self {
        let schema = df.schema().to_arrow();
        let data_type = DataType::Struct(schema.fields);

        Self { 
            columns: df.get_columns().clone(),
            data_type,
            idx: 0,
            n_chunks: df.n_chunks().unwrap()
        }
    }
}

impl Iterator for OwnedDataFrameIterator {
    type Item = Result<Box<dyn arrow::array::Array>, arrow::error::Error>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.idx >= self.n_chunks {
            None
        } else {
            // create a batch of the columns with the same chunk no.
            let batch_cols = self.columns.iter().map(|s| s.to_arrow(self.idx)).collect();
            self.idx += 1;

            let chunk = ArrowChunk::new(batch_cols);
            let array = arrow::array::StructArray::new(self.data_type.clone(), chunk.into_arrays(), std::option::Option::None);
            Some(std::result::Result::Ok(Box::new(array)))
        }
    }
}
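
The owning pattern can be checked without polars or arrow. Below is a minimal std-only sketch under stated assumptions: Column is a hypothetical stand-in for a Series (a column stored as a list of chunks), and OwnedChunkIter and make_iter are illustrative names. Because the iterator owns its columns and keeps only an index, it has no lifetime parameter and can be returned from the function that builds it, which is exactly what the borrowed RecordBatchIter version could not do.

```rust
// Hypothetical stand-in for a polars Series: one column split into chunks.
#[derive(Debug)]
struct Column {
    chunks: Vec<Vec<i64>>,
}

// Iterator that owns its data. It stores an index instead of a reference
// into itself, so the struct is free to move.
struct OwnedChunkIter {
    columns: Vec<Column>,
    idx: usize,
    n_chunks: usize,
}

impl OwnedChunkIter {
    fn new(columns: Vec<Column>) -> Self {
        // Assumption: every column has the same number of chunks, as in a
        // data frame with aligned chunks. Indexing panics otherwise.
        let n_chunks = columns.first().map_or(0, |c| c.chunks.len());
        Self { columns, idx: 0, n_chunks }
    }
}

impl Iterator for OwnedChunkIter {
    // One batch: the chunk with the same index taken from every column.
    type Item = Vec<Vec<i64>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.idx >= self.n_chunks {
            return None;
        }
        let batch: Vec<Vec<i64>> = self
            .columns
            .iter()
            .map(|c| c.chunks[self.idx].clone())
            .collect();
        self.idx += 1;
        Some(batch)
    }
}

// The columns are created here and moved into the iterator, so returning the
// iterator does not reference any local variable.
fn make_iter() -> OwnedChunkIter {
    let a = Column { chunks: vec![vec![1, 2], vec![3]] };
    let b = Column { chunks: vec![vec![10, 20], vec![30]] };
    OwnedChunkIter::new(vec![a, b])
}
```

Collecting make_iter() yields two batches, one per chunk index, each containing one chunk from every column.
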
paleolimbot commented 1 year ago

Thanks Will! It compiles!!

Now I have:

> df = pl$DataFrame(iris)
> stream = nanoarrow::nanoarrow_allocate_array_stream()
> df$export_stream(nanoarrow::nanoarrow_pointer_addr_chr(stream))
Error: syntax error: export_stream is not a method/attribute of the class DataFrame 
 when calling:
 df$export_stream

(I'm sure it's a really dumb error!)

sorhawell commented 1 year ago

Not at all :) Some context:

In py-polars the DataFrame has four class levels (same for Series, Expr, LazyFrame, ...):

rpolars used to have a fourth R6 class, but it led to more boilerplate code and heavier objects, not just an external pointer.

Instead, the external pointer has a private set of methods, which are the extendr wrappers, and a public set of methods derived from the private methods and pure R functions.

You can access the private functions via .pr (pr for private), which is the root namespace of all private methods. Note that the private functions are pure functions that take a DataFrame as an argument.

df = pl$DataFrame(iris)

#print with private external method
.pr$DataFrame$print(df) #here I use the private print method

#print with public internal method
df$print()

#implementation of public method
> df$print
function() {
  .pr$DataFrame$print(self) #self is S3/extendr-magic and refers to the lhs of $, thus the DataFrame external pointer the method is called from.
  invisible(self)
}

#print with S3 method
print(df)

I tried to document the classes and API here: https://rpolars.github.io/reference/DataFrame_class.html https://rpolars.github.io/reference/index.html#rpolars-api-and-namespace https://rpolars.github.io/reference/pl.html https://rpolars.github.io/reference/dot-pr.html

You can immediately use .pr$DataFrame$export_stream(df) for testing.

To make it a public method, you need to implement a function named exactly:

DataFrame_export_stream = function() {
  #some impl details
  .pr$DataFrame$export_stream(self,...)  
}

You would likely place it in R/dataframe__frame.R.

Then rebuild, and the public function is available.

sorhawell commented 1 year ago

All these shenanigans are to get syntax that is as close as possible to py-polars. The implementation code should also look the same, e.g. expr__expr.R is named like that because its mirror file is found in py-polars/polars/internals/expr/expr.py. Most method implementations are in the same order and look very similar. The docs look similar too, where I did not copy-paste them outright :)

paleolimbot commented 1 year ago

Beauty!

library(rpolars)

df <- pl$DataFrame(nycflights13::flights)

bench::mark(
  as.data.frame(df),
  nanoarrow = {
    stream <- df$export_stream()
    nanoarrow::convert_array_stream(stream, size = df$shape[1])
  }
)
#> # A tibble: 2 × 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 as.data.frame(df)   46.1ms   47.6ms      21.0    38.7MB     84.0
#> 2 nanoarrow           22.5ms   24.6ms      40.8    38.7MB     18.6

Created on 2023-01-06 with reprex v2.0.2

paleolimbot commented 1 year ago

A few more comparisons:

library(arrow, warn.conflicts = FALSE)
library(rpolars)

df <- pl$DataFrame(nycflights13::flights)
n <- df$shape[1]

bench::mark(
  as.data.frame(df),
  nanoarrow = {
    stream <- df$export_stream()
    nanoarrow::convert_array_stream(stream, size = n)
  },
  # much faster because strings are never materialized to R
  arrow_table = {
    stream <- df$export_stream()
    reader <- arrow::as_record_batch_reader(stream)
    arrow::as_arrow_table(reader)
  },
  # much faster because of ALTREP chunked arrays for strings
  arrow_df = {
    stream <- df$export_stream()
    reader <- arrow::as_record_batch_reader(stream)
    as.data.frame(arrow::as_arrow_table(reader))
  },
  # with materializing strings
  arrow_df = {
    stream <- df$export_stream()
    reader <- arrow::as_record_batch_reader(stream)
    as.data.frame(arrow::as_arrow_table(reader))[n:1, ]
  },
  check = FALSE
)
#> # A tibble: 5 × 6
#>   expression             min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 as.data.frame(df)   46.8ms   47.1ms      21.1    38.7MB     31.6
#> 2 nanoarrow           23.9ms   24.8ms      39.6   38.81MB     29.7
#> 3 arrow_table        238.5µs  249.9µs    3903.     1.61MB     24.8
#> 4 arrow_df           480.8µs  494.1µs    1995.   635.78KB     24.4
#> 5 arrow_df            41.3ms   41.5ms      24.1   65.64MB     80.4

Created on 2023-01-06 with reprex v2.0.2

sorhawell commented 1 year ago

Nice! Looking forward to trying it out on Monday.

sorhawell commented 1 year ago

I tried to check for memory safety by dropping df after creating the stream, but I failed to cause any errors. I guess that even though df is dropped, the lower-level Arrow arrays are not.

library(nycflights13)
library(rpolars)
library(nanoarrow)

df <- pl$DataFrame(nycflights13::flights)
n = df$shape[1]

#make stream
stream = df$export_stream()

# dropping df and GC
rm(df)
gc()

# all good, does not break anything
df_na = nanoarrow::convert_array_stream(stream, size = n)
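
The guess above (the frame is gone but the arrays are not) is what shared ownership would produce. The following std-only sketch illustrates that mechanism; it is not a statement about polars internals. Frame and Stream are hypothetical stand-ins, and the assumption is that column buffers are reference counted with Arc, so an exported stream holds its own handles to the same buffers.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a data frame with reference-counted buffers.
struct Frame {
    columns: Vec<Arc<Vec<f64>>>,
}

// Hypothetical stand-in for an exported stream. It keeps its own handles,
// so it does not depend on the Frame staying alive.
struct Stream {
    buffers: Vec<Arc<Vec<f64>>>,
}

impl Frame {
    fn export_stream(&self) -> Stream {
        // Arc::clone only increments a reference count; the buffer contents
        // are not copied.
        Stream {
            buffers: self.columns.iter().map(Arc::clone).collect(),
        }
    }
}

// Mirrors the R experiment: export, drop the frame, then read the stream.
// Returns (owners before drop, owners after drop, sum of the values read).
fn drop_frame_then_read() -> (usize, usize, f64) {
    let frame = Frame {
        columns: vec![Arc::new(vec![1.0, 2.0, 3.0])],
    };
    let stream = frame.export_stream();
    let owners_before = Arc::strong_count(&stream.buffers[0]);

    // Analogous to rm(df); gc() on the R side.
    drop(frame);
    let owners_after = Arc::strong_count(&stream.buffers[0]);

    // The buffer is still valid because the stream is its remaining owner.
    let total: f64 = stream.buffers[0].iter().sum();
    (owners_before, owners_after, total)
}
```

Under this assumption the buffer has two owners right after export and one after the frame is dropped, and it is freed only when the stream itself is released.
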
sorhawell commented 1 year ago

@paleolimbot do you have any idea why this happens? The reverse order (arrow first, then rpolars) is fine.

Restarting R session...

* Project '~/Documents/projs/r-polars' loaded. [renv 0.16.0]
> library(rpolars)
> library(arrow)
Error: package or namespace load failed for ‘arrow’:
 .onLoad failed in loadNamespace() for 'arrow', details:
  call: NULL
  error: syntax error: set_pointer is not a method/attribute of the class DataType 
 when calling:
 library(arrow)
In addition: Warning message:
In arrow__UnregisterRExtensionType(extension_name) :
  restarting interrupted promise evaluation
>

paleolimbot commented 1 year ago

I see that too! My guess is that both arrow and rpolars implement [[ or $ for DataType. I think this wouldn't be a problem if they were both R6 (which I think is where the offending method lives for arrow)...I imagine you will have to rename any intersecting classes to avoid that problem. (My first hunch was a symbol collision between the two .so files, but I ran nm -g on both and couldn't find any in common).

paleolimbot commented 1 year ago

(It is rather rude of us to define like 10 million R6 class names in Arrow...we should have prefixed them somehow but we didn't know and that ship sailed a long time ago...)

paleolimbot commented 1 year ago

I redid this to use S3 methods (registered at runtime to avoid a hard dependency). I did some for arrow too, which means that you could pass these things directly into a bunch of Arrow functions and have it "just work" 🙂

library(arrow, warn.conflicts = FALSE)
library(nanoarrow)
library(rpolars)

df <- pl$DataFrame(nycflights13::flights)
as_nanoarrow_array_stream(df)
#> <nanoarrow_array_stream struct<year: int32, month: int32, day: int32, dep_time: int32, sched_dep_time: int32, dep_delay: double, arr_time: int32, sched_arr_time: int32, arr_delay: double, carrier: large_string, flight: int32, tailnum: large_string, origin: large_string, dest: large_string, air_time: double, distance: double, hour: double, minute: double, time_hour: double>>
#>  $ get_schema:function ()  
#>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)  
#>  $ release   :function ()
format(infer_nanoarrow_schema(df))
#> [1] "<nanoarrow_schema struct<year: int32, month: int32, day: int32, dep_time: int32, sched_dep_time: int32, dep_delay: double, arr_time: int32, sched_arr_time: int32, arr_delay: double, carrier: large_string, flight: int32, tailnum: large_string, origin: large_string, dest: large_string, air_time: double, distance: double, hour: double, minute: double, time_hour: double>>"
as_record_batch_reader(df)
#> RecordBatchReader
#> year: int32
#> month: int32
#> day: int32
#> dep_time: int32
#> sched_dep_time: int32
#> dep_delay: double
#> arr_time: int32
#> sched_arr_time: int32
#> arr_delay: double
#> carrier: large_string
#> flight: int32
#> tailnum: large_string
#> origin: large_string
#> dest: large_string
#> air_time: double
#> distance: double
#> hour: double
#> minute: double
#> time_hour: double
as_arrow_table(df)
#> Table
#> 336776 rows x 19 columns
#> $year <int32>
#> $month <int32>
#> $day <int32>
#> $dep_time <int32>
#> $sched_dep_time <int32>
#> $dep_delay <double>
#> $arr_time <int32>
#> $sched_arr_time <int32>
#> $arr_delay <double>
#> $carrier <large_string>
#> $flight <int32>
#> $tailnum <large_string>
#> $origin <large_string>
#> $dest <large_string>
#> $air_time <double>
#> $distance <double>
#> $hour <double>
#> $minute <double>
#> $time_hour <double>

Created on 2023-01-10 with reprex v2.0.2

sorhawell commented 1 year ago

I will try to build the PR on a Windows machine and see what the failure is about.

sorhawell commented 1 year ago

I think nanoarrow is failing to build on Windows at the moment. The error is the same via the GitHub runner or when I try to install nanoarrow on my old gaming PC.


> remotes::install_github("apache/arrow-nanoarrow/r", build = FALSE)
Downloading GitHub repo apache/arrow-nanoarrow@HEAD
Installing package into 'C:/Users/soren/AppData/Local/R/cache/R/renv/library/r-polars-276af647/R-4.2/x86_64-w64-mingw32'
(as 'lib' is unspecified)
* installing *source* package 'nanoarrow' ...
** using staged installation

   **********************************************
   WARNING: this package has a configure script
         It probably needs manual configuration
   **********************************************

** libs
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c altrep.c -o altrep.o
In file included from altrep.c:26:
array.h:24:10: fatal error: nanoarrow.h: No such file or directory
   24 | #include "nanoarrow.h"
      |          ^~~~~~~~~~~~~
compilation terminated.
make: *** [C:/R/R-42~1.2/etc/x64/Makeconf:253: altrep.o] Error 1
ERROR: compilation failed for package 'nanoarrow'
* removing 'C:/Users/soren/AppData/Local/R/cache/R/renv/library/r-polars-276af647/R-4.2/x86_64-w64-mingw32/nanoarrow'
Warning messages:
1: In untar2(tarfile, files, list, exdir, restore_times) :
  skipping pax global extended headers
2: In untar2(tarfile, files, list, exdir, restore_times) :
  skipping pax global extended headers
3: In i.p(...) :
  installation of package 'C:/Users/soren/AppData/Local/Temp/Rtmp0uNYJK/remotes3088258725c6/apache-arrow-nanoarrow-848ffc5/r' had non-zero exit status
>

paleolimbot commented 1 year ago

Oh yeah...I almost certainly need configure.win since ./configure doesn't run on windows 🤦

sorhawell commented 1 year ago

Oh yeah...I almost certainly need configure.win since ./configure doesn't run on windows 🤦

Would that be something like rewriting configure and placing it in a Makevars.win file?

paleolimbot commented 1 year ago

Traditionally that kind of thing is baked into src/Makevars.win instead of configure.win...in that repo the R package pulls nanoarrow.h from the parent directory to make sure everything is in sync; however, I've never tested install via remotes on Windows.

paleolimbot commented 1 year ago

Ok...this should be fixed from nanoarrow's side (I tested remotes::install_github("apache/arrow-nanoarrow") in a VM and it worked).

sorhawell commented 1 year ago

Very close! I think remotes can reset permissions for configure but renv cannot. Maybe if you set the permissions for configure and configure.win with chmod 777 or something like that, it should work.

PS C:\Users\soren\Documents\projs\r-polars> C:\R\R-4.2.2\bin\R.exe

R version 4.2.2 (2022-10-31 ucrt) -- "Innocent and Trusting"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

* Project '~/projs/r-polars' loaded. [renv 0.16.0]
* The project may be out of sync -- use `renv::status()` for more details.
[Previously saved workspace restored]

> renv::restore()
The following package(s) will be updated:

# CRAN ===============================
- arrow        [* -> 10.0.1]
- assertthat   [* -> 0.2.1]
- bit          [* -> 4.0.5]
- bit64        [* -> 4.0.5]

# GitHub =============================
- nanoarrow    [* -> apache/arrow-nanoarrow:r@HEAD]

Do you want to proceed? [y/N]: y
Retrieving 'https://api.github.com/repos/apache/arrow-nanoarrow/tarball/848ffc5d3f99dabbc3fdb225d42be6d47e7c5402' ...
        OK [downloaded 179.6 Kb in 0.7 secs]
Installing assertthat [0.2.1] ...
        OK [linked cache]
Installing bit [4.0.5] ...
        OK [linked cache]
Installing bit64 [4.0.5] ...
        OK [linked cache]
Installing arrow [10.0.1] ...
        OK [linked cache]
Installing nanoarrow [0.0.0.9000] ...
        FAILED
Error installing package 'nanoarrow':
=====================================

* installing *source* package 'nanoarrow' ...
** using staged installation

   **********************************************
   WARNING: this package has a configure script
         It probably needs manual configuration
   **********************************************

** libs
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c altrep.c -o altrep.o
In file included from altrep.c:26:
array.h:24:10: fatal error: nanoarrow.h: No such file or directory
   24 | #include "nanoarrow.h"
      |          ^~~~~~~~~~~~~
compilation terminated.
make: *** [C:/R/R-42~1.2/etc/x64/Makeconf:253: altrep.o] Error 1
ERROR: compilation failed for package 'nanoarrow'
* removing 'C:/Users/soren/Documents/projs/r-polars/renv/staging/1/nanoarrow'
Error: install of package 'nanoarrow' failed [error code 1]
Traceback (most recent calls last):
12: renv::restore()
11: renv_restore_run_actions(project, diff, current, lockfile, rebuild)
10: renv_install_impl(records)
 9: renv_install_staged(records)
 8: renv_install_default(records)
 7: handler(package, renv_install_package(record))
 6: renv_install_package(record)
 5: withCallingHandlers(renv_install_package_impl(record), error = function(e) {
        vwritef("\tFAILED")
        writef(e$output)
    })
 4: renv_install_package_impl(record)
 3: r_cmd_install(package, path)
 2: r_exec_error(package, output, "install", status)
 1: stop(error)
> remotes::install_github("apache/arrow-nanoarrow/r")
Downloading GitHub repo apache/arrow-nanoarrow@HEAD
✔  checking for file 'C:\Users\soren\AppData\Local\Temp\RtmpQNEHz7\remotesffc5a62625d\apache-arrow-nanoarrow-da7b5ec\r/DESCRIPTION'
─  preparing 'nanoarrow':
✔  checking DESCRIPTION meta-information
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building 'nanoarrow_0.0.0.9000.tar.gz'
   Warning: file 'nanoarrow/configure' did not have execute permissions: corrected

Installing package into 'C:/Users/soren/AppData/Local/R/cache/R/renv/library/r-polars-276af647/R-4.2/x86_64-w64-mingw32'
(as 'lib' is unspecified)
* installing *source* package 'nanoarrow' ...
** using staged installation
Fetched bundled nanoarrow from https://github.com/apache/arrow-nanoarrow/tree/main/dist
** libs
Warning: this package has a non-empty 'configure.win' file,
so building only the main architecture

gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c altrep.c -o altrep.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c array.c -o array.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c array_stream.c -o array_stream.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c array_view.c -o array_view.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c buffer.c -o buffer.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c convert.c -o convert.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c convert_array.c -o convert_array.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c convert_array_stream.c -o convert_array_stream.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c infer_ptype.c -o infer_ptype.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c init.c -o init.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c materialize.c -o materialize.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c nanoarrow.c -o nanoarrow.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c pointers.c -o pointers.o
g++ -std=gnu++11  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -mfpmath=sse -msse2 -mstackrealign  -c pointers_cpp.cc -o pointers_cpp.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c schema.c -o schema.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c util.c -o util.o
gcc  -I"C:/R/R-42~1.2/include" -DNDEBUG     -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2 -mstackrealign  -c version.c -o version.o
g++ -std=gnu++11 -shared -s -static-libgcc -o nanoarrow.dll tmp.def altrep.o array.o array_stream.o array_view.o buffer.o convert.o convert_array.o convert_array_stream.o infer_ptype.o init.o materialize.o nanoarrow.o pointers.o pointers_cpp.o schema.o util.o version.o -LC:/rtools42/x86_64-w64-mingw32.static.posix/lib/x64 -LC:/rtools42/x86_64-w64-mingw32.static.posix/lib -LC:/R/R-42~1.2/bin/x64 -lR
installing to C:/Users/soren/AppData/Local/R/cache/R/renv/library/r-polars-276af647/R-4.2/x86_64-w64-mingw32/00LOCK-nanoarrow/00new/nanoarrow/libs/x64
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (nanoarrow)
Warning messages:
1: In untar2(tarfile, files, list, exdir, restore_times) :
  skipping pax global extended headers
2: In untar2(tarfile, files, list, exdir, restore_times) :
  skipping pax global extended headers
>

paleolimbot commented 1 year ago

I'm not sure I can help with the renv issue...once nanoarrow is on CRAN that problem will go away (and I've already filed my thoughts on the use of renv here 🙂 ).

sorhawell commented 1 year ago

Warning: file 'nanoarrow/configure' did not have execute permissions: corrected

I think that will come back to haunt you later anyhow. I'm not sure the CRAN win-builder and/or R CMD check will allow that. Also, perhaps someone will build nanoarrow without remotes and file an issue. I know file permissions are annoying; it is just something to be aware of when working on cross-platform projects.

paleolimbot commented 1 year ago

I believe the file permissions are correct; however, Windows git might not respect or know how to deal with attributes?

I can also try a Makevars-based solution tomorrow!

[Screenshot: Screen Shot 2023-01-11 at 4 17 06 PM]

sorhawell commented 1 year ago

OK, you're right, you did not set the permissions wrong. I think it might be this issue: https://github.com/r-lib/devtools/issues/1799

Cloning nanoarrow and using renv::install("./r") works fine:

PS C:\Users\soren\Documents\projs> git clone git@github.com:apache/arrow-nanoarrow.git nano2
Cloning into 'nano2'...
remote: Enumerating objects: 1527, done.
remote: Counting objects: 100% (395/395), done.
remote: Compressing objects: 100% (203/203), done.
remote: Total 1527 (delta 226), reused 320 (delta 187), pack-reused 1132
Receiving objects: 100% (1527/1527), 3.44 MiB | 2.84 MiB/s, done.
Resolving deltas: 100% (912/912), done.
PS C:\Users\soren\Documents\projs> cd nano2
PS C:\Users\soren\Documents\projs\nano2> C:\R\R-4.2.2\bin\R.exe
> renv::install("./r")
Installing nanoarrow [0.0.0.9000] ...
        OK [built from source]

sorhawell commented 1 year ago

Not sure what changed: when checking out the PR on a Windows machine it builds just fine, whereas the Ubuntu and macOS builds fail tests here :/

sorhawell commented 1 year ago

Now everything worked ¯\(°_o)/¯ ¯\_(ツ)_/¯

sorhawell commented 1 year ago

All looks fine. I have merged in the latest rust-polars version; if everything passes again, I will merge.