pik-piam / quitte

Bits and pieces of code to use with quitte-style data frames

allow filtering of large data sets in read.quitte() #75

Closed 0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q closed 11 months ago

0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q commented 11 months ago

To process large data sets, such as IIASA database snapshots, read.quitte() reads the provided files (other than Excel files) in chunks of chunk_size lines and applies filter.function() to each chunk. This allows data to be filtered piece by piece without exceeding available memory.

filter.function is a function that takes one argument, a quitte data frame of the chunk just read, and is expected to return a data frame. It should usually contain all the filters that would otherwise be applied after all the data has been read in.

Suppose there is a file big_IIASA_snapshot.csv from which only data for the REMIND and MESSAGE models for the years 2020 to 2060 is of interest. Normally, this data would be processed as

library(dplyr)   # provides %>%, filter() and between()
library(quitte)  # provides read.quitte()
read.quitte(file = 'big_IIASA_snapshot.csv') %>%
    filter(grepl('^(REMIND|MESSAGE)', .data$model),
           between(.data$period, 2020, 2060))

If, however, big_IIASA_snapshot.csv is too large to be read in completely, it can be read and filtered chunk by chunk using

read.quitte(file = 'big_IIASA_snapshot.csv',
            filter.function = function(x) {
                x %>%
                    filter(grepl('^(REMIND|MESSAGE)', .data$model),
                           between(.data$period, 2020, 2060))
            })
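
To make the chunk-wise idea concrete, here is a minimal base R sketch of reading a file in pieces and filtering each piece before combining the results. It is not the read.quitte() implementation: read_chunked() is a hypothetical helper, the toy file stands in for a real snapshot, and the long-format columns model, period and value are assumptions made for the illustration.

```r
# Hypothetical helper (not part of quitte): read a CSV in chunks of
# `chunk_size` data lines and apply `filter.function` to each chunk.
read_chunked <- function(path, filter.function = identity, chunk_size = 2L) {
    con <- file(path, open = "r")
    on.exit(close(con))
    header <- readLines(con, n = 1L)       # column names, reused for every chunk
    kept <- list()
    repeat {
        lines <- readLines(con, n = chunk_size)
        if (length(lines) == 0L) break     # end of file reached
        chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
        kept[[length(kept) + 1L]] <- filter.function(chunk)
    }
    do.call(rbind, kept)                   # only the filtered rows are combined
}

# Toy long-format data standing in for a large snapshot (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("model,period,value",
             "REMIND 3.0,2010,1",
             "REMIND 3.0,2030,2",
             "IMAGE 3.2,2030,3",
             "MESSAGEix,2050,4",
             "MESSAGEix,2070,5"), csv_path)

result <- read_chunked(
    csv_path,
    filter.function = function(x) {
        x[grepl("^(REMIND|MESSAGE)", x$model) &
              x$period >= 2020 & x$period <= 2060, ]
    })
```

At no point does more than one chunk of unfiltered rows sit in memory, which is the property that matters for large snapshots. The chunk size of two lines is only there to force several chunks on the toy data.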

close #72

0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q commented 11 months ago

For one thing, there is no point in trying to compete with highly optimised C code using R code in terms of performance. For another, my solution pivots the periods to long format before filtering, whereas yours only pivots five lines of data. The way around that would be to filter periods separately from everything else, but that is quite messy and, in my opinion, not worth the headache. It might be more useful to load snapshots into an SQL database and query that instead of reading files.
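
For reference, a rough sketch of what filtering periods separately from everything else could look like on a wide-format chunk. None of this is quitte code: filter_then_pivot() is a hypothetical helper, and the column layout (a model column, a variable column, four-digit year columns) is assumed for the illustration.

```r
# Toy wide-format chunk; column names and values are illustrative assumptions
wide <- data.frame(
    model    = c("REMIND 3.0", "IMAGE 3.2", "MESSAGEix"),
    variable = "Emissions|CO2",
    `2010`   = c(1, 2, 3),
    `2030`   = c(4, 5, 6),
    `2070`   = c(7, 8, 9),
    check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: filter rows and period columns while the data is
# still wide, then pivot only what is left to long format.
filter_then_pivot <- function(wide, model_pattern, first, last) {
    is_period <- grepl("^[0-9]{4}$", names(wide))
    periods   <- as.integer(names(wide)[is_period])
    keep_cols <- names(wide)[is_period][periods >= first & periods <= last]
    id_cols   <- names(wide)[!is_period]

    # row filter on the non-period columns
    rows <- wide[grepl(model_pattern, wide$model), , drop = FALSE]

    # one block of long rows per remaining period column
    long <- lapply(keep_cols, function(p) {
        cbind(rows[id_cols],
              period = rep(as.integer(p), nrow(rows)),
              value  = rows[[p]])
    })
    do.call(rbind, long)
}

long <- filter_then_pivot(wide, "^(REMIND|MESSAGE)", 2020, 2060)
```

The messy part is visible even here: the period filter has to be expressed as a column selection rather than as part of a single filter() call, so one generic filter.function no longer covers both.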