red / REP

Red Enhancement Process

[WISH] Choosing columns in LOAD-CSV #167

Open endo64 opened 2 months ago

endo64 commented 2 months ago

I usually work with relatively big CSV files with lots of columns (over 1000) exported from other systems and then I process them with Red.

Even though it is not difficult to add an intermediate step to delete unwanted columns from a CSV file, it would be nice to have a refinement to choose which columns will be loaded. This way loading would also be faster for big files.

columns: [1 5 27]
load-csv/columns data columns

;or

columns: ["id" "firstname" "lastname"]
load-csv/header/columns data columns

hiiamboris commented 2 months ago

I don't see how it would be faster. You still have to load the whole row and then remove the unused columns from it. That has the same computational complexity as loading all of the rows and then removing the unused columns from all of them; the only difference is that the latter has higher peak RAM usage. In fact, such per-row filtering would even be slower in /as-columns mode, as removing whole columns once is faster than doing that for every row.
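To illustrate the column-mode case, here is a minimal sketch. It assumes load-csv/header returns a map! keyed by the header strings; keep-columns is a hypothetical helper name, not part of Red. The selection costs one lookup per kept column, no matter how many rows were loaded.

```red
Red []

; Hypothetical helper (not part of Red): keep only the named columns
; of a table stored as a map! of title -> block of values.
keep-columns: function [
    table  [map!]   "Column title -> block of column values"
    titles [block!] "Column titles to keep"
][
    result: make map! length? titles
    foreach title titles [
        ; one lookup per kept column, independent of the number of rows
        put result title select table title
    ]
    result
]

; Assumption: load-csv/header yields a map! keyed by the header strings.
data: load-csv/header {id,first,last^/1,Ann,Lee^/2,Bob,Ray}
probe keep-columns data ["id" "last"]
```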

Besides, I don't think we should bake features into load that are orthogonal to it, as our goal is reducing complexity, not increasing it. If we need this way of removing columns from data, let it be a separate function.

If this is such a common thing for you, why not simply wrap the load on a mezz level?

multi-pick: function [data indices] [map-each/only i indices [:data/:i]]    ;) ideally
multi-pick: function [data indices] [                                       ;) faster currently
    buf: clear []
    foreach i indices [append/only buf :data/:i]
    copy buf
]
load-only: function [source columns /header /as-columns] [
    data: load-csv/:header/:as-columns source                               ;) load-csv, as plain load has no /as-columns
    either any [header as-columns] [
        unless string? columns/1 [
            headers: keys-of data
            columns: map-each i columns [headers/:i]
        ] 
        remove-each [title column] data [not find columns title]
    ][
        if string? columns/1 [
            headers: keys-of data
            columns: map-each c columns [index? find headers c]
        ] 
        map-each/self/only row data [multi-pick row columns]
    ]
]
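
The wrapper above uses map-each, which does not appear to be a Red built-in and is presumably taken from a separate mezzanine library. For the plain row mode, a rough equivalent using only built-ins could look like the sketch below. The name pick-columns is hypothetical, and the sketch assumes that load-csv without refinements returns a block of rows, each a block of strings.

```red
Red []

; Hypothetical helper (not part of Red): keep only the columns at the
; given 1-based indices in every row.
pick-columns: function [
    rows    [block!] "Block of rows, each row a block of values"
    indices [block!] "1-based column indices to keep"
][
    result: make block! length? rows
    foreach row rows [
        picked: make block! length? indices
        foreach i indices [append/only picked pick row i]
        append/only result picked
    ]
    result
]

; Assumption: plain load-csv yields a block of rows of strings.
rows: load-csv {id,first,last,age^/1,Ann,Lee,30^/2,Bob,Ray,41}
probe pick-columns rows [1 3]
```

An index past the end of a row yields none in that position, since pick returns none out of range.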

Ultimately we want the codecs to be incremental, so you would also be able to filter out data as it appears, and that would also eliminate the issue of parsing multiple resulting formats that a decoder can produce or an encoder can accept.
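
As a rough illustration of filtering data as it appears, the sketch below handles one line at a time and keeps only the wanted fields. It is not the incremental codec itself: stream-columns is a hypothetical name, and the comma split is deliberately naive, so quoted fields containing commas are not handled.

```red
Red []

; Hypothetical sketch (not part of Red): filter columns line by line.
; Naive: splits on commas only, no support for quoted fields.
stream-columns: function [
    lines   [block!] "Block of CSV lines (strings), e.g. from read/lines"
    indices [block!] "1-based column indices to keep"
][
    result: make block! length? lines
    foreach line lines [
        fields: split line ","
        picked: make block! length? indices
        foreach i indices [append/only picked pick fields i]
        append/only result picked
        ; `fields` is no longer referenced after this point, so only
        ; the picked columns of each line are retained in `result`
    ]
    result
]

probe stream-columns ["id,first,last" "1,Ann,Lee" "2,Bob,Ray"] [1 3]
```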

This also ties to the idea of having a table! datatype, where row/column operations would be a given.

endo64 commented 2 months ago

You are right. Somehow I thought load-csv actually loads the values (dates, integers, etc.), which is why I said it would be faster.