red / REP

Red Enhancement Process
BSD 3-Clause "New" or "Revised" License

[WISH] Choosing columns in LOAD-CSV #167

Open endo64 opened 1 month ago

endo64 commented 1 month ago

I usually work with relatively big CSV files with lots of columns (over 1000), exported from other systems, which I then process with Red.

Even though it is not difficult to add an intermediate step that deletes unwanted columns from a CSV file, it would be nice to have a refinement for choosing which columns get loaded. That way, loading would also be faster for big files.

columns: [1 5 27]
load-csv/columns data columns

;or

columns: ["id" "firstname" "lastname"]
load-csv/header/columns data columns
hiiamboris commented 1 month ago

I don't see how it would be faster. You would still have to load the whole row, then remove the unused columns from it. That has the same computational complexity as loading all of the rows and then removing the unused columns from all of them; the only difference is that the latter has higher peak RAM usage. In fact, such per-row filtering would even be slower in /as-columns mode, as removing whole columns once is faster than doing it for every row.
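To make the /as-columns point concrete, here is a hypothetical sketch (drop-columns is a made-up name, and it assumes the /as-columns result behaves like a map of column title to a block of values):

drop-columns: function [data [map!] unwanted [block!]] [
    ;) one removal per unwanted column, instead of one per loaded row
    foreach title unwanted [remove/key data title]
    data
]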

Besides, I don't think we should bake into load features that are orthogonal to it, as our goal is reducing complexity, not increasing it. If we need this way of removing columns from data, let it be a separate function.

If this is such a common thing for you to do, why not simply wrap the load at the mezz level?

multi-pick: function [data indices] [map-each/only i indices [:data/:i]]    ;) ideally
multi-pick: function [data indices] [                                       ;) faster currently
    buf: clear []
    foreach i indices [append/only buf :data/:i]
    copy buf
]
load-only: function [source columns /header /as-columns] [
    data: load-csv/:header/:as-columns source
    either any [header as-columns] [
        ;) data is keyed by column title: translate any indices into titles...
        unless string? columns/1 [
            headers: keys-of data
            columns: map-each i columns [headers/:i]
        ]
        ;) ...then drop every column whose title was not requested
        remove-each [title column] data [not find columns title]
    ][
        ;) data is a block of rows: translate any titles into indices...
        if string? columns/1 [
            headers: keys-of data
            columns: map-each c columns [index? find headers c]
        ]
        ;) ...then keep only the requested cells of every row
        map-each/self/only row data [multi-pick row columns]
    ]
]
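For example, a usage sketch assuming the definitions above (the file name %export.csv is made up):

;) pick columns by position from a headerless file
data: load-only %export.csv [1 5 27]

;) or pick columns by title when the first row is a header
data: load-only/header %export.csv ["id" "firstname" "lastname"]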

Ultimately we want the codecs to be incremental, so you would also be able to filter out data as it appears; that would also eliminate the issue of having to handle the multiple result formats a decoder can produce (or an encoder can accept).

This also ties to the idea of having a table! datatype, where row/column operations would be a given.

endo64 commented 1 month ago

You are right. Somehow I thought load-csv actually loaded the values (dates, integers, etc.); that's why I said it would be faster.