endo64 opened 1 month ago
I don't see how it would be faster. You still have to load the whole row and only then remove the unused columns from it. That has the same computational complexity as loading all of the rows and then removing the unused columns from all of them; the latter merely has higher peak RAM usage. In fact, such per-row filtering would even be slower in /as-columns mode, as removing each whole column once would be faster than doing that for every row.
Besides, I don't think we should bake into load features that are orthogonal to it, as our goal is reducing complexity, not increasing it. If we need this way of removing columns from data, let it be a separate function.
If this is such a common thing for you, why not simply wrap load at the mezz level?
multi-pick: function [data indices] [map-each/only i indices [:data/:i]]  ;) ideally
multi-pick: function [data indices] [  ;) faster currently
    buf: clear []
    foreach i indices [append/only buf :data/:i]
    copy buf
]
load-only: function [source columns /header /as-columns] [
    data: load-csv/:header/:as-columns source
    either any [header as-columns] [
        ;) data is keyed by column title: normalize numeric indices into titles
        unless string? columns/1 [
            headers: keys-of data
            columns: map-each i columns [headers/:i]
        ]
        remove-each [title column] data [not find columns title]
    ][
        ;) data is a block of rows: normalize titles into positional indices
        if string? columns/1 [
            headers: keys-of data
            columns: map-each c columns [index? find headers c]
        ]
        map-each/self/only row data [multi-pick row columns]
    ]
]
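Hypothetical usage, with the file name and column titles made up for illustration:

subset: load-only %data.csv [1 3]                     ;) plain rows: keep the 1st and 3rd fields
subset: load-only/header %data.csv ["name" "email"]   ;) header mode: keep only these titled columns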
Ultimately we want the codecs to be incremental, so you would also be able to filter out data as it appears; that would also eliminate the problem of handling the multiple result formats a decoder can produce or an encoder can accept.
This also ties into the idea of having a table! datatype, where row/column operations would be a given.
You are right. Somehow I thought load-csv actually loads the values, like dates and integers etc.; that's why I said it would be faster.
I usually work with relatively big CSV files with lots of columns (over 1000) exported from other systems, and then I process them with Red.
Even though it is not difficult to add an intermediate step that deletes the unwanted columns from a CSV file, it would be nice to have a refinement for choosing which columns get loaded. That way loading big files would also be faster.
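A minimal sketch of what such a refinement could do internally: skip the unwanted fields at parse time instead of loading everything first. The load-csv-columns helper below is hypothetical, and its naive split ignores CSV quoting rules:

load-csv-columns: function [file [file!] wanted [block!]] [
    collect [
        foreach line read/lines file [           ;) stream the file line by line
            fields: split line #","              ;) naive field split: no quoted-comma handling
            keep/only multi-pick fields wanted   ;) copy only the requested fields
        ]
    ]
]

With over 1000 columns, copying only the few wanted fields per line avoids ever materializing the full table.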