techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0
680 stars 35 forks source link

`:column-whitelist` thoughts #350

Closed harold closed 8 months ago

harold commented 1 year ago

Two thoughts:

  1. Tried :column-whitelist for the first time today:
> (ds/->dataset [{"a" 1 "b" 1} {"a" 2 "b" 2}])
_unnamed [2 2]:

| a | b |
|--:|--:|
| 1 | 1 |
| 2 | 2 |
> (ds/->dataset [{"a" 1 "b" 1} {"a" 2 "b" 2}]
                {:column-whitelist ["a"]})
_unnamed [2 2]:

| a | b |
|--:|--:|
| 1 | 1 |
| 2 | 2 |

In the second case, I expected to get a dataset with no column "b". Does that make sense?


  1. Would it be possible to get :column-allowlist and :column-blocklist as synonyms for :column-whitelist and :column-blacklist? Being allowed or blocked doesn't have anything to do with being black or white. :hugs:
harold commented 1 year ago

[not urgent - I can proceed presently with acceptable performance w/ ds/select-columns]

cnuernber commented 1 year ago

Yes - both make sense. the whitelist/blacklist are used and supported specifically only with csv and parquet file loading systems and the make the parsing far cheaper. They aren't currently passed into the new-dataset system.

Or - put more generally - it is up to the specific loading extension to either support them or not.

harold commented 1 year ago

Thanks! That helps me understand a lot.

I get now that there is more time to be saved w/ an allow list on (for example) csv over json, because a lot of the parsing is already done to make the sequence of maps in json, where csv there is more work to save by skipping entire columns.

I am going to proceed with a pathway something like json -> ds/->dataset -> ds/select-columns -> ds/row-map for now, because I think w/ all the data I need to process this way this wont be the performance bottleneck. I'll circle back around if it does become a pain/blocker.


Separately, I do think having :column-allowlist and :column-blocklist as synonyms for :column-whitelist and :column-blacklist is a good idea.

cnuernber commented 1 year ago

Yes very good. Much clearer and they don't rely on anachronisms.

otfrom commented 1 year ago

I've proposed something in charred that is an addition rather than a change to the API that should allow this: https://github.com/cnuernber/charred/pull/24

cnuernber commented 8 months ago

Closing this - it appears addressed.