techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

zipfile->dataset-seq should ignore unknown file types #362

Closed by genmeblog 1 year ago

genmeblog commented 1 year ago

When a zip file contains nested folders, the folder entries are detected as file type :unknown and an exception is thrown.

I think it's fairly safe to ignore such entries (or print a warning) in that case.

$ unzip -l data.zip
Archive:  data.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2023-07-14 13:48   data/
      190  2022-01-21 16:19   data/family.csv
      907  2022-01-21 16:19   data/relig_income.csv
     2824  2022-01-21 16:19   data/us_rent_income.csv
   121348  2022-01-21 16:19   data/world_bank_pop.csv.gz
   134431  2022-01-21 16:19   data/who.csv.gz
        0  2023-07-14 13:49   data/iris/
     3716  2023-07-14 13:49   data/iris/iris.csv
      704  2022-01-21 16:19   data/stockstidyr.csv
     3716  2022-01-21 16:19   data/iris.csv
      421  2022-01-21 16:19   data/construction.csv
      348  2022-01-21 16:19   data/anscombe.csv
     1495  2022-01-21 16:19   data/fish_encounters.csv
    13448  2022-01-21 16:19   data/billboard.csv.gz
      144  2022-01-21 16:19   data/contacts.csv
     1368  2022-01-21 16:19   data/production.csv
      398  2022-01-21 16:19   data/warpbreaks.csv
---------                     -------
   285458                     17 files
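;; Aliases assumed from the stack trace below; tio is assumed to be tech.v3.io.
(require '[tech.v3.io :as tio]
         '[tech.v3.dataset.io :as ds-io]
         '[tech.v3.dataset.zip :as zip])
(with-open [io (-> (tio/input-stream "data.zip")
                   (java.util.zip.ZipInputStream.))]
  (ds-io/str->file-info (.getName (.getNextEntry io))))
;; => {:gzipped? false, :file-type :unknown}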
(zip/zipfile->dataset-seq "data.zip")
1. Unhandled java.lang.Exception
   Unrecognized read file type: :unknown

                    io.clj:   54  tech.v3.dataset.io/eval31702/fn
              MultiFn.java:  234  clojure.lang.MultiFn/invoke
                   zip.clj:   46  tech.v3.dataset.zip/load-zip-entry
                   zip.clj:   40  tech.v3.dataset.zip/load-zip-entry
                   zip.clj:   64  tech.v3.dataset.zip/zipfile->dataset-seq
                   zip.clj:   59  tech.v3.dataset.zip/zipfile->dataset-seq
                   zip.clj:   66  tech.v3.dataset.zip/zipfile->dataset-seq
                   zip.clj:   59  tech.v3.dataset.zip/zipfile->dataset-seq
harold commented 1 year ago

Agreed - do you have a small file that displays this behavior we could put in an automated test?

genmeblog commented 1 year ago

Yes, sure.

data.zip

zipfile->dataset-seq should return only the iris and billboard datasets, ignoring the data, iris, and billboard folders and the README.md file.
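The expected behavior can be sketched with plain Java interop, without depending on the library. This is only an illustration of the idea (skip folder entries and entries whose type is unknown), not the library's implementation or the actual fix. The names `known-extensions`, `file-type`, `loadable-entry?`, `loadable-entry-names`, and `zip-bytes` are hypothetical, and the fixture layout is an assumption based on the description above, not the contents of the attached data.zip.

```clojure
(require '[clojure.string :as str])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream]
        '[java.util.zip ZipEntry ZipInputStream ZipOutputStream])

;; Hypothetical set of loadable extensions. The real library decides this
;; through ds-io/str->file-info and its reader multimethod.
(def known-extensions #{"csv" "tsv"})

(defn file-type
  "Rough stand-in for str->file-info: drops an optional .gz suffix and maps
  the remaining extension to a keyword, or :unknown."
  [entry-name]
  (let [base (str/replace (str/lower-case entry-name) #"\.gz$" "")
        ext  (second (re-find #"\.([^./]+)$" base))]
    (if (contains? known-extensions ext)
      (keyword ext)
      :unknown)))

(defn loadable-entry?
  "True when the entry is a regular file with a recognized type."
  [^ZipEntry entry]
  (and (not (.isDirectory entry))
       (not= :unknown (file-type (.getName entry)))))

(defn loadable-entry-names
  "Returns, in archive order, the names of entries that pass loadable-entry?."
  [input-stream]
  (with-open [zis (ZipInputStream. input-stream)]
    (loop [acc []]
      (if-let [entry (.getNextEntry zis)]
        (recur (if (loadable-entry? entry)
                 (conj acc (.getName entry))
                 acc))
        acc))))

(defn zip-bytes
  "Builds an in-memory zip from [name content] pairs. Names ending in a
  slash are written as empty folder entries."
  [entries]
  (let [baos (ByteArrayOutputStream.)]
    (with-open [zos (ZipOutputStream. baos)]
      (doseq [[entry-name content] entries]
        (.putNextEntry zos (ZipEntry. ^String entry-name))
        (when-not (str/ends-with? entry-name "/")
          (.write zos (.getBytes ^String content "UTF-8")))
        (.closeEntry zos)))
    (.toByteArray baos)))

;; Assumed fixture layout: three folders, one README, two CSV files.
(def fixture
  (zip-bytes [["data/" ""]
              ["data/README.md" "notes"]
              ["data/iris/" ""]
              ["data/iris/iris.csv" "a,b\n1,2\n"]
              ["data/billboard/" ""]
              ["data/billboard/billboard.csv" "a,b\n3,4\n"]]))

(def result (loadable-entry-names (ByteArrayInputStream. fixture)))
```

With this fixture, `result` holds only the two CSV entry names. In the library itself, the same check could presumably be made on `(:file-type (ds-io/str->file-info name))`, which the snippet earlier in this thread shows returning `:unknown` for a folder entry.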