Closed genmeblog closed 1 year ago
When a zip file contains a nested folder structure, some entries (the folders) are detected as file type :unknown and an exception is thrown.
I think it's fairly safe to ignore such entries (or print a warning) in that case.
$ unzip -l data.zip
Archive:  data.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2023-07-14 13:48   data/
      190  2022-01-21 16:19   data/family.csv
      907  2022-01-21 16:19   data/relig_income.csv
     2824  2022-01-21 16:19   data/us_rent_income.csv
   121348  2022-01-21 16:19   data/world_bank_pop.csv.gz
   134431  2022-01-21 16:19   data/who.csv.gz
        0  2023-07-14 13:49   data/iris/
     3716  2023-07-14 13:49   data/iris/iris.csv
      704  2022-01-21 16:19   data/stockstidyr.csv
     3716  2022-01-21 16:19   data/iris.csv
      421  2022-01-21 16:19   data/construction.csv
      348  2022-01-21 16:19   data/anscombe.csv
     1495  2022-01-21 16:19   data/fish_encounters.csv
    13448  2022-01-21 16:19   data/billboard.csv.gz
      144  2022-01-21 16:19   data/contacts.csv
     1368  2022-01-21 16:19   data/production.csv
      398  2022-01-21 16:19   data/warpbreaks.csv
---------                     -------
   285458                     17 files
(with-open [io (-> (tio/input-stream "data.zip")
                   (java.util.zip.ZipInputStream.))]
  (ds-io/str->file-info (.getName (.getNextEntry io))))
;; => {:gzipped? false, :file-type :unknown}
(zip/zipfile->dataset-seq "data.zip")
1. Unhandled java.lang.Exception
   Unrecognized read file type: :unknown

        io.clj:   54  tech.v3.dataset.io/eval31702/fn
  MultiFn.java:  234  clojure.lang.MultiFn/invoke
       zip.clj:   46  tech.v3.dataset.zip/load-zip-entry
       zip.clj:   40  tech.v3.dataset.zip/load-zip-entry
       zip.clj:   64  tech.v3.dataset.zip/zipfile->dataset-seq
       zip.clj:   59  tech.v3.dataset.zip/zipfile->dataset-seq
       zip.clj:   66  tech.v3.dataset.zip/zipfile->dataset-seq
       zip.clj:   59  tech.v3.dataset.zip/zipfile->dataset-seq
Agreed - do you have a small file that displays this behavior we could put in an automated test?
Yes, sure.
data.zip
zipfile->dataset-seq should return only the iris and billboard datasets, ignoring the data, iris, and billboard folders and the README.md file.
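For illustration, here is a rough sketch of the "ignore such entries" idea using only java.util.zip, not the library's actual code. The helper names (make-zip-bytes, loadable-name?, loadable-entry-names) and the extension pattern are my own assumptions; the real decision of what counts as loadable would presumably come from ds-io/str->file-info. It builds a small zip in memory, then keeps only entries that are not directories and whose names look like csv/tsv files.

```clojure
(import '(java.io ByteArrayInputStream ByteArrayOutputStream)
        '(java.util.zip ZipEntry ZipInputStream ZipOutputStream))

(defn make-zip-bytes
  "Hypothetical test helper: builds an in-memory zip from [name content] pairs.
  A nil content produces an empty entry (used here for folder entries)."
  [entries]
  (let [bos (ByteArrayOutputStream.)]
    (with-open [zos (ZipOutputStream. bos)]
      (doseq [[^String entry-name ^String content] entries]
        (.putNextEntry zos (ZipEntry. entry-name))
        (when content
          (.write zos (.getBytes content "UTF-8")))
        (.closeEntry zos)))
    (.toByteArray bos)))

(defn loadable-name?
  "Assumption for this sketch only: treat csv/tsv (optionally gzipped) as loadable."
  [^String entry-name]
  (boolean (re-find #"(?i)\.(csv|tsv)(\.gz)?$" entry-name)))

(defn loadable-entry-names
  "Returns the names of zip entries that are neither directories nor of an
  unrecognized type, in archive order."
  [^bytes zip-bytes]
  (with-open [zis (ZipInputStream. (ByteArrayInputStream. zip-bytes))]
    (loop [names []]
      (if-let [^ZipEntry entry (.getNextEntry zis)]
        (recur (if (and (not (.isDirectory entry))
                        (loadable-name? (.getName entry)))
                 (conj names (.getName entry))
                 names))
        names))))

;; Layout loosely modeled on the test archive described above.
(def sample-zip
  (make-zip-bytes [["data/" nil]
                   ["data/README.md" "readme"]
                   ["data/iris/" nil]
                   ["data/iris/iris.csv" "a,b\n1,2\n"]
                   ["data/billboard/" nil]
                   ["data/billboard/billboard.csv.gz" "placeholder"]]))

(loadable-entry-names sample-zip)
```

ZipEntry's isDirectory is true for names ending in "/", so the folder entries are dropped before any file-type detection runs, and README.md is dropped by the name check rather than raising.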