Open achalddave opened 1 year ago
Could you add an example of a dataset for which this is needed, please?
I needed this for YFCC 100M - did you want that in the README/in the repo somewhere?
Yes, if you could add it in https://github.com/rom1504/img2dataset/tree/main/dataset_examples, that would be great.
I also need this~ (I have a crawler that gives me many raw web image-text pairs with newlines in the text title.) Looking forward to it being merged~ @achalddave
Could you please rebase on head and resolve the conflicts?
Some datasets (e.g., YFCC) have newlines in captions, which causes pyarrow's csv module to error by default. This PR allows passing
--newlines-in-captions True
to img2dataset, which will in turn tell pyarrow to allow newlines in CSV values.
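
For reviewers who want to see the failure mode concretely, here is a minimal sketch of why a caption with an embedded newline breaks line-oriented CSV reading. It uses only Python's standard `csv` module and is not the code in this PR. The sample rows and the helper names `naive_rows` and `quoted_rows` are made up for illustration. My understanding is that the corresponding switch on the pyarrow side is `pyarrow.csv.ParseOptions(newlines_in_values=True)`, but treat that mapping as an assumption rather than a description of the diff.

```python
import csv
import io

# Hypothetical url/caption file: the second caption contains a newline
# inside a quoted field, as YFCC-style captions can.
RAW = (
    "url,caption\n"
    "http://example.com/a.jpg,a plain caption\n"
    'http://example.com/b.jpg,"first line\nsecond line"\n'
)


def naive_rows(text):
    """Split on physical lines, ignoring CSV quoting (illustrative helper)."""
    return [line.split(",") for line in text.splitlines()]


def quoted_rows(text):
    """Parse with a quote-aware reader that keeps embedded newlines in a field."""
    return list(csv.reader(io.StringIO(text, newline="")))


naive = naive_rows(RAW)
parsed = quoted_rows(RAW)

# The naive split sees one more "row" than there are logical records,
# and the caption is cut in half across two of them.
print(len(naive), len(parsed))
print(repr(parsed[2][1]))
```

The naive split yields a broken extra row, while the quote-aware reader keeps the caption in one field. A parser that assumes one record per physical line hits the same mismatch, which is presumably why an explicit opt-in is needed.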