rom1504 / img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
MIT License
3.71k stars, 338 forks

Add option to allow newlines in captions #283

Open achalddave opened 1 year ago

achalddave commented 1 year ago

Some datasets (e.g., YFCC) have newlines in captions, which causes pyarrow's CSV reader to raise an error by default. This PR allows passing --newlines-in-captions True to img2dataset, which in turn tells pyarrow to allow newlines in CSV values.
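
For illustration, here is a minimal sketch of the underlying issue, not the code from this PR. The sample data and function names are hypothetical. It assumes the CSV reader in question is pyarrow's, where the relevant switch is ParseOptions(newlines_in_values=True). The standard-library parser is included so the behaviour can be checked without pyarrow installed.

```python
import csv
import io

# Hypothetical sample: the second caption contains a newline inside quotes.
SAMPLE_CSV = (
    "url,caption\n"
    'http://example.com/a.jpg,"one line"\n'
    'http://example.com/b.jpg,"first line\nsecond line"\n'
)


def parse_with_stdlib(text):
    """Parse with the standard csv module, which honors quoted newlines."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


def parse_with_pyarrow(text):
    """Parse with pyarrow, explicitly allowing newlines inside values.

    Assumption: pyarrow is installed. It is imported lazily so the rest
    of this sketch runs without it.
    """
    import pyarrow.csv as pacsv

    parse_options = pacsv.ParseOptions(newlines_in_values=True)
    table = pacsv.read_csv(
        io.BytesIO(text.encode("utf-8")), parse_options=parse_options
    )
    return table.to_pylist()


if __name__ == "__main__":
    # Splitting on line breaks sees three data lines, but there are only
    # two records: the quoted caption spans two physical lines.
    print(len(SAMPLE_CSV.splitlines()) - 1)
    for row in parse_with_stdlib(SAMPLE_CSV):
        print(repr(row["caption"]))
```

The point of the sketch is that a caption with an embedded newline is still a single CSV value as long as it is quoted, so the reader has to be told to look for records that span several physical lines.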

rom1504 commented 1 year ago

Could you please add an example of a dataset for which this is needed?

achalddave commented 1 year ago

I needed this for YFCC100M. Did you want that in the README or somewhere else in the repo?

rom1504 commented 1 year ago

Yes, if you could add it in https://github.com/rom1504/img2dataset/tree/main/dataset_examples, that would be great.

ldfandian commented 1 year ago

I also need this. My crawler produces many raw web image-text pairs with newlines in the text title. Looking forward to this being merged, @achalddave.

rom1504 commented 1 year ago

Could you please rebase on head and resolve the conflicts?