rom1504 / laion-prepro

Get hundreds of millions of image+url pairs from the crawling at home dataset and preprocess them

Does https://github.com/rom1504/laion-prepro/blob/main/laion5B/safety/join.py work for non-en langs? #17

Closed PranshuBansalDev closed 2 years ago

PranshuBansalDev commented 2 years ago

Issue

Our team requires removal of all nsfw content (especially nudity)

Fix

I see here - https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ we are pointed at this script:

https://github.com/rom1504/laion-prepro/blob/main/laion5B/safety/join.py

However, I see references to 2B rather than 5B

Question

Is the script above usable for non-en langs? Or does the script only work for en langs?

rom1504 commented 2 years ago

it works for all 3 sub datasets of laion5B (laion2B-en laion2B-multi laion1B-nolang), you can get the tags from https://huggingface.co/datasets/laion/laion2B-multi-safety and similar links

you may also choose to directly download the prejoined collection that already contains the safety tags https://huggingface.co/datasets/laion/laion2B-en-joined (and similar)
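For readers who would rather do the join themselves, here is a minimal sketch of attaching safety tags to metadata rows. It is not the linked join.py: the `hash` join key, the row layout, and the example URLs are assumptions for illustration, so check the actual parquet schema before relying on it. Only the `punsafe` column name comes from this thread.

```python
def join_safety_tags(metadata_rows, safety_rows, key="hash"):
    """Left-join safety tags onto metadata rows by a shared key.

    `key` is a hypothetical join column, not confirmed by the thread.
    """
    # Index the safety tags by key so each lookup is a dict access.
    punsafe_by_key = {row[key]: row["punsafe"] for row in safety_rows}
    joined = []
    for row in metadata_rows:
        merged = dict(row)
        # Rows without a tag get None so they can be handled explicitly later.
        merged["punsafe"] = punsafe_by_key.get(row[key])
        joined.append(merged)
    return joined


# Toy rows, made up for the example.
metadata = [
    {"hash": 1, "URL": "https://example.com/a.jpg", "TEXT": "un chat"},
    {"hash": 2, "URL": "https://example.com/b.jpg", "TEXT": "ein Hund"},
]
safety = [{"hash": 1, "punsafe": 0.01}]
joined = join_safety_tags(metadata, safety)
```

At the scale of these datasets an in-memory dict will not fit, so a distributed join would likely be needed; the sketch only shows the shape of the operation.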

rom1504 commented 2 years ago

we computed these tags from the clip image embeddings so it works regardless of the language, you can see for yourself in https://rom1504.github.io/clip-retrieval/ that it detects (almost) all nudity (you can check/uncheck safety and search for some keywords that would usually result in unsafe results)

PranshuBansalDev commented 2 years ago

Is there any chance the laion2B-multi could have a laion2B-multi-joined? Or is that work recommended to be done by the consumers of the data?

rom1504 commented 2 years ago

Is there any chance the laion2B-multi could have a laion2B-multi-joined? Or is that work recommended to be done by the consumers of the data?

haha you found the missing piece. Yeah indeed that last join is still running, it will be available in a few hours. The other 2 datasets are available joined already ;)

PranshuBansalDev commented 2 years ago

A few unrelated questions about the dataset in general (please let me know if you'd rather deal with these in a separate ticket)

  1. https://huggingface.co/datasets/laion/laion2B-multi - will this eventually have a "dataset preview" available?
  2. Is there an additional column for lang info on laion2b-multi?
  3. What is the value of the nolang dataset?
  4. What do the numbers mean w.r.t. Number of unsafe samples with a probability threshold of 0.5: 0.033? Does it mean that 3.3% of the data is labelled as NSFW?
  5. Could we have a per dataset "metadata" similar to how you had one for the 400M case?

i.e. this thing was super helpful

URL and caption metadata dataset.
We provide 32 parquet files of size around 1GB (total 50GB) with the image URLs, the associated texts and additional metadata in the following format:

SAMPLE_ID | URL | TEXT | LICENSE | NSFW | similarity | WIDTH | HEIGHT

where

SAMPLE_ID: A unique identifier
LICENSE: Where we found a Creative Commons License in the image data, we named it here like, e.g. “creativecommons.org/licenses/by-nc-sa/3.0/” – otherwise you’ll find it here a “?”
NSFW: we used CLIP to estimate if the image has NSFW content. The estimation has been pretty conservative, reducing false negatives at the cost of more false positives. Possible values are “UNLIKELY”, “UNSURE” and “NSFW”.
similarity: Value of the cosine similarity between the text and image embedding
WIDTH and HEIGHT: image size as the image was embedded. We downsized originals that were larger than 4K to 4K.
The purpose of this metadata dataset is to download the images for the whole dataset or a subset of it by supplying it to the very efficient [img2dataset](https://github.com/rom1504/img2dataset) tool.

rom1504 commented 2 years ago

https://huggingface.co/datasets/laion/laion2B-multi - will this eventually have a "dataset preview" available?

I believe so, but I have no control over it, it's dependent on hf infra

Is there an additional column for lang info on laion2b-multi?

yes

What is the value of the nolang dataset?

I believe it's useful if you want to train on all languages at once. It probably contains data that is fairly unique as well. For example names often cannot be identified as a specific language and would appear more often in laion1B

What do the numbers mean w.r.t. Number of unsafe samples with a probability threshold of 0.5: 0.033 ? Does it mean that 3.3% of the data is labelled as NSFW?

yes. Note that the classifier is a bit conservative and will classify as NSFW pictures of "sexy" (and not naked) people for example
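To make the 3.3% reading concrete, here is a small sketch of how such a number can be computed: the share of samples whose unsafe probability reaches the threshold. The scores below are made up, and treating the threshold as inclusive (`>=`) is an assumption, since the thread does not say which comparison was used.

```python
def unsafe_fraction(punsafe_scores, threshold=0.5):
    """Return the share of samples whose punsafe score is at or above threshold."""
    scores = list(punsafe_scores)
    if not scores:
        return 0.0
    flagged = sum(1 for p in scores if p >= threshold)
    return flagged / len(scores)


# Toy scores, made up for the example: 2 of the 10 reach the 0.5 threshold.
scores = [0.01, 0.02, 0.97, 0.10, 0.03, 0.55, 0.04, 0.00, 0.20, 0.08]
fraction = unsafe_fraction(scores)
```

A reported value of 0.033 would then mean that about 3.3% of the samples were flagged at that threshold.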

Could we have a per dataset "metadata" similar to how you had one for the 400M case?

do you mean a description of fields ?

it's actually the same as laion400M for the non-joined metadata, for the joined metadata it has punsafe and pwatermark on top

but noted, I will add that to the post
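As an illustration of how the two extra columns could be used for the filtering asked about at the top of this issue, here is a hedged sketch that drops rows above a chosen `punsafe` (and optionally `pwatermark`) cutoff. The cutoff values, the row layout, and the choice to drop rows with missing scores are assumptions, not recommendations from the dataset authors.

```python
def keep_safe(rows, max_punsafe=0.1, max_pwatermark=None):
    """Keep rows whose punsafe (and optionally pwatermark) is within the cutoffs.

    Rows with a missing score are dropped, which is the cautious choice
    when the goal is to remove all NSFW content.
    """
    kept = []
    for row in rows:
        punsafe = row.get("punsafe")
        if punsafe is None or punsafe > max_punsafe:
            continue
        if max_pwatermark is not None:
            pwatermark = row.get("pwatermark")
            if pwatermark is None or pwatermark > max_pwatermark:
                continue
        kept.append(row)
    return kept


# Toy rows, made up for the example.
rows = [
    {"URL": "https://example.com/a.jpg", "punsafe": 0.02, "pwatermark": 0.10},
    {"URL": "https://example.com/b.jpg", "punsafe": 0.80, "pwatermark": 0.05},
    {"URL": "https://example.com/c.jpg", "punsafe": None, "pwatermark": 0.05},
    {"URL": "https://example.com/d.jpg", "punsafe": 0.05, "pwatermark": 0.90},
]
safe = keep_safe(rows)
safe_no_watermark = keep_safe(rows, max_pwatermark=0.5)
```

Since the classifier is described above as conservative, a lower `max_punsafe` trades more false positives for fewer unsafe images slipping through.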

rom1504 commented 2 years ago

tracking there https://github.com/rom1504/laion-prepro/issues/18

rom1504 commented 2 years ago

btw if you can say @PranshuBansalDev ; what are you working on? what are your plans with the datasets?

rom1504 commented 2 years ago

Ok multi joined is on hf too now

PranshuBansalDev commented 2 years ago

btw if you can say @PranshuBansalDev ; what are you working on? what are your plans with the datasets?

Sorry, I'm not able to disclose at this time :(

rom1504 commented 2 years ago

That's ok. I hope it works for you!

PranshuBansalDev commented 2 years ago

Feel free to close this one out, thank you so much!

rom1504 commented 2 years ago

https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ I've added the column descriptions and more stats there