ryanwebster90 / snip-dedup

MIT License

Possible incorrect indexing in snip_download.py? #9

Open trojblue opened 1 year ago

trojblue commented 1 year ago

Hi, I looked through the code, and the relevant section of `snip_download.py` looks like this:

    is_dup_all = np.load(dedup_set_path).ravel()
    abs_ind = 0
    for n in range(start, end):
        print(f"downloading metadata file {n}/{end}")
        url = f"https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_{n:04d}.parquet"
        response = requests.get(url)
        parquet_path = os.path.join(metadata_dir, f"metadata_{n:04d}.parquet")
        open(parquet_path, "wb").write(response.content)

        # perform the deduplication
        md = pd.read_parquet(parquet_path)
        non_dup_chunk = is_dup_all[abs_ind : abs_ind + len(md.index)]

        # take only non-dupped (uniques)
        non_dup_chunk = np.logical_not(non_dup_chunk)

        # make sure there is at least one unique
        non_dup_chunk[0] = True
        md = md[non_dup_chunk]

        # overwrite metadata
        md.to_parquet(parquet_path)
        abs_ind += len(md.index)

I believe there might be an oversight here:

`abs_ind` is incremented by `len(md.index)` *after* `md` has been filtered, i.e. by the deduplicated row count. Wouldn't it be more accurate to increment it by the total number of entries in the original Parquet file, before deduplication? Otherwise every subsequent slice of `is_dup_all` starts too early, and the mask drifts further out of alignment with each metadata file.
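For what it's worth, here is a small sketch of what I mean (toy sizes, hypothetical helper name): advancing `abs_ind` by the filtered length makes later slices of `is_dup_all` start too early, while capturing the original row count before filtering keeps them aligned.

```python
import numpy as np
import pandas as pd

# Global dedup mask covering two metadata chunks of 3 rows each (toy data).
is_dup_all = np.array([False, True, False,   # chunk 0
                       True, False, False])  # chunk 1
chunks = [pd.DataFrame({"url": ["a", "b", "c"]}),
          pd.DataFrame({"url": ["d", "e", "f"]})]

def chunk_starts(advance_by_filtered: bool):
    """Return the offset at which each chunk's slice of is_dup_all begins."""
    abs_ind, starts = 0, []
    for md in chunks:
        starts.append(abs_ind)
        n_rows = len(md.index)  # original length, captured before filtering
        keep = np.logical_not(is_dup_all[abs_ind : abs_ind + n_rows])
        keep[0] = True          # keep at least one row, as in snip_download.py
        filtered = md[keep]
        # The current code advances by the *filtered* length; the proposed
        # fix advances by the original length captured above.
        abs_ind += len(filtered.index) if advance_by_filtered else n_rows
    return starts

print(chunk_starts(True))   # [0, 2] -> chunk 1's slice is misaligned
print(chunk_starts(False))  # [0, 3] -> chunk 1 correctly starts at row 3
```

With the current behavior, chunk 1 reads the mask starting at row 2, so row "c"'s flag would be applied to row "d", and the error compounds with every shard downloaded.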