microsoft / DNS-Challenge

This repo contains the scripts, models, and required files for the Deep Noise Suppression (DNS) Challenge.
Creative Commons Attribution 4.0 International

Duplicated data in read_speech #120

Open giamic opened 2 years ago

giamic commented 2 years ago

I downloaded the fullband dataset and noticed that inside datasets_fullband/clean_fullband/read_speech there is a second read_speech folder of about 117 GB. At first glance, all the files inside datasets_fullband/clean_fullband/read_speech/read_speech appear to be identical to files already present directly in datasets_fullband/clean_fullband/read_speech. This seems to be confirmed by the sha1 values in the provided file:

import pandas as pd

data = pd.read_csv("dns4-datasets-files-sha1.csv.bz2", names=["size", "sha1", "path"])
data_read = data[data["path"].str.startswith("datasets_fullband/clean_fullband/read_speech")]
len(data_read["sha1"])
Out[34]: 321996
len(data_read["sha1"].unique())
Out[35]: 196038

Is this an error? Did a lot of duplicated data just make it to the zipped archive by mistake? Did it take the place of other data that we were supposed to receive?
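If the duplicates really are byte-identical, the manifest check above can be extended to list the redundant paths directly. A minimal sketch, using the same (size, sha1, path) column names as the snippet; the small inline DataFrame is a stand-in for dns4-datasets-files-sha1.csv.bz2, and the nested paths shown are illustrative:

```python
import pandas as pd

# Stand-in for the real manifest loaded from dns4-datasets-files-sha1.csv.bz2.
# Two rows share a sha1: one original file and its nested duplicate.
data = pd.DataFrame({
    "size": [100, 100, 200],
    "sha1": ["aaa", "aaa", "bbb"],
    "path": [
        "datasets_fullband/clean_fullband/read_speech/a.wav",
        "datasets_fullband/clean_fullband/read_speech/read_speech/a.wav",
        "datasets_fullband/clean_fullband/read_speech/b.wav",
    ],
})

# Keep the first path per sha1; every later row with the same hash is a
# byte-identical duplicate that could safely be deleted from disk.
dupes = data[data.duplicated(subset="sha1", keep="first")]
print(dupes["path"].tolist())
```

On the real manifest this would flag 321996 - 196038 = 125958 redundant entries, matching the counts reported above.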

wqr319 commented 1 year ago

I faced the same problem. Many audio files appear twice, so I think it may indeed be a mistake.

lingzhic commented 8 months ago

Got the same issue as well. Has anyone figured it out?