microsoft / DNS-Challenge

This repo contains the scripts, models, and required files for the Deep Noise Suppression (DNS) Challenge.
Creative Commons Attribution 4.0 International

Duplicated data in read_speech #120

Open giamic opened 2 years ago

giamic commented 2 years ago

I downloaded the fullband dataset and noticed that inside datasets_fullband/clean_fullband/read_speech there is a second read_speech folder of about 117 GB. At first glance, all the files inside datasets_fullband/clean_fullband/read_speech/read_speech appear to be identical to files already present directly in datasets_fullband/clean_fullband/read_speech. This seems to be confirmed by the sha1 values in the provided file:

import pandas as pd

data = pd.read_csv("dns4-datasets-files-sha1.csv.bz2", names=["size", "sha1", "path"])
data_read = data[data["path"].str.startswith("datasets_fullband/clean_fullband/read_speech")]
len(data_read["sha1"])
Out[34]: 321996
len(data_read["sha1"].unique())
Out[35]: 196038

Is this an error? Did a lot of duplicated data just make it to the zipped archive by mistake? Did it take the place of other data that we were supposed to receive?
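If the duplicates really are byte-identical, the manifest check above can be extended to list the redundant paths directly. A minimal sketch, using the same (size, sha1, path) column names as the snippet; the small inline DataFrame is a stand-in for dns4-datasets-files-sha1.csv.bz2, and the nested paths shown are illustrative:

```python
import pandas as pd

# Stand-in for the real manifest loaded from dns4-datasets-files-sha1.csv.bz2.
# Two rows share a sha1: one original file and its nested duplicate.
data = pd.DataFrame({
    "size": [100, 100, 200],
    "sha1": ["aaa", "aaa", "bbb"],
    "path": [
        "datasets_fullband/clean_fullband/read_speech/a.wav",
        "datasets_fullband/clean_fullband/read_speech/read_speech/a.wav",
        "datasets_fullband/clean_fullband/read_speech/b.wav",
    ],
})

# Keep the first path per sha1; every later row with the same hash is a
# byte-identical duplicate that could safely be deleted from disk.
dupes = data[data.duplicated(subset="sha1", keep="first")]
print(dupes["path"].tolist())
```

On the real manifest this would flag 321996 - 196038 = 125958 redundant entries, matching the counts reported above.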

wqr319 commented 1 year ago

I faced the same problem. Many audio files appear twice, so I think it may indeed be a mistake.

lingzhic commented 8 months ago

Got the same issue as well. Has anyone figured it out?