google-research / google-research

Google Landmark Federated split is missing files #946

Open marcociccone opened 2 years ago

marcociccone commented 2 years ago

Hi! I've started working with the gldv2 dataset for FL using the split provided here. In particular, I'm using the dataset loader from TensorFlow Federated.

I've noticed that roughly 10% of the images are missing: this exception is raised when creating the clients' tfrecords. It looks like some filenames in the federated splits do not match the filenames in the original dataset.

I've collected the missing files in a JSON file that you can check. It is a list of dicts, one element per client, with the fields user_id, total, found, missing, and missing_files.
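
For anyone who wants to reproduce this check, here is a minimal sketch of how a report with those fields could be built. It is not the script used for the attached JSON: the function name `missing_report` and its inputs (an iterable of `(user_id, filename)` pairs taken from the split, and a set of filenames that actually exist in the downloaded dataset) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def missing_report(
    split_rows: Iterable[Tuple[str, str]],
    available: Set[str],
) -> List[Dict[str, object]]:
    """Build one summary dict per client (hypothetical helper).

    split_rows: (user_id, filename) pairs read from the federated split.
    available:  filenames that exist in the original dataset on disk.
    """
    # Group the filenames listed in the split by client.
    per_client = defaultdict(list)
    for user_id, filename in split_rows:
        per_client[user_id].append(filename)

    report = []
    for user_id, filenames in sorted(per_client.items()):
        # A file is "missing" if the split lists it but the dataset lacks it.
        missing_files = sorted(f for f in filenames if f not in available)
        report.append({
            "user_id": user_id,
            "total": len(filenames),
            "found": len(filenames) - len(missing_files),
            "missing": len(missing_files),
            "missing_files": missing_files,
        })
    return report
```

The result can be written out with `json.dump(report, f)`; summing `missing` and `total` across clients gives the overall missing fraction.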

@hang-qi Sorry, I keep bugging you :)

hang-qi commented 2 years ago

@marcociccone Can you confirm if the correct full zip package was downloaded? You may clear the cache dir, which will allow the script to redownload the package.

https://github.com/tensorflow/federated/blob/v0.19.0/tensorflow_federated/python/simulation/datasets/gldv2.py#L36-L38
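
Clearing the cache amounts to deleting the cache directory so that the next run has nothing to reuse. A minimal sketch, assuming the cache lives in a directory whose path you pass in yourself (the helper name `clear_cache` is hypothetical and is not part of TensorFlow Federated):

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir: str) -> bool:
    """Delete cache_dir and everything in it (hypothetical helper).

    Returns True if a directory was removed, False if there was none.
    """
    path = Path(cache_dir)
    if not path.is_dir():
        # Nothing cached at this location, so nothing to clear.
        return False
    shutil.rmtree(path)
    return True
```

Double-check the path before running this, since `shutil.rmtree` removes the whole directory tree without asking for confirmation.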

marcociccone commented 2 years ago

Yes, I'm pretty sure that the split file is correct. I've used that script and debugged it heavily. The cache is fine: I rebuilt the dataset from scratch to check this.