nyuad-cai / MedFuse


Duplicate samples in dataset ‘partial_ehr_cxr’ #12

Closed ZhuoZHI-UCL closed 3 days ago

ZhuoZHI-UCL commented 1 year ago

The code

index = random.randint(0, len(self.ehr_files_unpaired)-1)

in datasets/fusion.py produces about 20% duplicate samples in the ‘partial_ehr_cxr’ dataset (the exact share depends on the random seed). If you want the dataset without duplicate samples, consider using this instead:

index = index - len(self.ehr_files_paired)
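
To illustrate the difference, here is a minimal standalone sketch rather than the actual MedFuse code. It assumes the dataset length is the number of paired files plus the number of unpaired files, and that any index past the paired range is meant to select an unpaired EHR sample. The function names and sizes are hypothetical.

```python
import random


def random_unpaired_indices(n_paired, n_unpaired, total, seed=0):
    """Mimic the reported behaviour: each index past the paired range
    draws an unpaired position at random (sampling with replacement)."""
    rng = random.Random(seed)
    return [rng.randint(0, n_unpaired - 1) for _ in range(n_paired, total)]


def offset_unpaired_indices(n_paired, total):
    """Suggested mapping: subtract the paired count, so each dataset index
    past the paired range maps to exactly one unpaired position."""
    return [index - n_paired for index in range(n_paired, total)]


# Hypothetical sizes, chosen only for illustration.
n_paired, n_unpaired = 100, 400
total = n_paired + n_unpaired  # assumed dataset length

random_idx = random_unpaired_indices(n_paired, n_unpaired, total)
offset_idx = offset_unpaired_indices(n_paired, total)

# Count how many draws repeat an unpaired position that was already used.
n_random_duplicates = len(random_idx) - len(set(random_idx))
n_offset_duplicates = len(offset_idx) - len(set(offset_idx))

print("random mapping duplicates:", n_random_duplicates)
print("offset mapping duplicates:", n_offset_duplicates)
```

With the offset mapping, every unpaired sample is visited exactly once per pass, and the result no longer depends on the random seed.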

Thanks.

ShazaElsharief commented 3 days ago

Thanks for the suggestion! The index mapping will be adjusted in future versions of MedFuse to avoid duplicates of unpaired EHR samples.