openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Filename collisions on osx hfs+ filesystem #52

Open jchodera opened 2 years ago

jchodera commented 2 years ago

The default OS X filesystem (HFS+) is case-insensitive. Because the files in des370k/SDFS/ are written out one per molecule and named by SMILES string, where letter case is significant (e.g. C1CCCCC1 vs. c1ccccc1), instead of as a single multi-molecule SDF, filenames collide and the repository cannot be properly checked out:

warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

  'des370k/SDFS/C1CCCCC1.sdf'
  'des370k/SDFS/c1ccccc1.sdf'
  'des370k/SDFS/C1CCCNC1.sdf'
  'des370k/SDFS/c1cccnc1.sdf'
  'des370k/SDFS/CC1CCCCC1.sdf'
  'des370k/SDFS/Cc1ccccc1.sdf'
  'des370k/SDFS/OC1CCCCC1.sdf'
  'des370k/SDFS/Oc1ccccc1.sdf'
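
For anyone wanting to check a file list for this problem before committing, here is a minimal sketch of one way to find paths that differ only by letter case. The function name `find_case_collisions` and the `CCO.sdf` entry are hypothetical, added only for illustration; the other paths are taken from the warning above. Folding with `str.casefold()` only approximates what a case-insensitive filesystem does, so treat the result as a hint rather than a guarantee.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Return groups of paths that become identical when case is ignored."""
    groups = defaultdict(list)
    for path in paths:
        # casefold() approximates case-insensitive filename comparison
        groups[path.casefold()].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Paths from the git warning, plus one hypothetical non-colliding file.
paths = [
    "des370k/SDFS/C1CCCCC1.sdf",
    "des370k/SDFS/c1ccccc1.sdf",
    "des370k/SDFS/CC1CCCCC1.sdf",
    "des370k/SDFS/Cc1ccccc1.sdf",
    "des370k/SDFS/CCO.sdf",  # hypothetical, for illustration only
]

collisions = find_case_collisions(paths)
for group in collisions:
    print(" <-> ".join(group))
```

The same function could be fed the output of `git ls-files` split into lines, which would allow the check to run on a case-sensitive machine before the files ever reach a case-insensitive one.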

As a resolution, I repeat my previous suggestion that this should be a single multi-molecule SDF file where all SDFs are collated and titled appropriately within the file.
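
A rough sketch of what that collation could look like, using only the standard library. It assumes each input file holds exactly one record, that the file stem is the SMILES string to use as the title, and that it runs on a case-sensitive filesystem where all the files are actually present. The function name `collate_sdfs`, the output filename, and the placeholder records in the demo are all hypothetical; this is not the script used to build the dataset.

```python
import tempfile
from pathlib import Path


def collate_sdfs(sdf_dir, out_path):
    """Concatenate single-record SDF files into one multi-molecule SDF.

    The first line of each record (the title line) is set to the file stem.
    out_path should live outside sdf_dir so it is not picked up as input.
    """
    count = 0
    with open(out_path, "w") as out:
        for sdf in sorted(Path(sdf_dir).glob("*.sdf")):
            lines = sdf.read_text().splitlines()
            # Drop trailing blank lines, then any existing record terminator.
            while lines and not lines[-1].strip():
                lines.pop()
            if lines and lines[-1].strip() == "$$$$":
                lines.pop()
            if not lines:
                continue  # skip empty files
            lines[0] = sdf.stem  # title the record with the SMILES string
            out.write("\n".join(lines) + "\n$$$$\n")
            count += 1
    return count


# Demo with two tiny placeholder records (not real molecule data).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "SDFS"
    src.mkdir()
    (src / "CCO.sdf").write_text("old title\n  placeholder\n\nM  END\n$$$$\n")
    (src / "CO.sdf").write_text("\n  placeholder\n\nM  END\n")
    combined = Path(tmp) / "combined.sdf"
    n_written = collate_sdfs(src, combined)
    combined_text = combined.read_text()

print(n_written)
print(combined_text)
```

Because the SMILES string moves from the filename into the title line, letter case no longer has to survive the filesystem.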

peastman commented 2 years ago

If you want to convert them to a single file, that would be fine.

peastman commented 2 years ago

One point to keep in mind, of course: a major purpose of this repository is to memorialize exactly how we created the dataset. If we replace the files and change the script accordingly, they will no longer match how we created the dataset. Granted, the existing script only works on Linux, but that's the script the dataset was created with.

jchodera commented 2 years ago

Of course. We've memorialized that in the release that was cut. That's the record of what we used to create the dataset.

Can we document this as a known bug in the release notes and avoid this practice in the future? If we intend to keep adding to this repo, we can also fix the bug; otherwise we will keep hitting this error.