Multiple Training / Validation Datasets🌟 [FEATURE]

tgmaxson commented 4 months ago

Is your feature request related to a problem? Please describe. It is a common problem we run across internally that we wish to train models on partial datasets which are kept in separate files as well as combined models. For example, imagine a simple case.

water-only.traj
water-NaCl.traj
water-KCl.traj

Ideally, we should be able to read these files independently in Nequip and sample from them as if they were one file. Making the file pairs quickly becomes unwieldy and expensive (in terms of space). Additionally, the cached datasets then also have to be regenerated and stored as well.

Describe the solution you'd like Simple extension to the dataloader syntax to accept a list of filenames, not just a single filename. The data would then be lumped together and used as normal. From the "ase" dataloader persepctive, this just involves appending multiple ase files together. As an alternative, ASE can also be extended to read multiple files potentially from a specialized filename, but I suspect that will get pushback from the devs (and not result in the proper caching on nequip's end).

dataset_file_name: /mnt/public/tgmaxson/datasets/7-4-24/train.traj # Single filename

or

dataset_file_name: # Multiple filenames
  - /mnt/public/tgmaxson/datasets/7-4-24/train.traj
  - /mnt/public/tgmaxson/datasets/7-2-24/train.traj

Linux-cpp-lisp commented 4 months ago

Hi @tgmaxson ,

The supported solution to this is to have all of the files concatenated into a single comprehensive dataset (which will get preprocessed once), always specify that dataset, and change the provided train_idcs and val_idcs run-to-run.

One change I'd be happy to make is to make providing these indexes more user friendly, such as by allowing them to be provided as a filename in some format, say, rather than directly as a list of integers in the YAML. Let me know what you think would be easy. (You would then be able to write one or two lines of Python to generate the index lists for different subsampling schemes with complete freedom to choose them as you like.)

Side note: if pre-processing time is significant for you, you can consider trying the new but not yet enabled by default support for matscipy neighborlists (NEQUIP_MATSCIPY_NL=true environment variable) available in 0.6.0. Or on develop, you can also try vesin (NEQUIP_NL=vesin or ase or matscipy.)

tgmaxson commented 4 months ago

When working in an iterative manner, it is common to need to add files, making the dataset be repeatedly processed fully. The concatenation operation itself is a potentially messy process and the goal should be to make things as non-error prone as possible for users working with the code.

I did not know about the train_idcs and val_idcs however. Reading these from a file would help greatly, but funny enough, my real solution would now be that a notation where you supply multiple filenames + multiple idc files would be most flexible and effective. This allows knowledge of each split to be directly represented without data duplication.

On Mon, Jul 8, 2024 at 11:07 PM Alby M. @.***> wrote:

Hi @tgmaxson https://github.com/tgmaxson ,

The supported solution to this is to have all of the files concatenated into a single comprehensive dataset (which will get preprocessed once), always specify that dataset, and change the provided train_idcs and val_idcs run-to-run.

One change I'd be happy to make is to make providing these indexes more user friendly, such as by allowing them to be provided as a filename in some format, say, rather than directly as a list of integers in the YAML. Let me know what you think would be easy. (You would then be able to write one or two lines of Python to generate the index lists for different subsampling schemes with complete freedom to choose them as you like.)

Side note: if pre-processing time is significant for you, you can consider trying the new but not yet enabled by default support for matscipy neighborlists (NEQUIP_MATSCIPY_NL=true environment variable) available in 0.6.0. Or on develop, you can also try vesin (NEQUIP_NL=vesin or ase or matscipy.)

— Reply to this email directly, view it on GitHub https://github.com/mir-group/nequip/issues/441#issuecomment-2216651481, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABSBTYENBRH4IBUKWHJTRELZLN43PAVCNFSM6AAAAABKRLX3M6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJWGY2TCNBYGE . You are receiving this because you were mentioned.Message ID: @.***>

Linux-cpp-lisp commented 4 months ago

This is a good point, thanks--let me think a little more about how this could integrate into other fixes to the dataset architecture.

mir-group / nequip

Multiple Training / Validation Datasets🌟 [FEATURE] #441