Aske-Rosted opened 4 months ago
Hey @Aske-Rosted - Thank you very much for adding this new feature. I added a hint on what might be causing the mypy challenges that you mentioned.
Given that this is new functionality, could you be persuaded to add a small unit test for it here? I added an example of what such a test could look like below:
```python
@pytest.mark.order(6)
@pytest.mark.parametrize("backend", ["sqlite"])
def test_dataset_config_dict_selection(backend: str) -> None:
    """Test constructing Dataset with multiple data paths."""
    # Arrange
    config_path = CONFIG_PATHS[backend]

    # Single dataset
    config = DatasetConfig.load(config_path)
    dataset = Dataset.from_config(config)

    # Construct ensemble of multiple datasets
    config_ensemble = DatasetConfig.load(config_path)
    config_ensemble.path = [config_ensemble.path, config_ensemble.path]
    ensemble_dataset = Dataset.from_config(config_ensemble)

    assert len(dataset) * 2 == len(ensemble_dataset)
```
Hey Rasmus, I looked into the suggestion for a bit, but since the automatic ensembling happens in the dataloader, and not in the `Dataset.from_config(config)` call, the current test suggestion does not exercise the ensembling code. This does, however, illuminate an issue with the current implementation of the automatic ensembling: you can have a dataset config file that works fine as long as you only feed it to a dataloader, but if you try to manually load the dataset from the config using the `Dataset` class, it will fail because the path is a list of files rather than a single file.
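To make the asymmetry concrete, here is a minimal stand-alone sketch of the situation (all names here are stand-ins, not graphnet's actual classes): a dataset constructor that expects a single path raises on a list, while dataloader-level ensembling handles the list before the dataset ever sees it.

```python
from typing import List, Union


class Dataset:
    """Stand-in for a dataset class that expects a single database path."""

    def __init__(self, path: str) -> None:
        if not isinstance(path, str):
            raise TypeError(f"expected a single path, got {type(path).__name__}")
        self._path = path


def make_dataloader(path: Union[str, List[str]]) -> List[Dataset]:
    """Stand-in for dataloader construction: the list of paths is
    unpacked here, so `Dataset` itself only ever sees single paths."""
    paths = path if isinstance(path, list) else [path]
    return [Dataset(p) for p in paths]


# A list of paths works when routed through the dataloader ...
loaders = make_dataloader(["a.db", "b.db"])

# ... but handing the same list straight to Dataset fails:
try:
    Dataset(["a.db", "b.db"])
except TypeError as err:
    print(err)  # expected a single path, got list
```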
@Aske-Rosted thanks for pointing this out. It's an essential function to be able to load datasets from their respective configuration files, so I think we need to make sure that this new usage doesn't break the core intention of the files.
I had a quick look at the code again, and I think that if you moved the new logic for handling multiple paths (with slight modifications) into `Dataset.from_config` (see https://github.com/Aske-Rosted/graphnet/blob/7baa60d2fd729e84214a9f1c5a1d0882cfeac14a/src/graphnet/data/dataset/dataset.py#L107), the problem would be solved. What do you think?
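A self-contained sketch of the dispatch this suggestion describes (the real `Dataset.from_config` and `EnsembleDataset` live in graphnet; the classes below are simplified stand-ins for illustration only):

```python
from typing import List, Union


class Dataset:
    """Stand-in for graphnet's Dataset; only the pieces needed here."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._items = [f"{path}:{i}" for i in range(3)]  # dummy events

    def __len__(self) -> int:
        return len(self._items)

    @classmethod
    def from_config(cls, config: "DatasetConfig") -> Union["Dataset", "EnsembleDataset"]:
        # If the config holds a list of paths, build one Dataset per
        # path and wrap them in an EnsembleDataset; a single path
        # behaves exactly as before.
        if isinstance(config.path, list):
            return EnsembleDataset([cls(path) for path in config.path])
        return cls(config.path)


class EnsembleDataset:
    """Stand-in for graphnet's EnsembleDataset (concatenated datasets)."""

    def __init__(self, datasets: List[Dataset]) -> None:
        self._datasets = datasets

    def __len__(self) -> int:
        return sum(len(d) for d in self._datasets)


class DatasetConfig:
    """Stand-in config carrying either one path or a list of paths."""

    def __init__(self, path: Union[str, List[str]]) -> None:
        self.path = path


single = Dataset.from_config(DatasetConfig("a.db"))
ensemble = Dataset.from_config(DatasetConfig(["a.db", "b.db"]))
assert len(ensemble) == 2 * len(single)
```

With the branch inside `from_config`, the earlier test suggestion would exercise the ensembling code, since both the dataloader and direct config loading go through the same entry point.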
I agree that is probably the more appropriate location to handle multiple file paths. I will have a look at it.
The following is an implementation of the feature mentioned in #720.
I had to include a `# type: ignore` on line 68, which I personally was not too happy with, but I did not manage to satisfy the pre-commit hooks without it.
The solution uses the pre-existing `EnsembleDataset` class to automatically combine datasets from a list of database paths.
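For readers unfamiliar with the ensembling pattern, here is a rough sketch of how a concatenating ensemble maps a flat index onto its member datasets (these classes are illustrative stand-ins, not graphnet's actual implementation):

```python
from typing import Iterable, List, Tuple


class SQLiteDataset:
    """Stand-in for a dataset backed by a single database file."""

    def __init__(self, db_path: str) -> None:
        self._events = [(db_path, i) for i in range(4)]  # dummy rows

    def __len__(self) -> int:
        return len(self._events)

    def __getitem__(self, idx: int) -> Tuple[str, int]:
        return self._events[idx]


class EnsembleDataset:
    """Stand-in mimicking a concatenation-style ensemble: indices run
    through the first dataset, then the second, and so on."""

    def __init__(self, datasets: Iterable[SQLiteDataset]) -> None:
        self._datasets: List[SQLiteDataset] = list(datasets)

    def __len__(self) -> int:
        return sum(len(d) for d in self._datasets)

    def __getitem__(self, idx: int) -> Tuple[str, int]:
        # Walk the member datasets, subtracting each length until the
        # flat index falls inside one of them.
        for dataset in self._datasets:
            if idx < len(dataset):
                return dataset[idx]
            idx -= len(dataset)
        raise IndexError(idx)


ds = EnsembleDataset(SQLiteDataset(p) for p in ["train_a.db", "train_b.db"])
```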