Automatic ensemble new - Githubissues

Aske-Rosted commented 4 months ago

the following is an implementation of the feature mentioned in #720.

I had to include a type ignore on line 68. which I personally was not too happy with, but I did not manage to satisfy the pre-commit hooks without it.

The solution uses the pre-existing EnsembleDataset class to automatically combine databases from a list of databases.

Aske-Rosted commented 4 months ago

Hey @Aske-Rosted - Thank you very much for adding this new feature. I added a hint on what might be causing the mypy challenges that you mentioned.

Given that this is new functionality, could you be persuaded to add a small unit test for this here? I added an example of how such a test could look like below:
@pytest.mark.order(6)
@pytest.mark.parametrize("backend", ["sqlite"])
def test_dataset_config_dict_selection(backend: str) -> None:
    """Test constructing Dataset with multiple data paths."""
    # Arrange
    config_path = CONFIG_PATHS[backend]

    # Single dataset
    config = DatasetConfig.load(config_path)
    dataset = Dataset.from_config(config)
    # Construct multiple datasets
    config_ensemble = DatasetConfig.load(config_path)
    config_ensemble.path = [config_ensemble.path, config_ensemble.path]

    ensemble_dataset = Dataset.from_config(config)

    assert len(dataset)*2 == len(ensemble_dataset)

Hey Rasmus looked into the suggestion for a bit but since this automatic ensembling is happing during the dataloader, and not in the dataset.from_config(config) call the current test suggestion does not test the ensembling code. This does however illuminate a bit of an issue with the current implementation of the automatic ensembling. That is you can have a dataset config file that works fine as long as you only feed it to a dataloader, however if you try to manually load the dataset from the config using the dataset class it will fail due to the list of files rather than a single file...

RasmusOrsoe commented 4 months ago

@Aske-Rosted thanks for pointing this out. It's an essential function to be able to load datasets from their respective configuration files, so I think we need to make sure that this new usage doesn't break the core intention of the files.

I had a quick look at the code again, and I think if you just moved (with slight modifications) the new logic to handle multiple paths into Dataset.from_config (see https://github.com/Aske-Rosted/graphnet/blob/7baa60d2fd729e84214a9f1c5a1d0882cfeac14a/src/graphnet/data/dataset/dataset.py#L107) the problem would be solved. What do you think?

Aske-Rosted commented 4 months ago

@Aske-Rosted thanks for pointing this out. It's an essential function to be able to load datasets from their respective configuration files, so I think we need to make sure that this new usage doesn't break the core intention of the files.

I had a quick look at the code again, and I think if you just moved (with slight modifications) the new logic to handle multiple paths into Dataset.from_config (see https://github.com/Aske-Rosted/graphnet/blob/7baa60d2fd729e84214a9f1c5a1d0882cfeac14a/src/graphnet/data/dataset/dataset.py#L107) the problem would be solved. What do you think?

I agree that is probably the more appropriate location to handle multiple file paths. I will have a look at it.

graphnet-team / graphnet

Automatic ensemble new #721