graphnet-team / graphnet

A Deep learning library for neutrino telescopes
https://graphnet-team.github.io/graphnet/
Apache License 2.0

Make `DataConverter` accept a list of files to convert instead of only a directory #699

Open RasmusOrsoe opened 2 months ago

RasmusOrsoe commented 2 months ago

Is your feature request related to a problem? Please describe. DataConverter accepts a single argument, input_dir: Union[str, List[str]], which points to one or more directories. These directories are searched using the GraphNeTFileReader.find_files() method to create a list of file paths for conversion.

This design suits a workflow in which the data files of interest are copied to a separate directory and every file in that directory is intended for conversion.

There are use cases where converting every file in a directory is unwanted behavior.

Describe the solution you'd like Make DataConverter accept a list of user-generated file paths for conversion, instead of assuming all files in the input_dir: Union[str, List[str]] should be converted.

We rename input_dir: Union[str, List[str]] to input: Union[str, List[str]] (a list of files and/or directories to convert)

and then slightly adjust DataConverter.__call__ from

    @final
    def __call__(self, input_dir: Union[str, List[str]]) -> None:
        """Extract data from files in `input_dir` and save to disk.

        Args:
            input_dir: A directory that contains the input files.
                        The directory will be searched recursively for files
                        matching the file extension.
        """
        # Get the file reader to produce a list of input files
        # in the directory
        input_files = self._file_reader.find_files(path=input_dir)
        self._launch_jobs(input_files=input_files)

to

from os.path import isdir, isfile

    @final
    def __call__(self, input: Union[str, List[str]]) -> None:
        """Extract data from files in `input` and save to disk.

        Args:
            input: A file path or directory, or a list of file paths and/or
                directories, selected for conversion. Directories are
                searched recursively, and all files in them will be converted.
        """
        # Allow a single path to be passed as a plain string; iterating
        # over a bare string would yield its characters, not paths
        if isinstance(input, str):
            input = [input]

        # Split the input into explicit files and directories to search
        input_files = [path for path in input if isfile(path)]
        directories_to_search = [path for path in input if isdir(path)]

        # Let the file reader find the files inside the directories
        files_from_directories = self._file_reader.find_files(path=directories_to_search)
        input_files.extend(files_from_directories)
        self._launch_jobs(input_files=input_files)
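
The path-splitting step in the proposal can be tried in isolation. The sketch below is only an illustration and not part of GraphNeT: `split_paths` is a hypothetical helper name, and collecting paths that are neither a file nor a directory into a third list is an added assumption, since the proposal above would drop such paths silently.

```python
import os
from typing import List, Tuple, Union


def split_paths(
    paths: Union[str, List[str]],
) -> Tuple[List[str], List[str], List[str]]:
    """Sort paths into files, directories, and paths that are neither.

    `split_paths` is a hypothetical helper, not a GraphNeT function.
    """
    # A bare string is one path, not a sequence of single characters.
    if isinstance(paths, str):
        paths = [paths]

    files = [p for p in paths if os.path.isfile(p)]
    directories = [p for p in paths if os.path.isdir(p)]
    # Whatever is left does not exist or is not a regular file/directory.
    unknown = [
        p for p in paths if not (os.path.isfile(p) or os.path.isdir(p))
    ]
    return files, directories, unknown
```

Returning the unmatched paths lets the caller decide whether a mistyped path should raise an error, log a warning, or be ignored.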

Additional context Multiple people have asked for this feature.