lokijuhy / data-traffic-control

Whhrrrr... Voooooosh... That's the sound of your data coming and going exactly where it belongs
MIT License

Saving error #8

Open matteobe opened 4 years ago

matteobe commented 4 years ago

Saving to a subdirectory of a folder registered with DataManager does not work.

Target folder: /opt/data/data/processed

Notebook folder: /opt/data/notebooks

DataManager ls output:

data/
    formatted/
        1 csv items
    old/
        5 mixed items
    original/
        5 mixed items
    processed/
        1 csv items

Code:

# Data management (only run first time, then it's automatic)
# DataManager.register_project('schupbach', '/opt/data/data/')

# Load data manager
dm = DataManager('schupbach')

# Import SpO2 database
spo2 = dm['original']['SPO2.csv'].load(sep=',',
                                       nrows=100000,
                                       encoding='utf-16',
                                       error_bad_lines=True)

# Preformat the dataset
spo2 = ingestion.preformat(spo2)

dm['formatted'].save(spo2, 'test.csv', ingestion.preformat)

Stack trace:

---------------------------------------------------------------------------
InvalidGitRepositoryError                 Traceback (most recent call last)
<ipython-input-3-2a9f3b79d4da> in <module>
     14 spo2 = ingestion.preformat(spo2)
     15 
---> 16 dm['formatted'].save(spo2, 'test.csv', ingestion.preformat)

~/code/tools/data-traffic-control/datatc/data_directory.py in save(self, data, file_name, transformer_func, enforce_clean_git, get_git_hash_from, **kwargs)
     86             self.save_file(data, file_name, **kwargs)
     87         else:
---> 88             self.transform_and_save(data, transformer_func, file_name, enforce_clean_git, get_git_hash_from, **kwargs)
     89 
     90     def save_file(self, data: Any, file_name: str, **kwargs) -> None:

~/code/tools/data-traffic-control/datatc/data_directory.py in transform_and_save(self, data, transformer_func, file_name, enforce_clean_git, get_git_hash_from, **kwargs)
     95     def transform_and_save(self, data: Any, transformer_func: Callable, file_name: str, enforce_clean_git=True,
     96                            get_git_hash_from: Any = None, **kwargs) -> None:
---> 97         new_transform_dir_path = TransformedDataInterface.save(data, transformer_func, parent_path=self.path,
     98                                                                file_name=file_name, enforce_clean_git=enforce_clean_git,
     99                                                                get_git_hash_from=get_git_hash_from, **kwargs)

~/code/tools/data-traffic-control/datatc/data_transformer.py in save(cls, data, transformer_func, parent_path, file_name, enforce_clean_git, get_git_hash_from, **kwargs)
    106         if enforce_clean_git:
    107             if transformer_func_in_repo:
--> 108                 check_for_uncommitted_git_changes_at_path(transformer_func_file_repo_path)
    109             else:
    110                 raise RuntimeError('`transformer_func` is not tracked in a git repo.'

~/code/tools/data-traffic-control/datatc/git_utilities.py in check_for_uncommitted_git_changes_at_path(repo_path)
     63         True: uncommitted changes found. Repo is not valid.
     64     """
---> 65     repo = Repo(repo_path, search_parent_directories=True)
     66 
     67     try:

/usr/local/lib/python3.8/dist-packages/git/repo/base.py in __init__(self, path, odbt, search_parent_directories, expand_vars)
    179 
    180         if self.git_dir is None:
--> 181             raise InvalidGitRepositoryError(epath)
    182 
    183         self._bare = False

InvalidGitRepositoryError: /opt/data/notebooks

matteobe commented 4 years ago

I also tried calling dm.reload() and repeating the save; it still fails with the same error.

matteobe commented 4 years ago

Output of dm.ls(True):

data/
    formatted/
        SPO2.csv
    old/
        Diagnosis.csv
        PatientCases.csv
        SAPS.csv
        SPO2.csv
        datainfo.txt
    original/
        Diagnosis.csv
        PatientCases.csv
        SAPS.csv
        SPO2.csv
        datainfo.txt
    processed/
        SPO2.csv

lokijuhy commented 4 years ago

It looks like I need to add better error handling and clearer error messages for when functions are defined outside of git repos.

Where is ingestion.preformat defined? The error says that the file where datatc thinks ingestion.preformat is defined is not inside a valid git repo. datatc thinks the function is defined somewhere inside /opt/data/notebooks. Is that correct? If not, then I have a second problem 😉
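To check where a function is defined and whether that location sits inside a git repo, something like the sketch below may help. It is not datatc's implementation: the stack trace shows datatc relying on GitPython's `Repo(..., search_parent_directories=True)`, whereas this version uses only the standard library and simply looks for a `.git` entry in the parent directories. The helper names `find_defining_file` and `find_git_root` are made up for illustration.

```python
import inspect
from pathlib import Path
from typing import Callable, Optional


def find_defining_file(func: Callable) -> Optional[Path]:
    """Return the source file Python associates with `func`, or None."""
    try:
        source = inspect.getsourcefile(func)
    except TypeError:
        # Built-ins and C extensions have no Python source file.
        return None
    return Path(source).resolve() if source else None


def find_git_root(start: Path) -> Optional[Path]:
    """Walk up from `start`; return the first directory containing `.git`."""
    start = start.resolve()
    current = start if start.is_dir() else start.parent
    for candidate in (current, *current.parents):
        if (candidate / ".git").exists():
            return candidate
    # Hypothetical stand-in for the InvalidGitRepositoryError case:
    # no enclosing repo was found.
    return None
```

Calling `find_defining_file(ingestion.preformat)` and, if it returns a path, passing that path to `find_git_root` would show which directory the function resolves to. A `None` from `find_git_root` would be consistent with the `InvalidGitRepositoryError` above.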

FYI, if you want to explicitly tell datatc which package to get a git hash from, instead of letting it track down where the transform function is defined, you can use .save( ... , get_git_hash_from=schupbach) (if your Python package is called schupbach).

lokijuhy commented 3 years ago

@matteobe Can I close this issue?