hackingmaterials / automatminer

An automatic engine for predicting materials properties.

TypeError: can't pickle _thread.RLock objects when saving a pipe #253

Closed · janosh closed 5 years ago

janosh commented 5 years ago

Just updated to automatminer@2019.10.14 and I'm now getting a TypeError with the following stack trace when attempting to save a pipe.

TypeError                                 Traceback (most recent call last)
      1 pipe.summarize(log_dir + "pipe_summary.yml")
----> 2 pipe.save(log_dir + "mat.pipe")
      3 
      4 # #%%
      5 # pipe.digest(log_dir + "digest.txt")

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/site-packages/automatminer/utils/pkg.py in wrapper(*args, **kwargs)
     80                                  "".format(args[0].__class__.__name__))
     81         else:
---> 82             return func(*args, **kwargs)
     83 
     84     return wrapper

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/site-packages/automatminer/pipeline.py in save(self, filename)
    415         self.learner.serialize()
    416 
--> 417         temp_logger = copy.deepcopy(self._logger)
    418         loggables = [
    419             self, self.learner, self.reducer, self.cleaner, self.autofeaturizer

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    178                     y = x
    179                 else:
--> 180                     y = _reconstruct(x, memo, *rv)
    181 
    182     # If is its own copy, don't memoize.

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
    278     if state is not None:
    279         if deep:
--> 280             state = deepcopy(state, memo)
    281         if hasattr(y, '__setstate__'):
    282             y.__setstate__(state)

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    148     copier = _deepcopy_dispatch.get(cls)
    149     if copier:
--> 150         y = copier(x, memo)
    151     else:
    152         try:

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
    238     memo[id(x)] = y
    239     for key, value in x.items():
--> 240         y[deepcopy(key, memo)] = deepcopy(value, memo)
    241     return y
    242 d[dict] = _deepcopy_dict

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    148     copier = _deepcopy_dispatch.get(cls)
    149     if copier:
--> 150         y = copier(x, memo)
    151     else:
    152         try:

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in _deepcopy_list(x, memo, deepcopy)
    213     append = y.append
    214     for a in x:
--> 215         append(deepcopy(a, memo))
    216     return y
    217 d[list] = _deepcopy_list

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    178                     y = x
    179                 else:
--> 180                     y = _reconstruct(x, memo, *rv)
    181 
    182     # If is its own copy, don't memoize.

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
    278     if state is not None:
    279         if deep:
--> 280             state = deepcopy(state, memo)
    281         if hasattr(y, '__setstate__'):
    282             y.__setstate__(state)

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    148     copier = _deepcopy_dispatch.get(cls)
    149     if copier:
--> 150         y = copier(x, memo)
    151     else:
    152         try:

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
    238     memo[id(x)] = y
    239     for key, value in x.items():
--> 240         y[deepcopy(key, memo)] = deepcopy(value, memo)
    241     return y
    242 d[dict] = _deepcopy_dict

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
    167                     reductor = getattr(x, "__reduce_ex__", None)
    168                     if reductor:
--> 169                         rv = reductor(4)
    170                     else:
    171                         reductor = getattr(x, "__reduce__", None)

TypeError: can't pickle _thread.RLock objects
ardunn commented 5 years ago

Thanks for reporting. Do you happen to be using a custom logger? And could you post the code you used for instantiating the MatPipe (i.e., any code relevant to defining its state)?

The problem, quite ironically, is that the offending line in save exists to temporarily remove the logger objects before pickling, because loggers aren't picklable. But creating that temporary copy via copy.deepcopy itself requires the object to be picklable...
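
A minimal sketch of the failure mode, using a hypothetical Pipe class rather than automatminer's actual one:

import copy
import logging

class Pipe:
    """Hypothetical stand-in for an object that stores its own logger."""

    def __init__(self):
        self._logger = logging.getLogger("demo")
        self._logger.addHandler(logging.StreamHandler())  # handlers hold RLocks

pipe = Pipe()
try:
    copy.deepcopy(pipe)  # recurses into the handler's _thread.RLock
except TypeError as exc:
    print(exc)  # "can't pickle _thread.RLock objects" on Python 3.6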

This is on my todo list to fix.

ardunn commented 5 years ago

@utf has a fix for this, inspired by AMSET?

utf commented 5 years ago

Instead of passing around logger objects, automatminer classes should just use the module-level logger. This should be defined at the top of each file (that wants to use logging) as:

import logging

logger = logging.getLogger(__name__)

The module-level logger is then used directly inside each class as normal, i.e.:

class CustomClass(object):

    def __init__(self):
        logger.info("Some log message")

The decision of how to print/write log messages can then be handled separately by adding handlers to the root "automatminer" logger, without having to worry about passing around a custom logger. Using the module-level logger avoids the need to pickle anything, as the logger is defined in the scope of the module and is not an instance variable.
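
For illustration (this is standard library behavior, nothing automatminer-specific), records logged to a module-level child logger propagate up to the "automatminer" logger, so handlers attached there catch everything:

import logging

# Attach handlers once, to the root "automatminer" logger.
root_logger = logging.getLogger("automatminer")
root_logger.setLevel(logging.INFO)
root_logger.addHandler(logging.StreamHandler())

# Module-level loggers like "automatminer.pipeline" need no handlers of
# their own; their records propagate up and are emitted by the parent.
logging.getLogger("automatminer.pipeline").info("handled by the parent logger")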

You can have the automatminer classes add handlers to the automatminer logger by default if you always want log messages to be displayed. Note that this is not a very user-friendly way of treating logging: generally, logging is something you opt in to rather than have to opt out of. I can definitely see the benefits of having logging on by default, though.
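
The usual opt-in pattern for libraries, sketched here for reference (not what automatminer currently does), is to attach only a NullHandler by default:

import logging

# Library side: a NullHandler keeps the library silent (and free of
# "no handlers could be found" warnings) until the user opts in.
logging.getLogger("automatminer").addHandler(logging.NullHandler())

# User side: opting in is just attaching a real handler.
logging.getLogger("automatminer").addHandler(logging.StreamHandler())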

You can use an updated version of the current initialize_logger function for adding the handlers. Something like:

import datetime
import logging
import os
import sys


def initialize_logger(log_dir='.', level=None) -> logging.Logger:
    """Initialize the default logger with stdout and file handlers.

    Args:
        log_dir (str): Path to the folder where the log file will be written.
        level (int): The log level. For example logging.DEBUG.

    Returns:
        (Logger): A logging instance with customized formatter and handlers.
    """
    level = level or logging.INFO

    logger = logging.getLogger("automatminer")

    logger.handlers = []  # reset logging handlers if they already exist

    formatter = logging.Formatter(fmt='%(asctime)s %(levelname)-8s %(message)s',
                                  datefmt='%Y-%m-%d %H:%M:%S')

    logpath = os.path.join(log_dir, "automatminer")
    if os.path.exists(logpath + ".log"):
        logpath += "_" + datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    logpath += ".log"

    handler = logging.FileHandler(logpath, mode='w')
    handler.setFormatter(formatter)
    screen_handler = logging.StreamHandler(stream=sys.stdout)
    screen_handler.setFormatter(formatter)

    logger.setLevel(level)
    logger.addHandler(screen_handler)
    logger.addHandler(handler)
    return logger

Once the handlers are added to the root "automatminer" logger, you don't need to pass around the logger object as the logger and handlers are stateful.
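
Hypothetical usage of the function above:

import logging

# Configure handlers once at application startup ...
initialize_logger(log_dir=".", level=logging.DEBUG)

# ... and every module-level logger under the "automatminer" namespace
# then writes through them, with no logger object passed around.
logging.getLogger("automatminer.pipeline").debug("goes to file and stdout")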

ardunn commented 5 years ago

So is the preferred way to use a custom logger to import the root logger from automatminer and add handlers to it?

from automatminer import logger, MatPipe

logger.setLevel(whatever_level)
logger.addHandler(my_file_handler)
# etc...

pipe = MatPipe()
pipe.fit(my_df, target)
# Logs with whatever_level to my_file_handler?

After wrapping my head around this it is beginning to make a lot more sense to me.

utf commented 5 years ago

The point is that the handlers you add to a logger are stateful. Once you add them they won't get removed unless you manually remove them. So if you add custom handlers to the "automatminer" logger, these handlers will be used any time a logging call is made inside any automatminer class.

The technique to use is the first one you mentioned, but it can effectively be simplified to:

import logging

from automatminer import MatPipe

logger = logging.getLogger("automatminer")
logger.setLevel(whatever_level)
logger.addHandler(my_file_handler)

pipe = MatPipe()
pipe.fit(my_df, target)
ardunn commented 5 years ago

@utf when logging with AMSET, do you also end up with empty log files left in every imported module's directory? For example, I now get a bunch of automatminer.pipeline.log files in the top-level directory and subdirectories.

utf commented 5 years ago

No, that doesn't happen in AMSET. Are you calling initialize_logger() in each module?

ardunn commented 5 years ago

No, it's only being called by the top-level class!

Edit: nvm, fixed
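
For the record, a plausible reconstruction of that bug (hypothetical, since the offending code isn't shown in this thread): a FileHandler opened at import time creates its (still empty) log file immediately, so doing this per module litters one file per imported module:

import logging

# Anti-pattern: executed at import time in each module, this immediately
# creates an empty "<module name>.log" in the current working directory,
# even if nothing is ever logged.
logger = logging.getLogger(__name__)
logger.addHandler(logging.FileHandler(__name__ + ".log", mode="w"))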

ardunn commented 5 years ago

@janosh this should be fixed now. You can log to a custom handler using the snippet @utf posted above.

janosh commented 5 years ago

@ardunn Sorry for the radio silence. I was traveling the last few days.

To answer your earlier question, I'm not using a custom logger.

Since there's no new release yet, I pulled all the upstream changes to master and pip installed my local clone of automatminer to try out the fix.

Now when attempting to run

pipe = amm.MatPipe()
pipe.fit(train_df, "zT")

I get the error

2019-10-22 13:20:22 INFO     Problem type is: regression
2019-10-22 13:20:22 INFO     Fitting MatPipe pipeline to data.
2019-10-22 13:20:22 INFO     AutoFeaturizer: Starting fitting.
2019-10-22 13:20:22 INFO     AutoFeaturizer: Compositions detected as strings. Attempting conversion to Composition objects...
HBox(children=(IntProgress(value=0, description='StrToComposition', max=717, style=ProgressStyle(description_w…
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/Projects/thermo/src/notebooks/screening/automatminer.py in <module>
      6 pipe = amm.MatPipe()
      7 # pipe = amm.MatPipe(**pipe_config)
----> 8 pipe.fit(train, "zT")

~/Projects/automatminer/automatminer/utils/pkg.py in wrapper(*args, **kwargs)
    102     def wrapper(*args, **kwargs):
    103         args[0].is_fit = False
--> 104         result = func(*args, **kwargs)
    105         args[0].is_fit = True
    106         return result

~/Projects/automatminer/automatminer/pipeline.py in fit(self, df, target)
    177         # Fit transformers on training data
    178         logger.info("Fitting MatPipe pipeline to data.")
--> 179         df = self.autofeaturizer.fit_transform(df, target)
    180         df = self.cleaner.fit_transform(df, target)
    181         df = self.reducer.fit_transform(df, target)

~/Projects/automatminer/automatminer/base.py in fit_transform(self, df, target)
     64 
     65         """
---> 66         return self.fit(df, target).transform(df, target)
     67 
     68     @property

~/Projects/automatminer/automatminer/utils/log.py in wrapper(*args, **kwargs)
     94             self = args[0]
     95             logger.info("{}Starting {}.".format(self._log_prefix, operation))
---> 96             result = meth(*args, **kwargs)
     97             logger.info("{}Finished {}.".format(self._log_prefix, operation))
     98             return result

~/Projects/automatminer/automatminer/utils/pkg.py in wrapper(*args, **kwargs)
    102     def wrapper(*args, **kwargs):
    103         args[0].is_fit = False
--> 104         result = func(*args, **kwargs)
    105         args[0].is_fit = True
    106         return result

~/Projects/automatminer/automatminer/featurization/core.py in fit(self, df, target)
    308 
    309             if featurizer_type in df.columns:
--> 310                 df = self._tidy_column(df, featurizer_type)
    311 
    312                 # Remove invalid featurizers by looking at valid_fraction

~/Projects/automatminer/automatminer/featurization/core.py in _tidy_column(self, df, featurizer_type)
    562                     multiindex=self.multiindex,
    563                     ignore_errors=True,
--> 564                     inplace=False,
    565                 )
    566 

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/site-packages/matminer/featurizers/conversions.py in featurize_dataframe(self, df, col_id, **kwargs)
     87             self._target_col_id = col_id
     88 
---> 89         return super().featurize_dataframe(df, col_id, **kwargs)
     90 
     91     def featurize(self, *x):

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/site-packages/matminer/featurizers/base.py in featurize_dataframe(self, df, col_id, ignore_errors, return_errors, inplace, multiindex, pbar)
    338                                        ignore_errors=ignore_errors,
    339                                        return_errors=return_errors,
--> 340                                        pbar=pbar)
    341 
    342         # Make sure the dataframe can handle multiindices

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/site-packages/matminer/featurizers/base.py in featurize_many(self, entries, ignore_errors, return_errors, pbar)
    461                                            return_errors=return_errors,
    462                                            pbar=pbar)
--> 463             with Pool(self.n_jobs) as p:
    464                 func = partial(self.featurize_wrapper,
    465                                return_errors=return_errors,

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/multiprocessing/context.py in Pool(self, processes, initializer, initargs, maxtasksperchild)
    117         from .pool import Pool
    118         return Pool(processes, initializer, initargs, maxtasksperchild,
--> 119                     context=self.get_context())
    120 
    121     def RawValue(self, typecode_or_type, *args):

/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/multiprocessing/pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
    166             processes = os.cpu_count() or 1
    167         if processes < 1:
--> 168             raise ValueError("Number of processes must be at least 1")
    169 
    170         if initializer is not None and not callable(initializer):

ValueError: Number of processes must be at least 1

Not sure if this is due to some error on my part during the installation...

ardunn commented 5 years ago

Hey @janosh, what version of Python are you using?

I made some changes recently trying to fix some TPOT issues, and they seem to have caused some unexpected behavior; we're working to fix it right now.

janosh commented 5 years ago

I'm running Python 3.6.7.

ardunn commented 5 years ago

Ah ok, this might be the problem, as all the testing is only set up to use 3.7 (and our dev team exclusively uses 3.7+). If possible, I'd try making an env with 3.7+ and installing Automatminer there.

I'll add an issue to add CI for 3.6, but I'm not sure it will be a priority in testing.

ardunn commented 5 years ago

Also @janosh, on a separate note: in my experience trying to data-mine thermoelectrics, predicting zT directly will almost always perform poorly, regardless of the model. I'd recommend predicting k_lattice and (for example) power factor independently.

janosh commented 5 years ago

@ardunn Thanks for the advice! I'm doing both actually and comparing performance.

ardunn commented 4 years ago

@janosh your error here should be fixed as of v1.0.0.20191110. I don't think it had to do with the Python version; it was a multiprocessing error which didn't like that Pool was getting -1 as an argument instead of None (which tells it to simply use all cores).
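
For reference, a quick sketch of the difference (standard library behavior):

import os
from multiprocessing import Pool

if __name__ == "__main__":
    # processes=None falls back to os.cpu_count() workers.
    with Pool(processes=None) as pool:
        print("workers:", os.cpu_count())

    # Anything below 1 is rejected before any worker starts.
    try:
        Pool(processes=-1)
    except ValueError as exc:
        print(exc)  # Number of processes must be at least 1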

janosh commented 4 years ago

@ardunn I think 651e35b introduced a new error. I'm now getting

TypeError                                 Traceback (most recent call last)
~/thermo/src/notebooks/screening/automatminer.py in <module>
      7 pipe = amm.MatPipe(**pipe_config)
----> 8 pipe.fit(train, "zT")

~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/utils/pkg.py in wrapper(*args, **kwargs)
    102     def wrapper(*args, **kwargs):
    103         args[0].is_fit = False
--> 104         result = func(*args, **kwargs)
    105         args[0].is_fit = True
    106         return result

~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/pipeline.py in fit(self, df, target)
    183         df = self.cleaner.fit_transform(df, target)
    184         df = self.reducer.fit_transform(df, target)
--> 185         self.learner.fit(df, target)
    186         logger.info("MatPipe successfully fit.")
    187         self.post_fit_df = df

~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/utils/log.py in wrapper(*args, **kwargs)
     94             self = args[0]
     95             logger.info("{}Starting {}.".format(self._log_prefix, operation))
---> 96             result = meth(*args, **kwargs)
     97             logger.info("{}Finished {}.".format(self._log_prefix, operation))
     98             return result

~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/utils/pkg.py in wrapper(*args, **kwargs)
    102     def wrapper(*args, **kwargs):
    103         args[0].is_fit = False
--> 104         result = func(*args, **kwargs)
    105         args[0].is_fit = True
    106         return result

~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/automl/adaptors.py in fit(self, df, target, **fit_kwargs)
    135         self._features = df.drop(columns=target).columns.tolist()
    136         self._fitted_target = target
--> 137         self._backend = self._backend.fit(X, y, **fit_kwargs)
    138         return self
    139 

~/miniconda3/envs/thermo/lib/python3.6/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    627 
    628         """
--> 629         self._fit_init()
    630         features, target = self._check_dataset(features, target, sample_weight)
    631 

~/miniconda3/envs/thermo/lib/python3.6/site-packages/tpot/base.py in _fit_init(self)
    580                 'The value 0 of n_jobs is invalid.'
    581             )
--> 582         elif self.n_jobs < 0:
    583             self._n_jobs = cpu_count() + 1 + self.n_jobs
    584         else:

TypeError: '<' not supported between instances of 'NoneType' and 'int'
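
The last frame distills to a two-line sketch (assuming n_jobs now arrives as None after the earlier change away from -1):

# TPOT's _fit_init evaluates `self.n_jobs < 0`, which raises when
# n_jobs is None rather than an int.
n_jobs = None
n_jobs < 0  # TypeError: '<' not supported between 'NoneType' and 'int'
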
ardunn commented 4 years ago

Hey @janosh, thanks for the report. I was able to reproduce the issue and pushed a fix. I am a bit less likely to see problems with the presets, since most of the workflows I have been running with automatminer are "custom pipelines" where I specify all the relevant config manually.

tl;dr: much appreciated :)