Closed: janosh closed this issue 5 years ago.
Thanks for reporting. Do you happen to be using a custom logger? And could you post the code you used for instantiating the mat pipe (i.e., any code relevant to defining its state)?
The problem, quite ironically, is that this line in save
temporarily removes the logger objects before pickling (loggers aren't picklable), but creating the temporary copy requires the object to be picklable in the first place.
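To make that failure mode concrete, here is a minimal, hypothetical sketch (the class and attribute names are invented, not automatminer's actual code): an instance holding a handler with an open stream can't be pickled directly, and the usual workaround is to drop and recreate the unpicklable attribute via `__getstate__`/`__setstate__`:

```python
import logging
import pickle
import sys


class Pipe:
    """Toy stand-in for a class that keeps an unpicklable object as state.

    A handler holding an open stream (sys.stdout here) cannot be pickled,
    so pickling the instance as-is would raise a TypeError.
    """

    def __init__(self):
        self.handler = logging.StreamHandler(stream=sys.stdout)
        self.data = [1, 2, 3]

    def __getstate__(self):
        # Drop the unpicklable handler before pickling ...
        state = self.__dict__.copy()
        state["handler"] = None
        return state

    def __setstate__(self, state):
        # ... and recreate it on unpickling.
        self.__dict__.update(state)
        self.handler = logging.StreamHandler(stream=sys.stdout)


pipe = Pipe()
restored = pickle.loads(pickle.dumps(pipe))  # round-trips cleanly
```

The catch described above is that making a temporary copy of the object before stripping the logger runs into the same pickling problem the copy was meant to avoid.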
This is on my todo list to fix.
@utf has a fix for this, inspired by AMSET?
Instead of passing around logger objects, automatminer classes should just use the module level logger. This should be defined at the top of each file (that wants to use logging) as:
import logging
logger = logging.getLogger(__name__)
The module level logger is then used directly inside each class as normal, i.e.:
class CustomClass(object):
    def __init__(self):
        logger.info("Some log message")
The decision of how to print/write log messages can then be handled separately by adding handlers to the root "automatminer"
logger, without having to worry about passing around a custom logger. Using the module-level logger avoids the need to pickle anything, as the logger is defined in the scope of the module and is not an instance variable.
You can have the automatminer classes add handlers to the automatminer logger by default if you always want log messages to be displayed. Note that this is not a very user-friendly way of treating logging: generally, logging is something you opt in to rather than have to opt out of. I can definitely see the benefits of having logging on by default, though.
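The standard opt-in pattern for libraries is to attach a NullHandler so imports stay silent until the application configures logging. A sketch (using a hypothetical package name "mylib", not automatminer's actual setup):

```python
import logging
import sys

# Library side: a NullHandler keeps "import mylib" silent by default.
# ("mylib" is a hypothetical package name.)
logging.getLogger("mylib").addHandler(logging.NullHandler())

# Application side: opt in by configuring the library's root logger.
logger = logging.getLogger("mylib")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# Any child logger in the library now reaches the handler via propagation.
logging.getLogger("mylib.submodule").info("now visible")
```

With logging on by default instead, the library itself would perform the application-side configuration, which is what the initialize_logger approach below does.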
You can use an updated version of the current initialize_logger
function for adding the handlers. Something like:
import datetime
import logging
import os
import sys


def initialize_logger(log_dir='.', level=None) -> logging.Logger:
    """Initialize the default logger with stdout and file handlers.

    Args:
        log_dir (str): Path to the folder where the log file will be written.
        level (int): The log level. For example logging.DEBUG.

    Returns:
        (Logger): A logging instance with customized formatter and handlers.
    """
    level = level or logging.INFO
    logger = logging.getLogger("automatminer")
    logger.handlers = []  # reset logging handlers if they already exist

    formatter = logging.Formatter(
        fmt='%(asctime)s %(levelname)-8s %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')

    logpath = os.path.join(log_dir, "automatminer")
    if os.path.exists(logpath + ".log"):
        logpath += "_" + datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    logpath += ".log"

    handler = logging.FileHandler(logpath, mode='w')
    handler.setFormatter(formatter)
    screen_handler = logging.StreamHandler(stream=sys.stdout)
    screen_handler.setFormatter(formatter)

    logger.setLevel(level)
    logger.addHandler(screen_handler)
    logger.addHandler(handler)
    return logger
Once the handlers are added to the root "automatminer" logger, you don't need to pass around the logger object, as the logger and its handlers are stateful.
So is the preferred way to use a custom logger to import the root logger from automatminer and add handlers to it?
from automatminer import logger, MatPipe
logger.setLevel(whatever_level)
logger.addHandler(my_file_handler)
# etc...
MatPipe.fit(my_df, target)
# Logs with whatever_level to my_file_handler?
After wrapping my head around this it is beginning to make a lot more sense to me.
The point is that the handlers you add to a logger are stateful. Once you add them they won't get removed unless you manually remove them. So if you add custom handlers to the "automatminer"
logger, these handlers will be used any time a logging call is made inside any automatminer class.
The technique to use is the first one you mentioned, but it can effectively be simplified to:
import logging
logger = logging.getLogger("automatminer")
logger.setLevel(whatever_level)
logger.addHandler(my_file_handler)
MatPipe.fit(my_df, target)
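To illustrate why this works, here is a self-contained sketch (the handler class is invented for demonstration): handlers added once to the package root logger stay attached, and records emitted by child loggers such as automatminer.pipeline reach them through propagation.

```python
import logging

# Handlers added to the package root logger receive records from all
# child loggers via propagation; no logger objects are passed around.
records = []


class ListHandler(logging.Handler):
    """Demonstration handler that collects formatted messages in a list."""

    def emit(self, record):
        records.append(f"{record.name}: {record.getMessage()}")


root = logging.getLogger("automatminer")
root.setLevel(logging.INFO)
root.addHandler(ListHandler())

# A module-level logger, as each automatminer module would define it:
child = logging.getLogger("automatminer.pipeline")
child.info("Fitting MatPipe pipeline to data.")

# records == ["automatminer.pipeline: Fitting MatPipe pipeline to data."]
```

Because "automatminer.pipeline" is a child of "automatminer" in the logger hierarchy, its records bubble up to the root logger's handlers without any explicit wiring.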
@utf when logging with AMSET, does it also happen for you that empty log files are left in every imported module's directory? For example, I now get a bunch of automatminer.pipeline.log files in the top-level and sub-directories.
No, that doesn't happen in AMSET. Are you calling initialize_logger() in each module?
No, it's only being called by the top level class!
Edit: never mind, fixed.
@janosh this should be fixed now. You can log to a custom handler using the snippet @utf posted above
@ardunn Sorry for the radio silence. I was traveling the last few days.
To answer your earlier question, I'm not using a custom logger.
Since there's no new release yet, I pulled all the upstream changes to master and pip installed my local clone of automatminer to try out the fix.
Now when attempting to run
pipe = amm.MatPipe()
pipe.fit(train_df, "zT")
I get the error
2019-10-22 13:20:22 INFO Problem type is: regression
2019-10-22 13:20:22 INFO Fitting MatPipe pipeline to data.
2019-10-22 13:20:22 INFO AutoFeaturizer: Starting fitting.
2019-10-22 13:20:22 INFO AutoFeaturizer: Compositions detected as strings. Attempting conversion to Composition objects...
HBox(children=(IntProgress(value=0, description='StrToComposition', max=717, style=ProgressStyle(description_w…
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/Projects/thermo/src/notebooks/screening/automatminer.py in
6 pipe = amm.MatPipe()
7 # pipe = amm.MatPipe(**pipe_config)
----> 8 pipe.fit(train, "zT")
~/Projects/automatminer/automatminer/utils/pkg.py in wrapper(*args, **kwargs)
102 def wrapper(*args, **kwargs):
103 args[0].is_fit = False
--> 104 result = func(*args, **kwargs)
105 args[0].is_fit = True
106 return result
~/Projects/automatminer/automatminer/pipeline.py in fit(self, df, target)
177 # Fit transformers on training data
178 logger.info("Fitting MatPipe pipeline to data.")
--> 179 df = self.autofeaturizer.fit_transform(df, target)
180 df = self.cleaner.fit_transform(df, target)
181 df = self.reducer.fit_transform(df, target)
~/Projects/automatminer/automatminer/base.py in fit_transform(self, df, target)
64
65 """
---> 66 return self.fit(df, target).transform(df, target)
67
68 @property
~/Projects/automatminer/automatminer/utils/log.py in wrapper(*args, **kwargs)
94 self = args[0]
95 logger.info("{}Starting {}.".format(self._log_prefix, operation))
---> 96 result = meth(*args, **kwargs)
97 logger.info("{}Finished {}.".format(self._log_prefix, operation))
98 return result
~/Projects/automatminer/automatminer/utils/pkg.py in wrapper(*args, **kwargs)
102 def wrapper(*args, **kwargs):
103 args[0].is_fit = False
--> 104 result = func(*args, **kwargs)
105 args[0].is_fit = True
106 return result
~/Projects/automatminer/automatminer/featurization/core.py in fit(self, df, target)
308
309 if featurizer_type in df.columns:
--> 310 df = self._tidy_column(df, featurizer_type)
311
312 # Remove invalid featurizers by looking at valid_fraction
~/Projects/automatminer/automatminer/featurization/core.py in _tidy_column(self, df, featurizer_type)
562 multiindex=self.multiindex,
563 ignore_errors=True,
--> 564 inplace=False,
565 )
566
/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/site-packages/matminer/featurizers/conversions.py in featurize_dataframe(self, df, col_id, **kwargs)
87 self._target_col_id = col_id
88
---> 89 return super().featurize_dataframe(df, col_id, **kwargs)
90
91 def featurize(self, *x):
/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/site-packages/matminer/featurizers/base.py in featurize_dataframe(self, df, col_id, ignore_errors, return_errors, inplace, multiindex, pbar)
338 ignore_errors=ignore_errors,
339 return_errors=return_errors,
--> 340 pbar=pbar)
341
342 # Make sure the dataframe can handle multiindices
/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/site-packages/matminer/featurizers/base.py in featurize_many(self, entries, ignore_errors, return_errors, pbar)
461 return_errors=return_errors,
462 pbar=pbar)
--> 463 with Pool(self.n_jobs) as p:
464 func = partial(self.featurize_wrapper,
465 return_errors=return_errors,
/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/multiprocessing/context.py in Pool(self, processes, initializer, initargs, maxtasksperchild)
117 from .pool import Pool
118 return Pool(processes, initializer, initargs, maxtasksperchild,
--> 119 context=self.get_context())
120
121 def RawValue(self, typecode_or_type, *args):
/usr/local/Caskroom/miniconda/base/envs/thermo/lib/python3.6/multiprocessing/pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
166 processes = os.cpu_count() or 1
167 if processes < 1:
--> 168 raise ValueError("Number of processes must be at least 1")
169
170 if initializer is not None and not callable(initializer):
ValueError: Number of processes must be at least 1
Not sure if this is due to some error on my part during the installation...
Hey @janosh what version of python are you using?
I made some changes recently trying to fix some TPOT issues, and they seem to have caused some unexpected behavior; we're working to fix it right now.
I'm running Python 3.6.7.
Ah ok this might be the problem, as all the testing is only set up to use 3.7 (and our dev team exclusively uses 3.7+). If possible, I'd try making an env in 3.7+ and installing Automatminer there.
I'll add an issue to add CI for 3.6 but I am not sure it will be a priority in testing
Also @janosh on a separate note, in my experience trying to data mine thermoelectrics, predicting zT directly will almost always do poorly, regardless of model. I'd recommend predicting k_lattice and (for example) power factor independently.
@ardunn Thanks for the advice! I'm doing both actually and comparing performance.
@janosh your error here should be fixed as of v1.0.0.20191110. I don't think it had to do with the Python version; it was a multiprocessing error: Pool didn't like getting -1 as an argument instead of None (which means use all cores).
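The mismatch is that scikit-learn style APIs use n_jobs=-1 to mean "all cores", while multiprocessing.Pool expects None for that. A hypothetical normalization helper (this is illustrative, not automatminer's actual fix) shows the mapping:

```python
import os


def normalize_n_jobs(n_jobs):
    """Map scikit-learn style n_jobs values to what multiprocessing.Pool
    accepts: Pool(None) uses os.cpu_count() workers, whereas Pool(-1)
    raises ValueError("Number of processes must be at least 1").

    Hypothetical helper for illustration, not automatminer's actual fix.
    """
    if n_jobs is None or n_jobs == -1:
        return None  # Pool(None) -> use all available cores
    if n_jobs < -1:
        # scikit-learn convention: -2 means "all cores but one", and so on
        return max(1, (os.cpu_count() or 1) + 1 + n_jobs)
    return n_jobs
```

Passing the normalized value to Pool avoids the "Number of processes must be at least 1" error while preserving the -1 convention for callers.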
@ardunn I think 651e35b introduced a new error. I'm now getting
TypeError Traceback (most recent call last)
~/thermo/src/notebooks/screening/automatminer.py in
7 pipe = amm.MatPipe(**pipe_config)
----> 8 pipe.fit(train, "zT")
~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/utils/pkg.py in wrapper(*args, **kwargs)
102 def wrapper(*args, **kwargs):
103 args[0].is_fit = False
--> 104 result = func(*args, **kwargs)
105 args[0].is_fit = True
106 return result
~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/pipeline.py in fit(self, df, target)
183 df = self.cleaner.fit_transform(df, target)
184 df = self.reducer.fit_transform(df, target)
--> 185 self.learner.fit(df, target)
186 logger.info("MatPipe successfully fit.")
187 self.post_fit_df = df
~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/utils/log.py in wrapper(*args, **kwargs)
94 self = args[0]
95 logger.info("{}Starting {}.".format(self._log_prefix, operation))
---> 96 result = meth(*args, **kwargs)
97 logger.info("{}Finished {}.".format(self._log_prefix, operation))
98 return result
~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/utils/pkg.py in wrapper(*args, **kwargs)
102 def wrapper(*args, **kwargs):
103 args[0].is_fit = False
--> 104 result = func(*args, **kwargs)
105 args[0].is_fit = True
106 return result
~/miniconda3/envs/thermo/lib/python3.6/site-packages/automatminer/automl/adaptors.py in fit(self, df, target, **fit_kwargs)
135 self._features = df.drop(columns=target).columns.tolist()
136 self._fitted_target = target
--> 137 self._backend = self._backend.fit(X, y, **fit_kwargs)
138 return self
139
~/miniconda3/envs/thermo/lib/python3.6/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
627
628 """
--> 629 self._fit_init()
630 features, target = self._check_dataset(features, target, sample_weight)
631
~/miniconda3/envs/thermo/lib/python3.6/site-packages/tpot/base.py in _fit_init(self)
580 'The value 0 of n_jobs is invalid.'
581 )
--> 582 elif self.n_jobs < 0:
583 self._n_jobs = cpu_count() + 1 + self.n_jobs
584 else:
TypeError: '<' not supported between instances of 'NoneType' and 'int'
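The root cause is visible in the last frame: when n_jobs is None, TPOT's `self.n_jobs < 0` check hits Python 3's refusal to order None against an int. A minimal reproduction, plus a defensive guard (the helper name is invented for illustration):

```python
# Python 3 does not define an ordering between None and int, which is
# exactly what TPOT's `self.n_jobs < 0` check trips over when n_jobs
# arrives as None.
n_jobs = None
try:
    n_jobs < 0
except TypeError as exc:
    error_message = str(exc)  # "'<' not supported between instances ..."


def coerce_n_jobs(n_jobs, default=1):
    """Hypothetical guard: resolve None before any ordering comparison."""
    return default if n_jobs is None else n_jobs
```

So the fix is to make sure a concrete integer (or the documented sentinel) reaches TPOT rather than None.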
hey @janosh thanks for the report. I was able to reproduce the issue and pushed a fix. I am bit less likely to see problems with the presets since most of the workflows I have been running with automatminer are "custom pipelines" where I specify all the relevant config manually.
tl;dr, much appreciated :)
Just updated to automatminer@2019.10.14 and I'm now getting a type error with the following stack trace when attempting to save a pipe.