microsoft / nni

An open source AutoML toolkit that automates the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

ImportError: Cannot use a path to identify something from __main__. #5610

Closed ekurtgl closed 1 year ago

ekurtgl commented 1 year ago

Describe the issue:

Hi,

I was able to run the demo scripts. Now I am trying my own architecture, and I am running into this error when calling `exp.run()`:

"ImportError: Cannot use a path to identify something from __main__.

During handling of the above exception, another exception occurred: ... TypeError: cannot pickle '_io.BufferedReader' object."

Full Log message:

```
ImportError                               Traceback (most recent call last)
File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/common/serializer.py:791, in get_hybrid_cls_or_func_name(cls_or_func, pickle_size_limit)
    790 try:
--> 791     name = _get_cls_or_func_name(cls_or_func)
    792     # import success, use a path format

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/common/serializer.py:770, in _get_cls_or_func_name(cls_or_func)
    769 if module_name == '__main__':
--> 770     raise ImportError('Cannot use a path to identify something from __main__.')
    771 full_name = module_name + '.' + cls_or_func.__name__

ImportError: Cannot use a path to identify something from __main__.

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 exp.run(exp_config, 8081)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/nas/experiment/pytorch.py:298, in RetiariiExperiment.run(self, config, port, debug)
    291 if self._action == 'create':
    292     base_model_ir, self.applied_mutators = preprocess_model(
    293         self.base_model, self.evaluator, self.applied_mutators,
    294         full_ir=not isinstance(canoni_conf.execution_engine, (PyEngineConfig, BenchmarkEngineConfig)),
    295         dummy_input=canoni_conf.execution_engine.dummy_input
    296         if isinstance(canoni_conf.execution_engine, (BaseEngineConfig, CgoEngineConfig)) else None
    297     )
--> 298 self._save_experiment_checkpoint(base_model_ir, self.applied_mutators, self.strategy,
    299     canoni_conf.experiment_working_directory)
    300 elif self._action == 'resume':
    301     base_model_ir, self.applied_mutators, self.strategy = self._load_experiment_checkpoint(
    302         canoni_conf.experiment_working_directory)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/nas/experiment/pytorch.py:226, in RetiariiExperiment._save_experiment_checkpoint(self, base_model_ir, applied_mutators, strategy, exp_work_dir)
    224 ckp_path = os.path.join(exp_work_dir, self.id, 'checkpoint')
    225 with open(os.path.join(ckp_path, 'nas_model'), 'w') as fp:
--> 226     dump(base_model_ir._dump(), fp, pickle_size_limit=int(os.getenv('PICKLE_SIZE_LIMIT', 64 * 1024)))
    227 with open(os.path.join(ckp_path, 'applied_mutators'), 'w') as fp:
    228     dump(applied_mutators, fp)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/common/serializer.py:341, in dump(obj, fp, use_trace, pickle_size_limit, allow_nan, **json_tricks_kwargs)
    339 if json_tricks_kwargs.get('compression') is not None:
    340     raise ValueError('If you meant to compress the dumped payload, please use dump_bytes.')
--> 341 result = _dump(
    342     obj=obj,
    343     fp=fp,
    344     use_trace=use_trace,
    345     pickle_size_limit=pickle_size_limit,
    346     allow_nan=allow_nan,
    347     **json_tricks_kwargs)
    348 return cast(str, result)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/common/serializer.py:390, in _dump(obj, fp, use_trace, pickle_size_limit, allow_nan, **json_tricks_kwargs)
    387 json_tricks_kwargs['allow_nan'] = allow_nan
    389 if fp is not None:
--> 390     return json_tricks.dump(obj, fp, obj_encoders=encoders, **json_tricks_kwargs)
    391 else:
    392     return json_tricks.dumps(obj, obj_encoders=encoders, **json_tricks_kwargs)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/json_tricks/nonp.py:151, in dump(obj, fp, sort_keys, cls, obj_encoders, extra_obj_encoders, primitives, compression, force_flush, allow_nan, conv_str_byte, fallback_encoders, properties, **jsonkwargs)
    149 if (isinstance(obj, str_type) or hasattr(obj, 'write')) and isinstance(fp, (list, dict)):
    150     raise ValueError('json-tricks dump arguments are in the wrong order: provide the data to be serialized before file handle')
--> 151 txt = dumps(obj, sort_keys=sort_keys, cls=cls, obj_encoders=obj_encoders, extra_obj_encoders=extra_obj_encoders,
    152     primitives=primitives, compression=compression, allow_nan=allow_nan, conv_str_byte=conv_str_byte,
    153     fallback_encoders=fallback_encoders, properties=properties, **jsonkwargs)
    154 if isinstance(fp, str_type):
    155     if compression:

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/json_tricks/nonp.py:125, in dumps(obj, sort_keys, cls, obj_encoders, extra_obj_encoders, primitives, compression, allow_nan, conv_str_byte, fallback_encoders, properties, **jsonkwargs)
    121     cls = TricksEncoder
    122 combined_encoder = cls(sort_keys=sort_keys, obj_encoders=encoders, allow_nan=allow_nan,
    123     primitives=primitives, fallback_encoders=fallback_encoders,
    124     properties=properties, **jsonkwargs)
--> 125 txt = combined_encoder.encode(obj)
    126 if not is_py3 and isinstance(txt, str):
    127     txt = unicode(txt, ENCODING)

File ~/anaconda3/envs/tpot/lib/python3.10/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195     return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File ~/anaconda3/envs/tpot/lib/python3.10/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/json_tricks/encoders.py:77, in TricksEncoder.default(self, obj, *args, **kwargs)
     75 prev_id = id(obj)
     76 for encoder in self.obj_encoders:
---> 77     obj = encoder(obj, primitives=self.primitives, is_changed=id(obj) != prev_id, properties=self.properties)
     78 if id(obj) == prev_id:
     79     raise TypeError(('Object of type {0:} could not be encoded by {1:} using encoders [{2:s}]. '
     80         'You can add an encoders for this type using extra_obj_encoders. If you want to \'skip\' this '
     81         'object, consider using fallback_encoders like str or lambda o: None.').format(
     82         type(obj), self.__class__.__name__, ', '.join(str(encoder) for encoder in self.obj_encoders)))

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/json_tricks/utils.py:66, in filtered_wrapper.<locals>.wrapper(*args, **kwargs)
     65 def wrapper(*args, **kwargs):
---> 66     return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/common/serializer.py:818, in _json_tricks_func_or_cls_encode(cls_or_func, primitives, pickle_size_limit)
    813 if not isinstance(cls_or_func, type) and not _is_function(cls_or_func):
    814     # not a function or class, continue
    815     return cls_or_func
    817 return {
--> 818     '__nni_type__': get_hybrid_cls_or_func_name(cls_or_func, pickle_size_limit)
    819 }

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/common/serializer.py:795, in get_hybrid_cls_or_func_name(cls_or_func, pickle_size_limit)
    793     return 'path:' + name
    794 except (ImportError, AttributeError):
--> 795     b = cloudpickle.dumps(cls_or_func)
    796     if len(b) > pickle_size_limit:
    797         raise ValueError(f'Pickle too large when trying to dump {cls_or_func}. '
    798             'Please try to raise pickle_size_limit if you insist.')

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py:73, in dumps(obj, protocol, buffer_callback)
     69 with io.BytesIO() as file:
     70     cp = CloudPickler(
     71         file, protocol=protocol, buffer_callback=buffer_callback
     72     )
---> 73     cp.dump(obj)
     74     return file.getvalue()

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py:632, in CloudPickler.dump(self, obj)
    630 def dump(self, obj):
    631     try:
--> 632         return Pickler.dump(self, obj)
    633     except RuntimeError as e:
    634         if "recursion" in e.args[0]:

TypeError: cannot pickle '_io.BufferedReader' object
```


Any ideas on what might be the problem? Thanks.

matluster commented 1 year ago

Could you print base_model_ir to see what we are trying to dump here?

ekurtgl commented 1 year ago

Hi @matluster,

I put the print statement here:

[screenshot: the print statement added in `_save_experiment_checkpoint`]

And this is the output:

[screenshot: the printed `base_model_ir` output]

Full line that doesn't fit into the screenshot:

```
Model(model_id=2, status=ModelStatus.Frozen, graphs=['_model'], evaluator=FunctionalEvaluator(_nni_symbol=<class 'nni.nas.evaluator.functional.FunctionalEvaluator'>, _nni_args=[], _nni_kwargs={'function': <function evaluate_model at 0x7f256830b880>}, _nni_call_super=True, function=<function evaluate_model at 0x7f256830b880>, arguments={}), metric=None, intermediate_metrics=[], python_class=<class 'models.pointnet2_cls_radar_nni_nas.get_model_nas'>)
```

Thank you!

ultmaster commented 1 year ago

Does calling nni.dump(base_model_ir) work? It looks like there is an open file somewhere within the object you are dumping, but I don't see one in the object you printed.

ekurtgl commented 1 year ago

Hi @ultmaster,

Thank you for your suggestion.

nni.dump(base_model_ir) throws the same error (TypeError: cannot pickle '_io.BufferedReader' object).

Full log:


```
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 exp.run(exp_config, 8081)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/nas/experiment/pytorch.py:301, in RetiariiExperiment.run(self, config, port, debug)
    294 if self._action == 'create':
    295     base_model_ir, self.applied_mutators = preprocess_model(
    296         self.base_model, self.evaluator, self.applied_mutators,
    297         full_ir=not isinstance(canoni_conf.execution_engine, (PyEngineConfig, BenchmarkEngineConfig)),
    298         dummy_input=canoni_conf.execution_engine.dummy_input
    299         if isinstance(canoni_conf.execution_engine, (BaseEngineConfig, CgoEngineConfig)) else None
    300     )
--> 301 self._save_experiment_checkpoint(base_model_ir, self.applied_mutators, self.strategy,
    302     canoni_conf.experiment_working_directory)
    303 elif self._action == 'resume':
    304     base_model_ir, self.applied_mutators, self.strategy = self._load_experiment_checkpoint(
    305         canoni_conf.experiment_working_directory)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/nas/experiment/pytorch.py:227, in RetiariiExperiment._save_experiment_checkpoint(self, base_model_ir, applied_mutators, strategy, exp_work_dir)
    225 print('base_model_ir:\n', base_model_ir)
    226 import nni
--> 227 nni.dump(base_model_ir)
    228 with open(os.path.join(ckp_path, 'nas_model'), 'w') as fp:
    229     dump(base_model_ir._dump(), fp, pickle_size_limit=int(os.getenv('PICKLE_SIZE_LIMIT', 64 * 1024)))

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/common/serializer.py:341, in dump(obj, fp, use_trace, pickle_size_limit, allow_nan, **json_tricks_kwargs)
    339 if json_tricks_kwargs.get('compression') is not None:
    340     raise ValueError('If you meant to compress the dumped payload, please use dump_bytes.')
--> 341 result = _dump(
    342     obj=obj,
    343     fp=fp,
    344     use_trace=use_trace,
    345     pickle_size_limit=pickle_size_limit,
    346     allow_nan=allow_nan,
    347     **json_tricks_kwargs)
    348 return cast(str, result)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/common/serializer.py:392, in _dump(obj, fp, use_trace, pickle_size_limit, allow_nan, **json_tricks_kwargs)
    390     return json_tricks.dump(obj, fp, obj_encoders=encoders, **json_tricks_kwargs)
    391 else:
--> 392     return json_tricks.dumps(obj, obj_encoders=encoders, **json_tricks_kwargs)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/json_tricks/nonp.py:125, in dumps(obj, sort_keys, cls, obj_encoders, extra_obj_encoders, primitives, compression, allow_nan, conv_str_byte, fallback_encoders, properties, **jsonkwargs)
    121     cls = TricksEncoder
    122 combined_encoder = cls(sort_keys=sort_keys, obj_encoders=encoders, allow_nan=allow_nan,
    123     primitives=primitives, fallback_encoders=fallback_encoders,
    124     properties=properties, **jsonkwargs)
--> 125 txt = combined_encoder.encode(obj)
    126 if not is_py3 and isinstance(txt, str):
    127     txt = unicode(txt, ENCODING)

File ~/anaconda3/envs/tpot/lib/python3.10/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195     return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File ~/anaconda3/envs/tpot/lib/python3.10/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/json_tricks/encoders.py:77, in TricksEncoder.default(self, obj, *args, **kwargs)
     75 prev_id = id(obj)
     76 for encoder in self.obj_encoders:
---> 77     obj = encoder(obj, primitives=self.primitives, is_changed=id(obj) != prev_id, properties=self.properties)
     78 if id(obj) == prev_id:
     79     raise TypeError(('Object of type {0:} could not be encoded by {1:} using encoders [{2:s}]. '
     80         'You can add an encoders for this type using extra_obj_encoders. If you want to \'skip\' this '
     81         'object, consider using fallback_encoders like str or lambda o: None.').format(
     82         type(obj), self.__class__.__name__, ', '.join(str(encoder) for encoder in self.obj_encoders)))

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/json_tricks/utils.py:66, in filtered_wrapper.<locals>.wrapper(*args, **kwargs)
     65 def wrapper(*args, **kwargs):
---> 66     return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/nni/common/serializer.py:864, in _json_tricks_any_object_encode(obj, primitives, pickle_size_limit)
    862     return obj
    863 if hasattr(obj, '__class__') and (hasattr(obj, '__dict__') or hasattr(obj, '__slots__')):
--> 864     b = cloudpickle.dumps(obj)
    865     if len(b) > pickle_size_limit > 0:
    866         raise PayloadTooLarge(f'Pickle too large when trying to dump {obj}. This might be caused by classes that are '
    867             'not decorated by @nni.trace. Another option is to force bytes pickling and '
    868             'try to raise pickle_size_limit.')

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py:73, in dumps(obj, protocol, buffer_callback)
     69 with io.BytesIO() as file:
     70     cp = CloudPickler(
     71         file, protocol=protocol, buffer_callback=buffer_callback
     72     )
---> 73     cp.dump(obj)
     74     return file.getvalue()

File ~/anaconda3/envs/tpot/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py:632, in CloudPickler.dump(self, obj)
    630 def dump(self, obj):
    631     try:
--> 632         return Pickler.dump(self, obj)
    633     except RuntimeError as e:
    634         if "recursion" in e.args[0]:

TypeError: cannot pickle '_io.BufferedReader' object
```

However, pickle.dump(model_space) and pickle.dump(evaluator) in the top-level script work without any problem. Using pickle.dump(base_model_ir) also works fine and creates the pickled file:

[screenshot: the pickle file created from base_model_ir]

So, I think the model_space and the evaluator should be serializable. If you have any other suggestions, I can give them a try. Thank you.

ultmaster commented 1 year ago

You might need to try nni.dump(model_space) and nni.dump(evaluator).
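A minimal sketch of that check, assuming `model_space` and `evaluator` are the objects passed to `RetiariiExperiment` (the names follow the earlier comments in this thread):

```python
import nni

# Dump each piece separately to see which object fails to serialize.
nni.dump(model_space)  # the NAS model space
nni.dump(evaluator)    # e.g. the FunctionalEvaluator wrapping evaluate_model
```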

Also, you can give 3.0rc1 a try if possible. You can find the installation command in the README.

ekurtgl commented 1 year ago

Hi @ultmaster ,

Thank you for your suggestion. nni.dump(model_space) works fine, but nni.dump(evaluator) throws the error above. Do you see anything in my evaluation function that might not be compatible with NNI?

```python
def evaluate_model(model_cls):
    model = model_cls()
    model.apply(init_weights)
    model.apply(inplace_relu)
    criterion = get_loss(loss_type=args.loss_type, weight=loss_class_weights)
    model.to(device)
    criterion.to(device)

    # optimizer
    if args.optimizer == 'Adam':
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=args.learning_rate,
            betas=(0.9, 0.999),
            eps=1e-08,
            weight_decay=args.decay_rate
        )
    elif args.optimizer == 'AdamW':
        optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=args.learning_rate,
            betas=(0.9, 0.999),
            eps=1e-08,
            weight_decay=args.decay_rate
        )
    else:
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=args.learning_rate,
            momentum=0.9,
            weight_decay=args.decay_rate)

    # LR scheduler
    if args.lr_scheduler == 'Cos':
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, eta_min=0.001, T_max=12)
    elif args.lr_scheduler == 'Step':
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=12, gamma=0.8)

    # AMP
    scaler = torch.cuda.amp.GradScaler()

    # resume model training
    resumed_epoch = 0
    epoch_steps = 2
    for epoch in range(args.epochs):
        if os.path.exists(os.path.join(args.model_dir, 'ep%03d' % (epoch // 2 * epoch_steps + 1))):
            scheduler.step()
            continue
        else:
            resume_dir = os.path.join(args.model_dir, 'ep%03d' % (epoch // 2 * epoch_steps - 1))
            if os.path.exists(resume_dir) and epoch % epoch_steps == 0:
                print("Resumed model training at epoch %d" % (epoch))
                model.load_state_dict(torch.load(os.path.join(resume_dir, 'model.bin')))
                resumed_epoch = epoch
                break

    logger.info('train model')
    avg_loss = MovingAverageValue()

    total_steps = 0
    save_steps = 0
    for epoch in tqdm(range(resumed_epoch, args.epochs), unit='epoch', desc='Train', disable=args.silent):
        model.train()
        acc_ls = []
        pbar = tqdm(train_dataloader, total=len(train_dataloader), unit='batch', desc='Train_Epoch', disable=args.silent)
        for points, target, _ in pbar:
            optimizer.zero_grad()

            points = points.to(device)
            target = target.to(device)
            pred, trans_feat = model(points)

            with torch.cuda.amp.autocast():
                loss = criterion(pred, target, trans_feat)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            total_steps += 1
            save_steps += 1

            avg_loss.add(loss.item())
            pbar.set_postfix({'loss': avg_loss.get_avg()})

            # # evaluate and save by steps
            # if save_steps >= args.save_model_steps:
            #     model_dir = os.path.join(args.model_dir, '%06d' % total_steps)
            #     evaluate_and_save(model, test_dataloader, model_dir=model_dir, total_steps=total_steps, epoch=epoch, logger=logger, device=device, silent=args.silent)
            #     save_steps = 0

        if args.silent:
            logger.info('Epoch {} trained with loss {}'.format(epoch, avg_loss.get_avg()))

        # evaluate and save by epochs
        if epoch > 0 and epoch % epoch_steps == 1:
            model_dir = os.path.join(args.model_dir, 'ep%03d' % epoch)
            accuracy = evaluate_and_save_nas(model, test_dataloader, model_dir=model_dir, total_steps=total_steps, epoch=epoch, logger=logger, device=device, silent=args.silent)
            nni.report_intermediate_result(accuracy)
            acc_ls.append(accuracy)

        scheduler.step()
        pbar.close()

    nni.report_final_result(np.mean(acc_ls))
```

Thank you!

ekurtgl commented 1 year ago

And this is the evaluate_and_save_nas() function:

```python
def evaluate_and_save_nas(model, loader, model_dir, logger, total_steps, epoch, device='cuda', silent=False):
    pred_labels_all = []
    target_all = []
    model.eval()
    with torch.no_grad():
        for batch in tqdm(loader, total=len(loader), unit='batch', desc='Evalulate', disable=silent):
            points, target, _ = batch
            points = points.to(device)
            target = target.to(device)
            pred, trans_feat = model(points)
            pred_labels = torch.argmax(pred, dim=1)
            pred_labels_all.append(pred_labels.cpu().numpy())
            target_all.append(target.cpu().numpy())
    pred_labels_all = np.concatenate(pred_labels_all)
    target_all = np.concatenate(target_all)
    metrics = {}
    metrics['accuracy'] = accuracy_score(target_all, pred_labels_all)
    logger.info('evaluation metrics at step {} episode {}: {}'.format(total_steps, epoch, metrics))

    os.makedirs(model_dir, exist_ok=True)
    np.savetxt(os.path.join(model_dir, 'test_pred_labels.txt'), pred_labels_all, fmt='%d')
    np.savetxt(os.path.join(model_dir, 'test_target.txt'), target_all, fmt='%d')
    with open(os.path.join(model_dir, 'metrics.txt'), 'w') as fout:
        for key, val in metrics.items():
            fout.write('{}\t{}\n'.format(key, val))

    with torch.no_grad():
        model_file = os.path.join(model_dir, 'model.bin')
        torch.save(model.state_dict(), model_file)
    return metrics['accuracy']
```

ultmaster commented 1 year ago

What is `fout`?

ekurtgl commented 1 year ago

It is a text file ('metrics.txt') to store the performance results.

ultmaster commented 1 year ago

Please make that a local variable, as it appears to be the reason why serialization goes wrong.

ekurtgl commented 1 year ago

Hi @ultmaster ,

I commented out that part, but unfortunately it didn't make any difference:

[screenshots: the commented-out block and the unchanged error]

Do you suspect anything else in the evaluate_and_save_nas() or evaluate_model() functions that may not be serializable?

ultmaster commented 1 year ago

Have you also commented out the initialization of `fout`?

ekurtgl commented 1 year ago

Yes, it is inside the `with` statement (around line 120 of the code in my previous comment):

[screenshot omitted]

matluster commented 1 year ago

I don't have any clue what might be the problem here.

One possible way to debug is to comment out lines until the problem disappears. Some object here must be disturbing the serialization process, but I don't know which one.

ekurtgl commented 1 year ago

I think I found the problem. It occurs while iterating over the train_dataloader in the evaluate_model() function. If I enable the batch for loop, nni.dump(evaluator) fails:

[screenshots: the enabled batch loop and the resulting error]

When I comment it out, nni.dump(evaluator) works fine:

[screenshots: the commented-out loop and the successful dump]

However, when I try nni.dump(train_loader), it works just fine:

[screenshot: successful nni.dump(train_loader)]

So, I believe train_loader should be picklable, and I create it using nni.trace():

[screenshot: train_loader created via nni.trace]

Isn't that a bit strange? Am I missing something here? I appreciate your help.

ultmaster commented 1 year ago

Is your train dataloader a global variable or something from the surrounding context? In my experience, a variable must fall into one of the following cases to work well:

  1. Initialized as a local variable inside the serialized / evaluator function (recommended).
  2. Passed as an explicit parameter of the function. Only in this case does putting nni.trace on the loader make a difference.

Using the dataloader as an implicit dependency of the function is not recommended. For example, the following case will probably fail:

```
dataloader = ...  # no matter whether traced or not

def foo():
    ... dataloader

nni.dump(foo)
```

But if you call nni.dump(dataloader) directly, I guess that whether the dataloader has nni.trace on it does make a difference.
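A minimal sketch of the two working patterns above, with placeholder data (`make_dataset`, the dummy TensorDataset, and the batch size are illustrative, not from the original code):

```python
import nni
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_dataset():
    # Dummy stand-in for the real dataset; only here to keep the sketch self-contained.
    return TensorDataset(torch.randn(8, 3), torch.zeros(8, dtype=torch.long))

# Case 1 (recommended): create the dataloader as a local variable inside the
# evaluator function, so nothing outside the function has to be pickled.
def evaluate_model(model_cls):
    train_loader = DataLoader(make_dataset(), batch_size=4, shuffle=True)
    model = model_cls()
    # ... training loop iterating over train_loader ...

# Case 2: pass the dataloader as an explicit argument of the evaluator function.
# Only in this case does wrapping the DataLoader construction in nni.trace matter.
traced_loader = nni.trace(DataLoader)(make_dataset(), batch_size=4, shuffle=True)

def evaluate_model_with_loader(model_cls, dataloader):
    model = model_cls()
    # ... training loop iterating over dataloader ...
```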

ekurtgl commented 1 year ago

Moving dataloaders into the evaluate_model() function solved my problem. Thank you so much for your continuous support! @ultmaster
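For completeness, a rough sketch of how the fixed evaluator plugs back into the experiment, using the class paths that appear in the tracebacks above; `model_space`, `search_strategy`, and `exp_config` are assumed to be set up as in the original notebook and are not shown here:

```python
from nni.nas.evaluator.functional import FunctionalEvaluator
from nni.nas.experiment.pytorch import RetiariiExperiment

# evaluate_model now builds its dataloaders locally, so the evaluator serializes cleanly.
evaluator = FunctionalEvaluator(evaluate_model)

# model_space, search_strategy and exp_config come from the rest of the notebook.
exp = RetiariiExperiment(model_space, evaluator, [], search_strategy)
exp.run(exp_config, 8081)
```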