pwwang / plkit

A wrapper of pytorch-lightning that makes you write even less code.
https://pwwang.github.com/plkit
MIT License

[SGE Runner] AttributeError: 'NoneType' object has no attribute 'f_lineno' #1

Closed samx81 closed 3 years ago

samx81 commented 3 years ago

Hi, I'm trying to get the SGE runner working with the minimal example, which runs the MNIST task. It runs successfully with LocalRunner, but not with SGERunner.

Do I need to pass additional arguments to make it work?

Error Messages

[02/24/21 22:08:39] INFO     Wrapping up the job ...
INFO       - Workdir: ./workdir/plkit-229a4c36-4f6f-5f7c-9732-04ec09412aa1
INFO       - Script: ./workdir/plkit-229a4c36-4f6f-5f7c-9732-04ec09412aa1/job.sge.sh
Traceback (most recent call last):
  File "mnist_minimal.py", line 44, in <module>
    sge.run(configuration, Data, LitClassifier)
  File "/share/homes/samtsao/miniconda3/envs/aster/lib/python3.7/site-packages/plkit/runner.py", line 119, in run
    cmd = cmdy._(*sys.argv, _exe=sys.executable).h.strcmd
  File "/share/homes/samtsao/miniconda3/envs/aster/lib/python3.7/site-packages/cmdy/__init__.py", line 147, in __call__
    _will(raise_exc=False))
  File "/share/homes/samtsao/miniconda3/envs/aster/lib/python3.7/site-packages/varname/core.py", line 168, in will
    node = get_node(frame + 1, raise_exc=raise_exc)
  File "/share/homes/samtsao/miniconda3/envs/aster/lib/python3.7/site-packages/varname/utils.py", line 89, in get_node
    return get_node_by_frame(frame, raise_exc)
  File "/share/homes/samtsao/miniconda3/envs/aster/lib/python3.7/site-packages/varname/utils.py", line 96, in get_node_by_frame
    exect = Source.executing(frame)
  File "/share/homes/samtsao/miniconda3/envs/aster/lib/python3.7/site-packages/executing/executing.py", line 265, in executing
    lineno = frame.f_lineno
AttributeError: 'NoneType' object has no attribute 'f_lineno'

Process

  1. Use the "minimal example" from the docs
  2. Initialize an SGERunner and pass it to plkit.run()

Expected result

Jobs are submitted to the grid engine successfully

Possible Fix

I've dug into the "executing" package and found that Source.executing(frame) returns an object whose exect.node is None. Maybe I need extra configuration on my lab servers to make it work?
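For reference, the last line of the traceback fails because the frame object reaching Source.executing is itself None. The sketch below is not code from varname or executing. It only illustrates, with a hypothetical helper named frame_at, one ordinary way a frame lookup can yield None: walking further up the call stack than there are callers.

```python
import sys


def frame_at(depth):
    """Return the frame `depth` levels above this function, or None if the stack is shallower."""
    frame = sys._getframe(0)  # frame of frame_at itself
    for _ in range(depth):
        if frame is None:
            break
        frame = frame.f_back  # f_back is None on the outermost frame
    return frame


near = frame_at(1)       # the caller's frame: a real frame object
far = frame_at(10_000)   # far past the top of the stack: None

print(type(near).__name__, near.f_lineno)
try:
    far.f_lineno
except AttributeError as exc:
    # Same message as in the report above
    print(exc)
```

Whether this is the actual cause on the reporter's machine is not established here; the sketch only shows why the error message mentions NoneType.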

Other Question

Since plkit is a wrapper for pytorch-lightning, is it possible to train a model from a pytorch-lightning based toolkit (like https://github.com/asteroid-team/asteroid/) with plkit's SGERunner to speed up the process?

pwwang commented 3 years ago

It looks like a version mismatch among the dependencies is breaking it.

Could you run this to check the installed versions:

>>> import plkit, cmdy, varname
>>> print(plkit.__version__, cmdy.__version__, varname.__version__)
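If importing one of the packages fails outright, the same information can be read from the installed package metadata instead. This is a minimal sketch, assuming Python 3.8+ for importlib.metadata (the traceback above shows Python 3.7, where the importlib_metadata backport offers the same functions). The helper name installed_versions is made up for this example, and "executing" is added to the list only because it appears in the traceback.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    report = installed_versions(["plkit", "cmdy", "varname", "executing"])
    for name, ver in report.items():
        print(f"{name}: {ver or 'not installed'}")
```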

As for the asteroid model, it is possible to run it with SGERunner, but you will need to wrap it a little:

system = System(model, optimizer, loss, train_loader, val_loader)

has to be wrapped into the structure in the boilerplate. Instead of initializing the model and data in advance, you may need to use the plkit.DataModule and plkit.Model to wrap them, and use a universal config. The runner will instantiate the model and data objects.
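The pattern described here, passing classes and a config so that the runner builds the objects itself, can be sketched in plain Python. Everything in this block (ToyData, ToyModel, toy_run, and the config keys) is a hypothetical stand-in for illustration only. It is not plkit's API, and the real base classes and run signature should be taken from the plkit documentation.

```python
class ToyData:
    """Stand-in for a data wrapper: constructed from the config alone."""

    def __init__(self, config):
        self.batch_size = config["batch_size"]

    def batches(self):
        samples = list(range(10))  # placeholder dataset
        step = self.batch_size
        return [samples[i:i + step] for i in range(0, len(samples), step)]


class ToyModel:
    """Stand-in for a model wrapper: also constructed from the config alone."""

    def __init__(self, config):
        self.learning_rate = config["learning_rate"]
        self.seen = 0

    def training_step(self, batch):
        self.seen += len(batch)  # placeholder for the real optimization step


def toy_run(config, data_class, model_class):
    """The runner, not the caller, instantiates the data and model objects."""
    data = data_class(config)
    model = model_class(config)
    for batch in data.batches():
        model.training_step(batch)
    return model


config = {"batch_size": 4, "learning_rate": 1e-3}
model = toy_run(config, ToyData, ToyModel)
print(model.seen)  # 10
```

For a pre-built object such as the System(...) call above, this means moving its construction into the wrapper classes so that it happens when the runner instantiates them.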

pwwang commented 3 years ago

@samx81 I believe this issue is solved by the latest varname. Could you upgrade it with pip install -U varname and try again?

samx81 commented 3 years ago

Thanks for the fix!

SGERunner now runs properly, but I found that if I use GPUs to run a task, with or without SGERunner, it returns Segmentation fault (core dumped) at the end of the program, though it doesn't seem to affect the result.

Finally, thanks for the advice on running the Asteroid package. I'll see if I can make it fit into the wrapper.