ppdebreuck / modnet

MODNet: a framework for machine learning materials properties
MIT License

Can't use featurize twice in a notebook without restarting the kernel #202

Closed: shivang-22 closed this 8 months ago

shivang-22 commented 8 months ago

I am using the following function to use MODNet on a custom dataset with compositions only:

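# Imports assumed; the original post omitted them.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ReduceLROnPlateau
from modnet.preprocessing import MODData
from modnet.models import MODNetModel

def GNN(df, target, extra_feat, batch, lr):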

    modnet_mask = (~df[target].isna())
    for f in range(len(extra_feat)):
        modnet_mask = modnet_mask & (~df[extra_feat[f]].isna())

    mod_df = df[modnet_mask]
    mod_df.reset_index(inplace=True, drop=True)

    data = MODData(
        materials=mod_df["Name"],
        targets=mod_df[target],
        target_names=[target]
    )

    data.featurize()

    for f in range(len(extra_feat)):
        data.df_featurized[extra_feat[f]] = mod_df[extra_feat[f]].values

    split = train_test_split(range(len(mod_df)), test_size=0.1, random_state=1234)
    train, test = data.split(split)
    train.feature_selection(n=-1)

    model = MODNetModel([[[target]]],
                        weights={target:1},
                        num_neurons = [[128], [64], [8], [2]],
                        n_feat = 100,
                        act =  "relu"
                       )

    model.fit(train,
              val_fraction = 0.1,
              lr = lr,
              batch_size = batch,
              loss = 'mae',
              epochs = 100,
              verbose = 1,
              callbacks=[ReduceLROnPlateau()]
             )

    pred = model.predict(test)
    mae_test = np.absolute(pred.values-test.df_targets.values).mean()
    print(f'mae: {mae_test}')

It runs fine the first time I use it, but if I change the inputs to the function and run it again in another cell, it gets stuck forever on the featurize step. For example,

GNN(data_df, 'y', ['x1', 'x2'], 32, 0.02) works fine, but then in the very next cell, GNN(data_df, 'y', ['x1', 'x2'], 32, 0.04) gets stuck. Am I missing something?

ml-evs commented 8 months ago

Hi @shivang-22, could you give us any more info? When you say it "gets stuck" what was the last output? featurize is directly calling matminer's featurize_many under the hood by default, which has been known to be a bit iffy with parallelism (though I'm not sure why it would work the first time on the same data). You could try explicitly setting the number of "jobs" in the featurizer with e.g. data.featurize(n_jobs=1).

shivang-22 commented 8 months ago

Certainly! This is the error log I get when I interrupt the kernel because it got 'stuck'. I'm not pasting the message in its entirety because that would be too long, but this might help.

The top of the error log is:

Cell In[15], line 18, in GNN(df, target, extra_feat, batch, lr)
     10 mod_df.reset_index(inplace=True, drop=True)
     12 data = MODData(
     13     materials=mod_df["Name"],
     14     targets=mod_df[target],
     15     target_names=[target]
     16 )
---> 18 data.featurize()
     20 for f in range(len(extra_feat)):
     21     data.df_featurized[extra_feat[f]] = mod_df[extra_feat[f]].values

File /scratch/micromamba/envs/alembic/lib/python3.10/site-packages/modnet/preprocessing.py:783, in MODData.featurize(self, fast, db_file, n_jobs, drop_allnan)
    779         df_final = df_done
    781 # otherwise, no structures were loaded, so we need to compute all
    782 else:
--> 783     df_final = self.featurizer.featurize(self.df_structure)
    785 # replace infinite values by nan that are handled during the fit
    786 df_final = clean_df(df_final, drop_allnan=drop_allnan)

File /scratch/micromamba/envs/alembic/lib/python3.10/site-packages/modnet/featurizers/featurizers.py:91, in MODFeaturizer.featurize(self, df)
     89 df_composition = pd.DataFrame([])
     90 if self.composition_featurizers or self.oxid_composition_featurizers:
---> 91     df_composition = self.featurize_composition(df)
     93 df_structure = pd.DataFrame([])
     94 if self.structure_featurizers:

This points to the fact that it's still computing the features. The bottom of the error log was more interesting to me, and reads as follows:

File /scratch/micromamba/envs/alembic/lib/python3.10/site-packages/matminer/featurizers/base.py:476, in BaseFeaturizer.featurize_many(self, entries, ignore_errors, return_errors, pbar)
    470 with Pool(self.n_jobs, maxtasksperchild=1) as p:
    471     func = partial(
    472         self.featurize_wrapper,
    473         return_errors=return_errors,
    474         ignore_errors=ignore_errors,
    475     )
--> 476     res = p.map(func, entries, chunksize=self.chunksize)
    477     return res

File /scratch/micromamba/envs/alembic/lib/python3.10/multiprocessing/pool.py:367, in Pool.map(self, func, iterable, chunksize)
    362 def map(self, func, iterable, chunksize=None):
    363     '''
    364     Apply `func` to each element in `iterable`, collecting the results
    365     in a list that is returned.
    366     '''
--> 367     return self._map_async(func, iterable, mapstar, chunksize).get()

File /scratch/micromamba/envs/alembic/lib/python3.10/multiprocessing/pool.py:768, in ApplyResult.get(self, timeout)
    767 def get(self, timeout=None):
--> 768     self.wait(timeout)
    769     if not self.ready():
    770         raise TimeoutError

File /scratch/micromamba/envs/alembic/lib/python3.10/multiprocessing/pool.py:765, in ApplyResult.wait(self, timeout)
    764 def wait(self, timeout=None):
--> 765     self._event.wait(timeout)

File /scratch/micromamba/envs/alembic/lib/python3.10/threading.py:607, in Event.wait(self, timeout)
    605 signaled = self._flag
    606 if not signaled:
--> 607     signaled = self._cond.wait(timeout)
    608 return signaled

File /scratch/micromamba/envs/alembic/lib/python3.10/threading.py:320, in Condition.wait(self, timeout)
    318 try:    # restore state no matter what (e.g., KeyboardInterrupt)
    319     if timeout is None:
--> 320         waiter.acquire()
    321         gotit = True
    322     else:

KeyboardInterrupt: 

It seems to me that the code is waiting indefinitely?
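The bottom frames of the traceback show `Pool.map` calling `.get()` with `timeout=None`, which blocks until every worker reports back. The sketch below illustrates that mechanism in isolation: the same asynchronous result object accepts a timeout and raises `multiprocessing.TimeoutError` instead of waiting forever. It is an illustration only, under these assumptions: it uses the standard library's `ThreadPool` (same API as `Pool`) to stay self-contained, `slow_square` and `map_with_timeout` are hypothetical names, and matminer calls `p.map` internally, so no timeout can be passed to it this way.

```python
import time
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool


def slow_square(x, delay=0.5):
    """Hypothetical stand-in for a featurizer call that takes a while."""
    time.sleep(delay)
    return x * x


def map_with_timeout(func, items, timeout, workers=2):
    """Return the mapped results, or None if they are not ready in `timeout` seconds."""
    with ThreadPool(workers) as pool:
        pending = pool.map_async(func, items)
        try:
            # pool.map(func, items) is map_async(...).get() with no timeout,
            # which is the call that waits indefinitely in the traceback.
            return pending.get(timeout=timeout)
        except TimeoutError:
            return None
```

A `None` return here corresponds to the situation in the log: the workers never signalled completion, so the parent kept waiting.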

ml-evs commented 8 months ago

So it gets stuck in the parallel internals of matminer (maybe -- depends on your luck when you actually interrupt). I would rerun with n_jobs=1 as suggested above and see if you get the same problem. Otherwise you can also try changing the featurizer mode between multi and single, which will change the parallelism to be over structures rather than features.

e.g. add to the snippet above:

data.featurizer.featurizer_mode = "single"

This will either "just work" or it will give us better debug info on which featurizer is causing it to hang.
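The two suggestions so far (`n_jobs=1` and `featurizer_mode = "single"`) can be wrapped in one small helper so they are easy to toggle. This is a minimal sketch, assuming `data` is a `MODData` instance exposing `data.featurizer` and `data.featurize(n_jobs=...)` as seen in the traceback; the helper name `featurize_safely` and its defaults are hypothetical and not part of modnet.

```python
def featurize_safely(data, n_jobs=1, mode=None):
    """Featurize `data` with the workarounds suggested in this thread.

    `data` is assumed to be a modnet MODData instance. This helper is a
    hypothetical convenience wrapper, not a modnet function.
    """
    if mode is not None:
        # Set the mode before featurizing so it applies to this run.
        data.featurizer.featurizer_mode = mode
    if n_jobs is None:
        # Leave the number of jobs at the library default.
        data.featurize()
    else:
        data.featurize(n_jobs=n_jobs)
    return data
```

In the function from the first post, `featurize_safely(data)` would replace `data.featurize()`; passing `n_jobs=None, mode="single"` tries the second suggestion on its own.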

shivang-22 commented 8 months ago

Okay, so both n_jobs=1 and data.featurizer.featurizer_mode = "single" work, but the speed is significantly slower than the default. The latter still (understandably) does better, but is there a way to make this method faster?

ml-evs commented 8 months ago

The speed is just a limitation of matminer unfortunately. Glad it is working now though. You can see https://github.com/hackingmaterials/matminer/issues/902 for the full description of the problem of parallelism in matminer.
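The two calls in the first post differ only in `lr`, so one way to limit the cost of slow featurization is to featurize once per kernel session and reuse the result. This was not suggested in the thread; it is a minimal sketch under that assumption, and `featurize_once`, `key`, and `build` are hypothetical names.

```python
_cache = {}  # hypothetical in-memory cache; lives as long as the kernel


def featurize_once(key, build):
    """Return the cached result for `key`, calling `build()` only on a miss."""
    if key not in _cache:
        _cache[key] = build()
    return _cache[key]
```

Here `build` would be a function that constructs the `MODData` object and featurizes it, and `key` something like `(target, tuple(extra_feat))`. Because the original function also writes the extra feature columns into `data.df_featurized`, a cached object should only be reused with the same target and extra features.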