Running the script on test data provided in the repo returns empty csv #7

Closed milicazmarkovic closed 1 month ago

milicazmarkovic commented 1 month ago


I tried replicating the results from the paper, but ended up with an empty csv file as output. I suspect a mismatch in package versions could be causing this, since I have not modified the code at all. Could you specify the versions of the required packages and the Python version in environment.yml? Also, if you know of any other issues that could cause this, let me know!

hesther commented 1 month ago

I can look into the environment later this week (I don't have access to the old environments anymore), but could you meanwhile provide the terminal output of the script? I wonder whether, e.g., all reactions were lost along the pipeline.

hesther commented 1 month ago

This is how it should look: [screenshot: Untitled]

hesther commented 1 month ago

@milicazmarkovic And this is the exact environment I used just now (on my laptop, not the one from when we wrote the paper), but I am not aware of a package mismatch that could cause the output to be empty. If you send me a list of the package versions you used, I will try to reproduce the error and track down the package that causes it; then we can pin that version in the environment.

hesther commented 1 month ago

Another question: which script exactly did you run? correct.py on the uspto_50k.csv?

milicazmarkovic commented 1 month ago

Ok, so I tried multiple versions of rdkit and rdchiral, but I haven't touched the rest of the packages. I ran the following command:

python correct.py --path data/uspto_50k --reaction_column rxn_smiles --name template --nproc 20 --data_format csv

And here is the output that I am getting:

`Reading file...
Preprocessing reactants...
[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 11.6s
[Parallel(n_jobs=20)]: Done 308 tasks | elapsed: 11.7s
[Parallel(n_jobs=20)]: Done 14481 tasks | elapsed: 13.7s
[Parallel(n_jobs=20)]: Done 48545 tasks | elapsed: 17.1s
[Parallel(n_jobs=20)]: Done 49977 out of 50016 | elapsed: 17.2s remaining: 0.0s
[Parallel(n_jobs=20)]: Done 50016 out of 50016 | elapsed: 17.3s finished
Extracting templates (Radius 1 with special groups)...
100%|██████████████████████████████████████████████████████████| 50016/50016 [00:03<00:00, 15084.85it/s]
Extracting templates (Radius 1 without special groups)...
100%|██████████████████████████████████████████████████████████| 50016/50016 [00:03<00:00, 15408.17it/s]
Extracting templates (Radius 0 without special groups)...
100%|██████████████████████████████████████████████████████████| 50016/50016 [00:02<00:00, 17921.67it/s]
Hierarchically correcting templates...
...Unique templates in column template_r0 : 0
...Unique templates in column template_r1 : 0
...Correcting templates in column template_r1
[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
...Unique corrected templates in column template_r1 : 0

...Unique templates in column template_r1 : 0
...Unique templates in column template : 0
...Correcting templates in column template
[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
...Unique corrected templates in column template : 0

Wrote dataframe to data/uspto_50k_corrected.csv`

milicazmarkovic commented 1 month ago

Additionally, I tried running scripts/01 and 02 steps separately, but 02 failed because output of 01 was empty...

Here is my current package list. I assumed this was a compatibility/dependency issue because I have not changed the code and did not get any meaningful errors.

Python is 3.8.19 and I don't see any package compatibility issues, but output is invariably empty regardless of input.

hesther commented 1 month ago

@milicazmarkovic Ok, so your template extraction does not work (and will not, whatever input csv you use). You can see in the terminal output that each extraction pass takes only about 3 seconds and yields 0 templates. There is probably an error when rdchiral is called that is silently swallowed by an exception handler. From your list of packages, it seems you did not install rdchiral_cpp, which the code requires. The regular rdchiral does not take arguments for the radius or special groups. Did you also try this with rdchiral_cpp?
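One way to check this without running the whole pipeline is to inspect whether the installed extraction function accepts the extra keyword arguments at all. The following is only a minimal sketch: `accepts_keywords`, `plain_extract` and `extended_extract` are hypothetical helpers and stand-ins written for illustration, and the keyword names `radius` and `no_special_groups` are assumptions, not confirmed names from either rdchiral package.

```python
import inspect


def accepts_keywords(func, names):
    """Return True/False if `func` does/does not accept every keyword in `names`.

    Returns None when the signature cannot be inspected (this can happen
    for some compiled functions), because nothing can be concluded then.
    """
    try:
        params = inspect.signature(func).parameters
    except (TypeError, ValueError):
        return None
    # A **kwargs parameter accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return all(
        name in params and params[name].kind in keyword_kinds
        for name in names
    )


# Hypothetical stand-ins for the two flavours of the extraction function.
def plain_extract(reaction):
    return None


def extended_extract(reaction, no_special_groups=False, radius=1):
    return None


wanted = ["radius", "no_special_groups"]  # assumed keyword names
print(accepts_keywords(plain_extract, wanted))
print(accepts_keywords(extended_extract, wanted))
```

To apply this to a real installation, the stand-ins would be replaced by the extraction function actually imported by the script.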

hesther commented 1 month ago

If you want to, you could replace https://github.com/hesther/templatecorr/blob/7095b4b8fecd6ea06f8c603b8e8641518b37d931/templatecorr/extract_templates.py#L41 (Line 41) in extract_templates.py with

except Exception as e:

and print `e` inside that handler, which will probably tell you that rdchiral does not take the arguments you provided because you installed rdchiral instead of rdchiral_cpp
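To illustrate why the failure was invisible, here is a small self-contained sketch, assuming the original handler discards the exception. `extract_stub` is a hypothetical stand-in for an extractor that lacks the extra keyword arguments; it is not the rdchiral API, and the keyword names are assumptions.

```python
def extract_stub(reaction):
    """Hypothetical extractor that does not know the extra keyword arguments."""
    return {"reaction_smarts": "placeholder"}


def extract_silently(reaction):
    # Problematic pattern: every failure is turned into None without a trace.
    try:
        return extract_stub(reaction, radius=1, no_special_groups=False)
    except Exception:
        return None


def extract_verbosely(reaction):
    # Same call, but the exception is bound to `e` and reported.
    try:
        return extract_stub(reaction, radius=1, no_special_groups=False)
    except Exception as e:
        print(f"Template extraction failed: {type(e).__name__}: {e}")
        return None


reaction = {"reactants": "CCO", "products": "CC=O", "_id": 0}
silent_result = extract_silently(reaction)    # None, and no hint why
verbose_result = extract_verbosely(reaction)  # None, but the TypeError is printed
```

Both variants return `None` for every reaction, which is consistent with a run that finishes quickly and reports 0 unique templates; only the second one says why.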

hesther commented 1 month ago

I will fix this in a PR soon so it gives a meaningful error, thank you very much for finding that bug!

milicazmarkovic commented 1 month ago

I actually tried running it with rdchiral_cpp first and got the empty data frame, then switched to rdchiral and ended up with the same result, but this is super helpful! I just realized that this has to do with the installation of this package. I have weird issues with some conda packages due to the M1 Mac chip... I think I know how to fix this on my end now, and it has nothing to do with your code. :)

Thanks for a quick response and confirmation that the code works as intended!

milicazmarkovic commented 1 month ago

I created a small pull request that adds an option to run the script using Docker, which resolved this issue for me. Hopefully it helps other people with a similar problem :)

hesther commented 1 month ago

Thanks! I merged the change with the Docker container, and also added a more meaningful error message for when the wrong rdchiral version is used. Thanks again for finding this bug!