moritzobenauer / ProjectRaccoon

Automated construction of atomistic and coarse-grained models in the PDB format for polymer peptide conjugates.
GNU General Public License v3.0
5 stars 1 forks source link

JOSS review: sanitising the sequence file #11

Closed lorenzo-rovigatti closed 6 months ago

lorenzo-rovigatti commented 6 months ago

Hi there, here is another issue linked to the JOSS review!

I have started testing the software, and it works great on the example seq.txt.

However, trying to edit it to make things a bit different I noticed that the parsing is not very robust: if the syntax is not respected then the code exits ungracefully, without even specifying the line that is the source of the mistake. Some examples:

  1. ACE:AU:0:1 leads to
Traceback (most recent call last):
  File "/home/lorenzo/software/anaconda3/envs/raccoon/bin/project_raccoon", line 33, in <module>
    sys.exit(load_entry_point('project-raccoon', 'console_scripts', 'project_raccoon')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/__main__.py", line 42, in main
    start_racoon(
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/src/ui/user_interface.py", line 59, in start_racoon
    sequence = generate_sequence(monomers=monomers, fpath=sequence_file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/src/functions/standard.py", line 44, in generate_sequence
    monomers.index({"name": res, "resolution": resolution_lookup})
                                               ^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'resolution_lookup' where it is not associated with a value
  1. ACE:AU:1 generates
Traceback (most recent call last):
  File "/home/lorenzo/software/anaconda3/envs/raccoon/bin/project_raccoon", line 33, in <module>
    sys.exit(load_entry_point('project-raccoon', 'console_scripts', 'project_raccoon')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/__main__.py", line 42, in main
    start_racoon(
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/src/ui/user_interface.py", line 59, in start_racoon
    sequence = generate_sequence(monomers=monomers, fpath=sequence_file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/src/functions/standard.py", line 35, in generate_sequence
    res, resolution, inv, rep = line.split(":")
    ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 4, got 3)
  1. Unsupported monomers also generate uncaught exceptions. For instance here is what happens with ACe:AA:0:10 (note the lowercase e):
Traceback (most recent call last):
  File "/home/lorenzo/software/anaconda3/envs/raccoon/bin/project_raccoon", line 33, in <module>
    sys.exit(load_entry_point('project-raccoon', 'console_scripts', 'project_raccoon')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/__main__.py", line 42, in main
    start_racoon(
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/src/ui/user_interface.py", line 59, in start_racoon
    sequence = generate_sequence(monomers=monomers, fpath=sequence_file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/src/functions/standard.py", line 44, in generate_sequence
    monomers.index({"name": res, "resolution": resolution_lookup})
  File "/home/lorenzo/GDRIVE/REVIEWS/JOSS/ProjectRaccoon/project_raccoon/src/data/monomers.py", line 478, in index
    f"{monomer.name} in {monomer.resolution} monomer.resolution not in list"
       ^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'name'
  1. Another issue is that nonsensical values are silently accepted (and then possibly ignored altogether). For instance, it is possible to use a negative value for the number of repeats of a monomer, and from the looks of it the code will silently skip that monomer while building the PDB. I think it would be better to spit out warnings.

In general, it would greatly help new users to have a parser that can help them understand what (and where) they are doing wrong.

kainszs commented 6 months ago

Good morning,

These are valid points. However, it may be a few days before I can take care of it as I am moving this weekend. I see from your GIthub that you also have expertise in Python. Can you recommend a library for the parser, or is it recommended writing it yourself? Are self-written exceptions sufficient, or do you know a more elegant way to catch the errors and output error messages? How would you approach this?

lorenzo-rovigatti commented 6 months ago

I have made a PR to show you a quick&dirty way of doing it. Feel free to either accept it or use it as inspiration. The idea is to generate meaningful errors containing information that the user can leverage to fix their sequence file. It can be done in a cleaner way, but I think this should suffice in your case.

kainszs commented 6 months ago

Thank you very much for your contribution. I think we'll take it for now and as soon as we have more time, we'll look at it again. Because in the long term we want to extend the functionality of the software.