i2bc / SURFMAP

[Bug]: 'utf-8' codec UnicodeDecodeError - Filename length #24

Open nchenche opened 1 week ago

nchenche commented 1 week ago

Operating System

Unix (e.g., Ubuntu 20.04)

Version

2.2.0

Python Version (optional)

3.10.12

Python Virtual Environment

venv/virtualenv/other

Execution Environment

Local environment after installation of all external dependencies

Bug Description

Running the surfmap script on an input file whose name exceeds 50 characters raises a UnicodeDecodeError. MSMS appears to mishandle long filenames: it writes invalid bytes into the header of its output .vert file, which surfmap then fails to decode as UTF-8.

Steps to Reproduce

  1. Prepare an input .pdb file with a filename longer than 50 characters, e.g., a_very_long_filename_with_more_than_50_characters.pdb.
  2. Run the surfmap script with this input file: surfmap -pdb a_very_long_filename_with_more_than_50_characters.pdb -tomap stickiness
  3. Observe the UnicodeDecodeError in the console output.

Relevant Log Output

...
SURFACE MAPPING OF THE STICKINESS PROPERTY
Step 1: computing a shell around the protein surface
Traceback (most recent call last):
  File "/home/nchenche/.venvs/surfmap/bin/surfmap", line 33, in <module>
    sys.exit(load_entry_point('surfmap', 'console_scripts', 'surfmap')())
  File "/home/nchenche/projects/SURFMAP/surfmap/bin/surfmap.py", line 63, in main
    surfmap_local(params=params)
  File "/home/nchenche/projects/SURFMAP/surfmap/bin/surfmap.py", line 18, in surfmap_local
    surfmap_from_pdb(params=params)
  File "/home/nchenche/projects/SURFMAP/surfmap/lib/core.py", line 171, in surfmap_from_pdb
    csv_coords, shell = run_compute_shell(pdb_filename=params.pdbarg, out_dir=outdir_shell, extra_radius=extra_radius)
  File "/home/nchenche/projects/SURFMAP/surfmap/tools/compute_shell.py", line 117, in run
    vert2csv(vertfile=outfile_vert, outfile=outfile_csv, skiplines=list(range(3)))
  File "/home/nchenche/projects/SURFMAP/surfmap/tools/compute_shell.py", line 72, in vert2csv
    for i, line in enumerate(_readfile):
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 177: invalid continuation byte
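The failure mode can be reproduced in isolation, without MSMS. The sketch below is only an illustration: the byte string is a made-up header whose trailing bytes are chosen to resemble the garbage seen in this report (the exact bytes MSMS wrote are an assumption), and `try_decode` is a hypothetical helper, not part of SURFMAP.

```python
# Made-up header line: 0xcd opens a two-byte UTF-8 sequence, but the next
# byte (0xcc) is not a continuation byte, so strict decoding must fail.
header = b"# MSMS solvent excluded surface vertices for some_long_name\xcd\xcc\x8c@r\n"


def try_decode(raw: bytes) -> str:
    """Return the decoded text, or a short description of the decode failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"{exc.reason} at position {exc.start} (byte {raw[exc.start]:#x})"


message = try_decode(header)
print(message)
```

The reported reason matches the one in the traceback, which suggests the header simply carries non-UTF-8 bytes after the filename rather than a differently encoded but valid name.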

Additional context (optional)

This bug surfaces in the vert2csv function: the ".vert" file written by MSMS contains invalid bytes in its header line (a_very_long_filename_with_more_than_50_characters�̌@r):

nchenche@nchenche-laptop:~/surfmap_tests/issue_23/output_SURFMAP_a_very_long_filename_with_more_than_50_characters_stickiness/shells$ head -n5 /home/nchenche/surfmap_tests/issue_23/output_SURFMAP_a_very_long_filename_with_more_than_50_characters_stickiness/shells/a_very_long_filename_with_more_than_50_characters.vert
# MSMS solvent excluded surface vertices for output_SURFMAP_a_very_long_filename_with_more_than_50_characters_stickiness/shells/a_very_long_filename_with_more_than_50_characters�̌@r
#vertex #sphere density probe_r
  62174    9364  1.00  1.50
  -57.426   -23.314    -1.586    -0.653     0.702    -0.284       0    5034  2 
  -56.932   -22.124    -2.257    -0.982    -0.092     0.163       0    5009  2 
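Since the invalid bytes sit only in the first header line, which is skipped anyway, one possible workaround is to read the file with a tolerant error handler. The following is a minimal sketch under that assumption; `read_vert_rows` and its `n_header` parameter are hypothetical names, not the actual vert2csv implementation, and the sample header bytes are invented for the demo.

```python
import tempfile
from pathlib import Path


def read_vert_rows(vertfile, n_header=3):
    """Yield whitespace-split data rows, tolerating undecodable header bytes."""
    # errors="replace" turns each invalid byte into U+FFFD instead of raising.
    with open(vertfile, "r", encoding="utf-8", errors="replace") as handle:
        for i, line in enumerate(handle):
            if i < n_header:
                continue  # skip the three MSMS header lines
            fields = line.split()
            if fields:
                yield fields


# Demo on a small file that mimics the layout shown above (header bytes invented).
sample = (
    b"# MSMS solvent excluded surface vertices for example_name\xcd\xcc\x8c@r\n"
    b"#vertex #sphere density probe_r\n"
    b"  62174    9364  1.00  1.50\n"
    b"  -57.426   -23.314    -1.586    -0.653     0.702    -0.284       0    5034  2 \n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.vert"
    path.write_bytes(sample)
    rows = list(read_vert_rows(path))

print(rows[0][:3])
```

This only hides the symptom. If the stray bytes ever include a newline or carriage return, the header would span more lines than expected and the fixed skip count would be wrong, so passing a shorter filename to MSMS may be the more robust fix.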
