soachishti / moss.py

Python client for Moss: A System for Detecting Software Similarity
MIT License

Using utf-8 encoding. Use join to build path. #15

Closed nedchu closed 5 years ago

nedchu commented 5 years ago
  1. In download_report.py, when downloading http://moss.stanford.edu/results/305606753/, I get an error. This problem is fixed by:

    f.write(str(soup.decode('utf-8', 'ignore').replace(u'\xa9', u'')))
  2. Also in download_report.py, when calling download_report(url, path), if path does not end with /, the directory name becomes a prefix of the file name. For example, calling download_report(url, "./result") produces ./resultindex.html and ./resultmatch0.html. This problem is fixed by:

    f = open(os.path.join(path, file_name), 'w')
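
The difference between plain string concatenation and os.path.join in point 2 can be seen in a small sketch. This is an illustration only, not the project's code: the variable names concatenated and joined are made up here, and the directory and file names are taken from the example above.

```python
import os

path = "./result"          # directory passed without a trailing slash
file_name = "index.html"

# Plain concatenation glues the directory name onto the file name.
concatenated = path + file_name

# os.path.join inserts the separator only when it is missing,
# so both "./result" and "./result/" lead to the same file.
joined = os.path.join(path, file_name)
joined_with_slash = os.path.join(path + "/", file_name)

print(concatenated)
print(joined)
```

With os.path.join the caller no longer has to remember whether path ends with a separator.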

soachishti commented 5 years ago

Hi, thanks for highlighting the Unicode issue.

I believe replacing characters won't be a generic solution. Have you tried changing the file open() mode to bytes, i.e. "wb" instead of "w"?

nedchu commented 5 years ago

Hi, your suggestion works!

I've changed the file open mode to bytes and now use the original encoding of the HTML, soup.original_encoding, to decode the soup and turn it into bytes.

In testing, the code can safely download http://moss.stanford.edu/results/305606753/
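
A rough sketch of the byte-mode approach, for readers following along. It is not the merged code: it uses only the standard library, the function name write_report and the sample markup are hypothetical, and the encoding that the real script would read from soup.original_encoding is passed in as a plain argument (falling back to UTF-8 when none is given).

```python
import os
import tempfile


def write_report(path, file_name, html_text, encoding=None):
    """Encode html_text and write it in binary mode; return the file path.

    `encoding` stands in for soup.original_encoding (an assumption of
    this sketch); UTF-8 is used when it is None.
    """
    os.makedirs(path, exist_ok=True)
    target = os.path.join(path, file_name)
    # "wb" writes the bytes as they are, so the platform's default
    # text encoding never gets a chance to reject a character.
    with open(target, "wb") as f:
        f.write(html_text.encode(encoding or "utf-8"))
    return target


# Hypothetical page content containing the copyright sign (U+00A9).
sample = u"<html><body>\xa9 Example</body></html>"

with tempfile.TemporaryDirectory() as tmp:
    out = write_report(os.path.join(tmp, "result"), "index.html", sample)
    with open(out, "rb") as f:
        round_trip = f.read().decode("utf-8")
```

Because the text is encoded explicitly before writing, no character has to be stripped from the report.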

soachishti commented 5 years ago

Looks good, Merged.