Open ctrl-z-9000-times opened 1 year ago
Not a CI issue, but agree it would be nice to standardize.
To anybody reading this: pull requests to the relevant repositories at https://github.com/modeldbrepository are welcome.
Note that this isn't limited to NEURON models.
Hi, I opened a whole bunch of PRs...
Here is a list of all of the models that I opened PRs against: 190559 253369 147460 50207 106551 146376 145836 169208 116094 127021 139656 150288 108458 126637 267106 206244 186977 184732 156120 267189 261714 140789 149739 155705 157157 2730 185513 144520 251493 266578 121060 108459 150551 112685 150024 144523 3800 149000 148253 144482 64229 229276 123897 229750 98017 - this last PR is non-trivial; see the comments on it.
I opened an issue against the following repo: 21329
I could not determine the text encoding for the following models: 185121 7907
Most of these changes are very simple, trivial edits to comments, but take your time reviewing them! I was able to do this quickly because I wrote a program to do it for me, but reviewing and approving these changes is by necessity a manual and time-consuming process.
And here is the messy program that I wrote to do it, in case anyone else encounters legacy character encodings in the future:
import requests
from pathlib import Path
import subprocess
import json
import os
import sys
import chardet

commit_msg = "Convert character encoding to UTF-8"
user = "YOUR GITHUB USERNAME GOES HERE"
token = "YOUR GITHUB TOKEN GOES HERE"

# Make a cache dir to hold all of the temp files.
cache_dir = Path.cwd().joinpath('tmp_modeldb_repos')
cache_dir.mkdir(exist_ok=True)
os.chdir(cache_dir)

repo_list = []
if True:
    # Toggle: get a list of all github repos owned by user "ModelDBRepository".
    page = 1
    per_page = 100
    while True:
        print(f"Talking to github, requesting info for repos "
              f"{(page - 1) * per_page + 1} - {page * per_page} ... ",
              flush=True, end='')
        response = requests.get(
            'https://api.github.com/users/ModelDBRepository/repos',
            params={'page': str(page), 'per_page': str(per_page)},
            auth=(user, token))
        print('status code:', response.status_code)
        if response.status_code != 200:
            print(response.text)
            sys.exit()
        page_data = json.loads(response.text)
        repo_list.extend(x['name'] for x in page_data)
        page += 1
        if not page_data:
            break
elif True:
    # Toggle: use the models that are already downloaded.
    repo_list = [x.name for x in cache_dir.iterdir()]
else:
    # Toggle: debug a single repository.
    repo_list = ['98017']

print("REPO LIST:", ', '.join(str(x) for x in repo_list), '\n')
print("NUM REPOS:", len(repo_list), '\n')

# Check each repo for non-UTF-8 text files.
changed_repos = []
failed_repos = []
repo_list = [Path(str(x)) for x in repo_list]
for repo in repo_list:
    # Download the git repository.
    if not repo.exists():
        subprocess.run(
            ['git', 'clone', f'https://github.com/ModelDBRepository/{repo}.git'],
            check=True,
            capture_output=True)
    # Find all of the text files.
    hoc = list(repo.glob("**/*.hoc"))
    ses = list(repo.glob("**/*.ses"))
    mod = list(repo.glob("**/*.mod"))
    inc = list(repo.glob("**/*.inc"))
    # Fix the character encoding.
    any_fixed = False
    any_failed = False
    for file in (hoc + ses + mod + inc):
        # Ignore hidden files.
        if file.name.startswith('.'):
            continue
        with file.open('rb') as f:
            raw = f.read()
        # Check if python can already decode the data as UTF-8.
        try:
            raw.decode()
            continue
        except UnicodeDecodeError:
            pass
        # Try to figure out what encoding this data is using.
        detected_encoding = chardet.detect(raw)
        if detected_encoding['encoding'] in {'utf-8', 'ascii'}:
            continue
        if detected_encoding['confidence'] < .5:
            continue
        # Decode using the detected legacy encoding.
        try:
            utf8 = raw.decode(detected_encoding['encoding'].lower())
        except Exception as err:
            failed_repos.append(f"{file}: {err}")
            any_failed = True
            break
        # Rewrite the file as UTF-8.
        with file.open('wt', encoding='utf-8') as f:
            f.write(utf8)
        print('Fixed', detected_encoding['encoding'], file)
        any_fixed = True
    if any_failed:
        continue
    if any_fixed:
        changed_repos.append(repo)
        subprocess.run(['git', 'commit', '-a', '-m', commit_msg],
                       cwd=repo,
                       capture_output=True,
                       check=True)
    else:
        # Remove unchanged repositories to save disk space.
        # subprocess.run(['rm', '-rf', str(repo)], check=True)
        pass

print()
print("FIXED MODELS:")
print('\n'.join(str(x.name) for x in changed_repos))
print("FAILURES:")
print('\n'.join(failed_repos))
Note that some models already have workarounds for this in the CI runs using iconv: https://github.com/neuronsimulator/nrn-modeldb-ci/blob/5c0089252ba592f892f17577fed887e42474ab75/modeldb/modeldb-run.yaml#L1294-L1295
Fixing the problem at source would be much better.
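To illustrate what fixing it at source involves, here is a minimal sketch that does in Python what the CI's iconv workaround does at run time. The function name and the assumed ISO 8859-1 source encoding are illustrative; a real run would detect the encoding per file, as the script above does:

```python
from pathlib import Path

def convert_to_utf8(path, source_encoding="latin-1"):
    """Re-encode a text file to UTF-8 in place.

    Roughly equivalent to: iconv -f ISO-8859-1 -t UTF-8
    (assuming the file really is ISO 8859-1 / Latin-1).
    Returns True if the file was rewritten, False if it
    was already valid UTF-8.
    """
    path = Path(path)
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # Already valid UTF-8, nothing to do.
    except UnicodeDecodeError:
        pass
    # Decode with the legacy encoding, then write back as UTF-8.
    text = raw.decode(source_encoding)
    path.write_text(text, encoding="utf-8")
    return True
```

Running this once per offending file (and committing the result) would make the iconv step in the CI configuration unnecessary.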
Many of the models on ModelDB contain files that are not UTF-8. They use outdated text encodings such as ISO 8859-1 (also known as Latin-1) or UCS-2.
I know this is not anyone's priority, but it would be nice if all of the files used UTF-8 encoding. C++ does not care about this kind of thing, but Python does, and it raises errors when you open these files. It's possible to work around this issue in Python by either reading raw bytes objects or by passing the legacy encoding explicitly when decoding.
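For example, a sketch of the failure and both workarounds (the byte string is illustrative):

```python
# 'Schön' encoded as ISO 8859-1; the 0xF6 byte is not valid UTF-8.
data = b"Sch\xf6n"

# Default decoding assumes UTF-8 and fails:
try:
    data.decode()
except UnicodeDecodeError as err:
    print("UTF-8 decode failed:", err)

# Workaround 1: keep working with the raw bytes object.
print(len(data), "raw bytes")

# Workaround 2: pass the legacy encoding explicitly.
print(data.decode("latin-1"))
```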
I wrote this quick script to find all of the files that use a legacy encoding. It could be modified to automatically update the files to use UTF-8.
And here is the list of non-UTF-8 files: