Open ctrl-z-9000-times opened 1 year ago
Not a CI issue, but agree it would be nice to standardize.
To anybody reading this: pull requests to the relevant repositories at https://github.com/modeldbrepository are welcome.
Note that this isn't limited to NEURON models.
Hi, I opened a whole bunch of PRs...
Here is a list of all of the models that I opened PRs against: 190559 253369 147460 50207 106551 146376 145836 169208 116094 127021 139656 150288 108458 126637 267106 206244 186977 184732 156120 267189 261714 140789 149739 155705 157157 2730 185513 144520 251493 266578 121060 108459 150551 112685 150024 144523 3800 149000 148253 144482 64229 229276 123897 229750 98017 - this last PR is non-trivial; see the comments on it.
I opened an issue against the following repo: 21329
I could not determine the text encoding for the following models: 185121 7907
Most of these changes are very simple, trivial edits to comments, but take your time reviewing them! I was able to do this quickly because I wrote a program to do it for me, but reviewing and approving these changes is by necessity a manual and time-consuming process.
And here is the messy program that I wrote to do it, in case anyone else encounters legacy character encodings in the future:
import requests
from pathlib import Path
import subprocess
import json
import os
import sys
import chardet

commit_msg = "Convert character encoding to UTF-8"
user = "YOUR GITHUB USERNAME GOES HERE"
token = "YOUR GITHUB TOKEN GOES HERE"

# Make a cache dir to hold all of the temp files.
cache_dir = Path.cwd().joinpath('tmp_modeldb_repos')
cache_dir.mkdir(exist_ok=True)
os.chdir(cache_dir)

repo_list = []
if True:
    # Toggle: get a list of all github repos owned by user "ModelDBRepository".
    page = 1
    per_page = 100
    while True:
        print(f"Talking to github, requesting info for repos "
              f"{(page - 1) * per_page + 1} - {page * per_page} ... ",
              flush=True, end='')
        response = requests.get(
            'https://api.github.com/users/ModelDBRepository/repos',
            params={'page': str(page), 'per_page': str(per_page)},
            auth=(user, token))
        print('status code:', response.status_code)
        if response.status_code != 200:
            print(response.text)
            sys.exit()
        page_data = json.loads(response.text)
        repo_list.extend(x['name'] for x in page_data)
        page += 1
        if not page_data:
            break
elif True:
    # Toggle: use the models that are already downloaded.
    repo_list = [x.name for x in cache_dir.iterdir()]
else:
    # Toggle: debug a single repository.
    repo_list = ['98017']

print("REPO LIST:", ', '.join(str(x) for x in repo_list), '\n')
print("NUM REPOS:", len(repo_list), '\n')

# Check each repo for non-UTF-8 text files.
changed_repos = []
failed_repos = []
repo_list = [Path(str(x)) for x in repo_list]
for repo in repo_list:
    # Download the git repository.
    if not repo.exists():
        subprocess.run(
            ['git', 'clone', f'https://github.com/ModelDBRepository/{repo}.git'],
            check=True,
            capture_output=True)
    # Find all of the text files.
    hoc = list(repo.glob("**/*.hoc"))
    ses = list(repo.glob("**/*.ses"))
    mod = list(repo.glob("**/*.mod"))
    inc = list(repo.glob("**/*.inc"))
    # Fix the character encoding.
    any_fixed = False
    any_failed = False
    for file in (hoc + ses + mod + inc):
        # Ignore hidden files.
        if file.name.startswith('.'):
            continue
        with file.open('rb') as f:
            raw = f.read()
        # Check if python can already decode the data as UTF-8.
        try:
            raw.decode()
            continue
        except UnicodeDecodeError:
            pass
        # Try to figure out what encoding this data is using.
        detected_encoding = chardet.detect(raw)
        if detected_encoding['encoding'] in {'utf-8', 'ascii'}:
            continue
        if detected_encoding['confidence'] < .5:
            continue
        # Decode using the detected legacy encoding.
        try:
            utf8 = raw.decode(detected_encoding['encoding'].lower())
        except Exception as err:
            failed_repos.append(f"{file}: {err}")
            any_failed = True
            break
        # Rewrite the file as UTF-8.
        with file.open('wt', encoding='utf-8') as f:
            f.write(utf8)
        print('Fixed', detected_encoding['encoding'], file)
        any_fixed = True
    if any_failed:
        continue
    if any_fixed:
        changed_repos.append(repo)
        subprocess.run(['git', 'commit', '-a', '-m', commit_msg],
                       cwd=repo,
                       capture_output=True,
                       check=True)
    else:
        # Remove unchanged repositories to save disk space.
        # subprocess.run(['rm', '-rf', str(repo)], check=True)
        pass

print()
print("FIXED MODELS:")
print('\n'.join(str(x.name) for x in changed_repos))
print("FAILURES:")
print('\n'.join(failed_repos))
Note that some models already have workarounds for this in the CI runs using iconv: https://github.com/neuronsimulator/nrn-modeldb-ci/blob/5c0089252ba592f892f17577fed887e42474ab75/modeldb/modeldb-run.yaml#L1294-L1295
Fixing the problem at source would be much better.
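To illustrate what fixing it at source involves, here is a minimal sketch that does in Python what the CI's iconv workaround does at run time. The function name and the assumed ISO 8859-1 source encoding are illustrative; a real run would detect the encoding per file, as the script above does:

```python
from pathlib import Path

def convert_to_utf8(path, source_encoding="latin-1"):
    """Re-encode a text file to UTF-8 in place.

    Roughly equivalent to: iconv -f ISO-8859-1 -t UTF-8
    (assuming the file really is ISO 8859-1 / Latin-1).
    Returns True if the file was rewritten, False if it
    was already valid UTF-8.
    """
    path = Path(path)
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # Already valid UTF-8, nothing to do.
    except UnicodeDecodeError:
        pass
    # Decode with the legacy encoding, then write back as UTF-8.
    text = raw.decode(source_encoding)
    path.write_text(text, encoding="utf-8")
    return True
```

Running this once per offending file (and committing the result) would make the iconv step in the CI configuration unnecessary.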
Many of the models on ModelDB contain files that are not UTF-8. They use outdated text encodings such as ISO 8859-1 (also known as Latin-1) or UCS-2.
I know this is not anyone's priority, but it would be nice if all of the files used UTF-8 encoding. C++ does not care about this kind of thing, but Python does, and it raises errors when you open these files. It's possible to work around this issue in Python by either reading raw bytes objects or by passing the legacy encoding explicitly when decoding.
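For example, a sketch of the failure and both workarounds (the byte string is illustrative):

```python
# 'Schön' encoded as ISO 8859-1; the 0xF6 byte is not valid UTF-8.
data = b"Sch\xf6n"

# Default decoding assumes UTF-8 and fails:
try:
    data.decode()
except UnicodeDecodeError as err:
    print("UTF-8 decode failed:", err)

# Workaround 1: keep working with the raw bytes object.
print(len(data), "raw bytes")

# Workaround 2: pass the legacy encoding explicitly.
print(data.decode("latin-1"))
```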
I wrote this quick script to find all of the files that use a legacy encoding. It could be modified to automatically update the files to use UTF-8.
And here is the list of non-UTF-8 files: