Correct character encoding issues in SED-ML files

jonrkarr commented 2 years ago

Similar to #100, there are also weirdly encoded characters in the SED-ML files which would ideally be fixed.

Example: https://github.com/sys-bio/temp-biomodels/blob/main/final/BIOMD0000000585/PC_100.sedml#L479 (Î³ rather than γ)

Todo

For each SED-ML file,

Open the SED-ML file
Identify such odd characters in the file. #100 has notes about useful Python packages for correcting this.
If the intended character can be discerned, replace it with the intended UTF-8 character. If not, replace the character with a ?
Save the modified SED-ML
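A minimal sketch of the repair step, assuming the odd characters are UTF-8 bytes that were decoded as Latin-1 (which is what the Î³ example looks like). The names `fix_mojibake` and `fix_sedml_file` are hypothetical and not part of this repository. Sequences that cannot be decoded are left unchanged here rather than replaced with a ?, and Windows-1252 variants of the same problem are not handled.

```python
import re

# Latin-1 characters that look like a UTF-8 lead byte followed by the
# right number of continuation bytes (2-, 3-, and 4-byte sequences).
_SUSPECT = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"
)


def fix_mojibake(text):
    """Re-decode suspect character runs as UTF-8 (hypothetical helper)."""

    def repair(match):
        chunk = match.group(0)
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return chunk.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8 after all, so leave the text untouched.
            return chunk

    return _SUSPECT.sub(repair, text)


def fix_sedml_file(path):
    """Open a SED-ML file, repair it, and save it in place (hypothetical)."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    fixed = fix_mojibake(original)
    if fixed != original:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(fixed)
    return fixed != original
```

Legitimate accented Latin-1 text can, in rare cases, match the same pattern, so the changes should be reviewed before committing them.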

This should be done by

[ ] Write a Python script, using any of the fix_*.py scripts in the root of this repository as a template
[ ] Import this new script into fix-entries.py.
[ ] Incorporate validation for this into validate_sedml_file
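One possible shape for the validation step, as a sketch only: report every place where the text looks like UTF-8 that was decoded as Latin-1. The helper name `find_mojibake` is hypothetical, and how it would be called from validate_sedml_file is an assumption, since that function's signature is not shown here.

```python
import re

# Same heuristic as a repair script would use: a Latin-1 character in the
# UTF-8 lead-byte range followed by the matching number of continuation bytes.
_MOJIBAKE = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"
)


def find_mojibake(text):
    """Return (line number, suspect text) pairs (hypothetical helper)."""
    hits = []
    # Split on "\n" only: str.splitlines() also breaks on U+0085, which can
    # itself be part of a suspect sequence.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in _MOJIBAKE.finditer(line):
            chunk = match.group(0)
            try:
                chunk.encode("latin-1").decode("utf-8")
            except UnicodeDecodeError:
                continue  # not a valid UTF-8 sequence, so probably genuine text
            hits.append((line_no, chunk))
    return hits
```

A validator could treat a non-empty result as a warning rather than an error, since the check is a heuristic.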

luciansmith commented 2 years ago

I tracked this down to a problem on our end, not Biomodels': the 'Revised output names.csv' file I downloaded from Google Sheets is apparently encoded in UTF-8, but the CSV reader in Python doesn't know this by default. (Excel also misreads the file when opening it, so Python isn't alone in this regard.)
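For illustration, a sketch of reading such a file with the encoding stated explicitly. Python's `open()` falls back to a platform-dependent default encoding unless one is given, and `csv.reader` simply consumes whatever text the file object yields. The function name `read_output_names` is hypothetical and is not the actual code in this repository.

```python
import csv


def read_output_names(path):
    """Read a UTF-8 encoded CSV file into a list of rows (hypothetical helper)."""
    # encoding="utf-8" avoids relying on the platform default;
    # newline="" is what the csv module documentation recommends.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

If the export ever starts with a byte order mark, `encoding="utf-8-sig"` would strip it; whether Google Sheets adds one is not something this thread confirms.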

Just like #100, the problems were entirely on our/Python's end, and don't represent a problem with existing models or files in Biomodels. Obviously it's possible to create these problems (as we have shown) and it might be good to write something that tests this, but I'm not sure what that would look like. Given that the actual problems are fixed, do you think we should close this issue?

jonrkarr commented 2 years ago

Sounds like this can be closed. I'll create another issue for the more general problem.

sys-bio / temp-biomodels

Correct character encoding issues in SED-ML files #104
