sys-bio / temp-biomodels

Temporary place for coordination of updating existing Biomodels
Creative Commons Zero v1.0 Universal

Correct character encoding issues in SED-ML files #104

Closed jonrkarr closed 2 years ago

jonrkarr commented 2 years ago

Similar to #100, there are also weirdly encoded characters in the SED-ML files which would ideally be fixed.

Example: https://github.com/sys-bio/temp-biomodels/blob/main/final/BIOMD0000000585/PC_100.sedml#L479 (Î³ rather than γ)

Todo

For each SED-ML file,

  1. Open the SED-ML file
  2. Identify such odd characters in the file. #100 has notes about useful Python packages for correcting this.
  3. If the intended character can be discerned, replace it with the intended UTF-8 character. If not, replace the character with a ?
  4. Save the modified SED-ML
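
The steps above could be sketched roughly as follows. This is a minimal, standard-library-only sketch, not the packages noted in #100. It assumes the damage comes from UTF-8 bytes having been decoded as cp1252, and the names `repair_run`, `repair_text`, and `repair_file` are hypothetical. Runs that mix damaged and legitimate non-ASCII characters are left untouched, so a manual pass may still be needed.

```python
import re
from pathlib import Path

# Mojibake can only hide in runs of non-ASCII characters.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def repair_run(run: str) -> str:
    """Undo a UTF-8-read-as-cp1252 mix-up for one run, if it is one."""
    try:
        # Assumption: the original bytes were UTF-8 but were decoded as cp1252.
        return run.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The run does not round-trip, so treat it as legitimate text.
        return run


def repair_text(text: str) -> str:
    """Repair every recoverable run; mark unrecoverable characters with '?'."""
    repaired = NON_ASCII_RUN.sub(lambda m: repair_run(m.group()), text)
    # U+FFFD means the original character was already lost (step 3).
    return repaired.replace("\ufffd", "?")


def repair_file(path) -> bool:
    """Open, repair, and save one file. Returns True if it was changed."""
    p = Path(path)
    original = p.read_text(encoding="utf-8")
    repaired = repair_text(original)
    if repaired != original:
        p.write_text(repaired, encoding="utf-8")
    return repaired != original
```

Calling the hypothetical `repair_file` on each `.sedml` file would cover steps 1 to 4. Reviewing the resulting diff before committing is advisable, because the round-trip heuristic can in principle misfire on unusual but legitimate text.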

This should be done by

luciansmith commented 2 years ago

I tracked this down to our problem, not Biomodels: the 'Revised output names.csv' file I downloaded from Google Sheets is apparently encoded in UTF-8, but the CSV reader in Python doesn't know this by default. (Opening the file in Excel also confuses it, so Python isn't alone in this regard.)
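
To illustrate the cause described above: Python's `open()` falls back to a locale-dependent encoding when none is given, so a UTF-8 CSV can be misread on systems where that default is something like cp1252. The sketch below reproduces the effect with a throwaway file. The file name `names.csv` and its contents are hypothetical stand-ins, and cp1252 is an assumed default, since the thread does not say which platform was used.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a UTF-8 CSV exported from a spreadsheet.
    path = Path(tmp) / "names.csv"
    path.write_bytes("id,name\n1,γ\n".encode("utf-8"))

    # Reading with the wrong encoding (assumed here: cp1252) garbles the text.
    with open(path, newline="", encoding="cp1252") as f:
        wrong = list(csv.reader(f))

    # Passing the encoding explicitly reads the characters as intended.
    with open(path, newline="", encoding="utf-8") as f:
        right = list(csv.reader(f))

print(wrong[1][1])  # two garbled characters
print(right[1][1])  # the intended single character
```

Passing `encoding="utf-8"` explicitly makes the result independent of the machine's locale settings.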

Just like #100, the problems were entirely on our/Python's end, and don't represent a problem with existing models or files in Biomodels. Obviously it's possible to create these problems (as we have shown) and it might be good to write something that tests this, but I'm not sure what that would look like. Given that the actual problems are fixed, do you think we should close this issue?
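
One possible shape for such a test, offered only as a sketch: scan each file for runs of non-ASCII characters that can be re-encoded as cp1252 and then decoded as valid UTF-8, which is the signature of the mix-up described in this thread. The function names `looks_like_mojibake` and `find_suspects` are hypothetical, and the heuristic only covers this one kind of mis-decoding.

```python
import re

NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def looks_like_mojibake(run: str) -> bool:
    """True if the run is valid UTF-8 once re-encoded as cp1252."""
    try:
        run.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return True


def find_suspects(text: str):
    """Return (line_number, run) pairs for runs that look mis-decoded."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in NON_ASCII_RUN.finditer(line):
            run = match.group()
            # U+FFFD is also flagged: it marks a character that was lost.
            if looks_like_mojibake(run) or "\ufffd" in run:
                hits.append((lineno, run))
    return hits
```

A test suite could then read every generated file as UTF-8 and assert that `find_suspects` returns an empty list for each one.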

jonrkarr commented 2 years ago

Sounds like this can be closed. I'll create another issue for the more general problem.