luciansmith closed this 2 years ago
Seems somewhat hard to fix.
This StackOverflow post suggests a few Python packages for dealing with this: https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text
If we can't fix it, just noticing it and reporting to the curator should be sufficient, then.
(There may be a subset we could recognize and fix--just fixing smart quotes and dashes would get maybe 80% of the way there.)
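As a rough illustration of the "smart quotes and dashes" subset, here is a minimal sketch. It assumes the stray bytes are Windows-1252 punctuation pasted into otherwise valid UTF-8, which is a guess and not something confirmed in this thread. The handler name `cp1252_punctuation` and the helper `repair_smart_punctuation` are hypothetical.

```python
import codecs

# Windows-1252 bytes for "smart" punctuation, mapped to the Unicode
# characters they were presumably meant to be (assumption, see above).
CP1252_PUNCTUATION = {
    0x91: "\u2018",  # left single quote
    0x92: "\u2019",  # right single quote
    0x93: "\u201c",  # left double quote
    0x94: "\u201d",  # right double quote
    0x96: "\u2013",  # en dash
    0x97: "\u2014",  # em dash
}


def _cp1252_punctuation(exc):
    """Codec error handler: repair known punctuation bytes, flag the rest."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Unknown bytes become U+FFFD so they stay visible instead of vanishing.
    fixed = "".join(CP1252_PUNCTUATION.get(b, "\ufffd") for b in bad)
    return fixed, exc.end


codecs.register_error("cp1252_punctuation", _cp1252_punctuation)


def repair_smart_punctuation(raw: bytes) -> str:
    """Decode as UTF-8, repairing stray Windows-1252 quotes and dashes."""
    return raw.decode("utf-8", errors="cp1252_punctuation")
```

Valid UTF-8 passes through untouched, so this only changes bytes that would otherwise make the decode fail.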
I think that commit 053cdc40260030a656ab3f69937989db28c33f88 distorts some of the meaning by removing the bad characters. Some of them appear to need replacing with Greek characters instead, e.g., \mu M.
I'm sure that's the case. In many (if not most) cases, I couldn't even tell what the original character was supposed to be.
If we don't know what the character is intended to be, I'd suggest replacing the bad character with "?". This would communicate the uncertainty rather than distorting the meaning.
See #104 for recommendations for how to do this.
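A minimal sketch of the "?" replacement suggested above, which is not necessarily what #104 recommends. The function name `decode_marking_unknown` is hypothetical. Note that Python's `errors="replace"` inserts U+FFFD when decoding, so the sketch swaps that for "?" afterwards.

```python
def decode_marking_unknown(raw: bytes, marker: str = "?") -> str:
    """Decode UTF-8, substituting `marker` for each undecodable sequence."""
    # errors="replace" yields U+FFFD (the Unicode replacement character)
    # wherever the bytes are not valid UTF-8.
    text = raw.decode("utf-8", errors="replace")
    # Caveat: a U+FFFD that was already present in the file is replaced too.
    return text.replace("\ufffd", marker)
```

For example, a lone Latin-1 micro sign byte in `b"10 \xb5M"` comes back as `"10 ?M"`, which keeps the uncertainty visible instead of silently producing "10 M".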
OK! This was actually simply my own fault for trying to read UTF-8 files (which the XML is expressly labeled as) in Python with a bare 'open' command. Without an encoding argument, 'open' falls back to the platform's default encoding, which is not always UTF-8, so if you're reading a UTF-8 file, you need to say so explicitly:
f = open(file, "r", encoding="utf-8")
Thanks for the pointers to the Python packages; they were helpful in making me realize the files were always in UTF-8, just like they should be.
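To illustrate the fix described above, here is a small self-contained sketch that writes a UTF-8 file and reads it back twice. The second read uses cp1252 as an assumed stand-in for a non-UTF-8 platform default; the thread does not say which default was actually in effect.

```python
import os
import tempfile

text = "10 \u00b5M"  # "10 µM", with a non-ASCII micro sign

# Write the text as UTF-8 bytes, the way the XML files are stored.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Explicit encoding: the text comes back intact.
    with open(path, "r", encoding="utf-8") as f:
        as_utf8 = f.read()
    # Wrong encoding (assumed stand-in for a locale default): the two
    # UTF-8 bytes of the micro sign are read as two separate characters.
    with open(path, "r", encoding="cp1252") as f:
        as_cp1252 = f.read()
finally:
    os.remove(path)
```

Here `as_utf8` equals the original text, while `as_cp1252` contains an extra "Â" before the micro sign. With other defaults (ASCII, for instance) the read can fail outright with a `UnicodeDecodeError` instead.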
Many SBML files have character encoding issues, particularly in author lists and abstracts, where text has been cut and pasted from one character encoding scheme into a different one. In many cases, this makes it impossible for (say) Python to read the file directly, though libsbml can read it anyway; we need to investigate what it does with the encoding. This should be tested at least, and ideally fixed.
See https://github.com/sys-bio/temp-biomodels/commit/053cdc40260030a656ab3f69937989db28c33f88 for manual fixes.
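The "tested at least" part could be sketched roughly as follows: scan a directory tree and report every file that is not valid UTF-8, with the byte offset of the first problem. The function names, the `*.xml` pattern, and the directory layout are all assumptions made for illustration, not anything taken from the repository.

```python
from pathlib import Path


def find_invalid_utf8(path):
    """Return (byte_offset, offending_bytes) for the first invalid UTF-8
    sequence in the file, or None if the whole file decodes cleanly."""
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, raw[err.start:err.end]
    return None


def report_encoding_problems(root, pattern="*.xml"):
    """Map each non-UTF-8 file under `root` to its first problem."""
    problems = {}
    for path in sorted(Path(root).rglob(pattern)):
        found = find_invalid_utf8(path)
        if found is not None:
            problems[str(path)] = found
    return problems
```

An empty result means every matching file decoded as UTF-8; anything else is a list that could be passed on to a curator.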