sys-bio / temp-biomodels

Temporary place for coordination of updating existing Biomodels
Creative Commons Zero v1.0 Universal

Character encoding issues, particularly in copied abstracts #100

Closed: luciansmith closed this issue 2 years ago

luciansmith commented 2 years ago

Many SBML files have character encoding issues, particularly in author lists and abstracts, where text was cut and pasted from one character encoding scheme into another. In many cases, this makes it impossible for (say) Python to read the file directly, though libsbml can read it anyway. (Need to investigate what libsbml does with the encoding.) This should at least be tested, and ideally fixed.
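A minimal sketch of how such a test could look, assuming the check is simply "does the raw file content decode as UTF-8". The helper names `utf8_error` and `check_file` are hypothetical and not part of this repository.

```python
from pathlib import Path


def utf8_error(data: bytes):
    """Return None if data is valid UTF-8, else a short description of the first problem."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded
        return f"byte offset {exc.start}: {exc.reason}"
    return None


def check_file(path):
    """Read a file as raw bytes (no implicit decoding) and report any UTF-8 problem."""
    return utf8_error(Path(path).read_bytes())
```

Note that a file can pass this check and still contain garbled text, for example when mis-decoded characters were pasted in and then saved as valid UTF-8, so this only catches files that are not UTF-8 at all.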

See https://github.com/sys-bio/temp-biomodels/commit/053cdc40260030a656ab3f69937989db28c33f88 for manual fixes.

jonrkarr commented 2 years ago

Seems somewhat hard to fix.

This StackOverflow post suggests a few Python packages for dealing with this: https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text

luciansmith commented 2 years ago

If we can't fix it, just noticing it and reporting to the curator should be sufficient, then.

luciansmith commented 2 years ago

(There may be a subset we could recognize and fix--just fixing smart quotes and dashes would get maybe 80% of the way there.)
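As a rough sketch of the "smart quotes and dashes" subset, assuming plain ASCII replacements are acceptable in the abstracts: the mapping and the name `normalize_punctuation` are illustrative only and would need to be extended for whatever characters actually occur.

```python
# Hypothetical mapping from typographic punctuation to ASCII equivalents
PUNCTUATION_FIXES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def normalize_punctuation(text: str) -> str:
    """Replace curly quotes and long dashes with ASCII; leave all other characters alone."""
    return text.translate(PUNCTUATION_FIXES)
```

This only handles text that was decoded correctly in the first place; it does not repair characters that were already garbled.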

jonrkarr commented 2 years ago

I think that 053cdc40260030a656ab3f69937989db28c33f88 distorts some of the meaning by removing the bad characters. Some appear to need to be replaced with Greek characters, e.g., μ in μM.

luciansmith commented 2 years ago

I'm sure that's the case. In many (if not most) cases, I couldn't even tell what the original character was supposed to be.

jonrkarr commented 2 years ago

If we don't know what the character is intended to be, I'd suggest replacing the bad character with ?. This would communicate the uncertainty rather than distorting the meaning.
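One possible way to do this in Python, sketched under the assumption that the "bad characters" are bytes that do not decode as UTF-8. The function name `decode_with_placeholder` is hypothetical.

```python
def decode_with_placeholder(data: bytes, placeholder: str = "?") -> str:
    """Decode bytes as UTF-8, marking each undecodable byte sequence with a placeholder."""
    # errors="replace" substitutes U+FFFD (the Unicode replacement character)
    # wherever the input is not valid UTF-8
    text = data.decode("utf-8", errors="replace")
    # Swap the replacement character for the visible placeholder
    return text.replace("\ufffd", placeholder)
```

One caveat: any U+FFFD already present in the input is also turned into the placeholder, which is probably fine here since it signals the same uncertainty.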

jonrkarr commented 2 years ago

See #104 for recommendations for how to do this.

luciansmith commented 2 years ago

OK! This was actually simply my own fault for trying to read UTF-8 files (which the XML is expressly labeled as) in Python with a bare 'open' command. A bare 'open' uses the platform's default encoding, which is not always UTF-8, so if you're reading a UTF-8 file, you need to say so explicitly:

        f = open(file, "r", encoding="utf-8")

Thanks for the pointers to the Python packages; they were helpful in making me realize the files were always in UTF-8, just like they should be.
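For anyone hitting the same thing, here is a small sketch of why a bare open can fail on a valid UTF-8 file. It assumes the platform default is a legacy code page such as cp1252 (common on Windows); the helper `decodes_as` is hypothetical.

```python
import locale

# The encoding a bare open() falls back to when none is given; platform dependent
default_encoding = locale.getpreferredencoding(False)

# A right double quotation mark, encoded as UTF-8: b"\xe2\x80\x9d"
curly = "\u201d".encode("utf-8")


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(default_encoding)
print(decodes_as(curly, "utf-8"))
print(decodes_as(curly, "cp1252"))
```

The byte 0x9d has no character assigned in Python's cp1252 codec, so a UTF-8 curly quote is enough to make reading fail outright under that default, even though the file itself is fine.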