guma44 / GEOparse

Python library to access Gene Expression Omnibus Database (GEO)
BSD 3-Clause "New" or "Revised" License

Allow the file encoding to be specified in smart_open() #63

Closed ghost closed 4 years ago

ghost commented 4 years ago

Thank you very much for the effort made in providing this useful package.

I would like to request the following feature: that the encoding can be specified when calling gzip.open() or open() in smart_open().

I am currently using GEOparse 2.0.1 with Python 3.8.3 on Windows 10. I have successfully downloaded GSE files from GEO (e.g. GSE134809_family.soft) and have also used GEOparse to read the .soft (or .soft.gz) files stored locally on my computer.

I have discovered that some special characters in the .soft files are not interpreted correctly, because gzip.open() and open() use Python's platform-default encoding ('cp1252' on my computer) instead of 'utf-8', even though the .soft files are UTF-8 encoded. Because smart_open() ignores decoding errors when reading the file, via fh = fopen(filepath, mode, errors="ignore"), the special characters do not prevent the file from being read, but they are decoded incorrectly.

The types of characters that I've found to be problematic are letters with accents, and some punctuation marks, e.g. Naïve, 4°C, 3’ prime, “union” (those single and double quotation marks are not the standard ones even though they look similar).
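To illustrate the symptom, here is a minimal sketch that mimics the mismatch in memory: the text is encoded as UTF-8 and then decoded as cp1252 with errors="ignore", which is what effectively happens when a UTF-8 .soft file is read with the Windows default encoding. The helper name misread_as_cp1252 is hypothetical and not part of GEOparse.

```python
def misread_as_cp1252(text):
    """Encode text as UTF-8, then decode the bytes as cp1252, ignoring errors.

    Hypothetical helper for illustration only; it is not part of GEOparse.
    """
    return text.encode("utf-8").decode("cp1252", errors="ignore")


# Each non-ASCII character becomes two or three unrelated characters.
naive = misread_as_cp1252("Naïve")          # 'NaÃ¯ve'
degrees = misread_as_cp1252("4°C")          # '4Â°C'

# The closing curly quote ends in byte 0x9D, which cp1252 does not define,
# so errors="ignore" silently drops that byte instead of raising.
closing_quote = misread_as_cp1252("\u201d")  # 'â€'

print(naive, degrees, closing_quote)
```

Note that the last case loses a byte without any warning, which is why the corruption is easy to miss.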

This could be solved by allowing the encoding argument to be passed to gzip.open() or open() when calling smart_open():

import gzip
import sys
from contextlib import contextmanager


@contextmanager
def smart_open(filepath, encoding):
    """Open file intelligently depending on the source and python version.

    Args:
        filepath (:obj:`str`): Path to the file.
        encoding (:obj:`str`): Encoding to use when reading the file.

    Yields:
        Context manager for file handle.

    """
    if filepath[-2:] == "gz":
        mode = "rt"
        fopen = gzip.open
    else:
        mode = "r"
        fopen = open
    if sys.version_info[0] < 3:
        # In Python 2, open() and gzip.open() do not accept an encoding argument.
        fh = fopen(filepath, mode)
    else:
        fh = fopen(filepath, mode, encoding=encoding)
    try:
        yield fh
    finally:
        # Close the handle exactly once, whether or not the caller's block raised.
        fh.close()
Alternatively, **kwargs could be passed through smart_open() and into gzip.open() and open().
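A minimal sketch of that **kwargs alternative, assuming Python 3 only. The name open_with_kwargs is hypothetical, chosen so it is not mistaken for the library's own smart_open(); the sample file content is made up for the round trip.

```python
import gzip
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_with_kwargs(filepath, **open_kwargs):
    """Hypothetical smart_open() variant that forwards keyword arguments.

    Anything in open_kwargs (e.g. encoding, errors) is passed unchanged to
    gzip.open() or open(). Python 3 only.
    """
    if filepath.endswith("gz"):
        fh = gzip.open(filepath, "rt", **open_kwargs)
    else:
        fh = open(filepath, "r", **open_kwargs)
    try:
        yield fh
    finally:
        fh.close()


# Round trip: write a small gzipped UTF-8 file, then read it back explicitly
# as UTF-8 with strict error handling.
path = os.path.join(tempfile.mkdtemp(), "example.soft.gz")
text = "!Sample_title = Naïve T cells, 4°C\n"
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.write(text)

with open_with_kwargs(path, encoding="utf-8", errors="strict") as fh:
    round_tripped = fh.read()
```

Forwarding keyword arguments keeps the signature stable if further options, such as errors or newline, are needed later.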

Additionally, it would be beneficial if errors were not ignored when reading the files, so that the user can be aware of them. This could be done by using a try/except block to attempt to read the file with strict error handling and, if decoding errors are raised, displaying them to the user and then reading the file again, this time ignoring errors. The file would still be read, but the user would know that there was a problem.
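The strict-then-fallback idea could be sketched roughly as follows, assuming Python 3. The function name read_text_reporting_errors is hypothetical, and reporting via warnings.warn is just one possible way to surface the problem. Decoding errors are raised while reading, not while opening, so the read happens inside the try block.

```python
import gzip
import os
import tempfile
import warnings


def read_text_reporting_errors(filepath, encoding="utf-8"):
    """Read a text file strictly; on a decoding error, warn and retry leniently.

    Hypothetical helper for illustration only; it is not part of GEOparse.
    """
    opener = gzip.open if filepath.endswith("gz") else open
    try:
        with opener(filepath, "rt", encoding=encoding, errors="strict") as fh:
            return fh.read()
    except UnicodeDecodeError as err:
        warnings.warn(
            "Could not decode %s as %s (%s); retrying with errors='ignore'"
            % (filepath, encoding, err)
        )
        with opener(filepath, "rt", encoding=encoding, errors="ignore") as fh:
            return fh.read()


# Demo: byte 0xB0 is the degree sign in Latin-1 but is not valid UTF-8 here.
tmp_path = os.path.join(tempfile.mkdtemp(), "latin1.soft")
with open(tmp_path, "wb") as out:
    out.write(b"4\xb0C\n")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    recovered = read_text_reporting_errors(tmp_path)
```

In the demo the file is still read, with the undecodable byte dropped, but a warning is recorded so the data loss does not go unnoticed.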

guma44 commented 4 years ago

Hi, thanks a lot for the hint and detailed description. I am starting to implement this.

guma44 commented 4 years ago

Hi, the newest version lets you pass a dictionary that is then forwarded to the smart_open function. That should do the trick for you. If not, let me know and I will figure out something else.

ghost commented 4 years ago

This is perfect, thank you.