guma44 / GEOparse

Python library to access Gene Expression Omnibus Database (GEO)
BSD 3-Clause "New" or "Revised" License

Suggestion for an improvement of the GEOparse.utils.smart_open() function #76

Open abysslover opened 2 years ago

abysslover commented 2 years ago

I found that some GEO files contain carriage return characters in their metadata, which causes a GEOparse.GEOTypes.DataIncompatibilityException to be raised. To reproduce the error, load the "GPL10740" dataset as follows:

gpl = GEOparse.get_GEO(geo="GPL10740", silent=True, include_data=True, destdir=".")

(<class 'GEOparse.GEOTypes.DataIncompatibilityException'>, DataIncompatibilityException('\nData columns do not match columns description index in GSM1530106\nColumns in table are: )\nIndex in columns are: ID_REF, VALUE, DETECTION P-VALUE\n',), <traceback object at 0x7f1fee64be48>)

The columns variable obtained from GEOparse.parse_columns(soft) is:

Index(['ID_REF', 'VALUE', 'DETECTION P-VALUE'], dtype='object')

The table_data.columns variable obtained from GEOparse.parse_table_data(soft) is: Index([')'], dtype='object')

This is caused by the first line below, which contains a carriage return (shown as ^M):

!Sample_relation = Alternative to: GSM1530054 (gene-level analysis^M)
!Sample_series_id = GSE62617
!Sample_series_id = GSE70707
#ID_REF =
#VALUE = RMA normalized signal intensity
#DETECTION P-VALUE =
!sample_table_begin
ID_REF  VALUE   DETECTION P-VALUE
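
The effect can be reproduced without GEOparse. The following is a minimal sketch, assuming Python 3 and a made-up excerpt modelled on the lines above (RAW and read_lines are illustrative names, not part of the library). It decodes the same bytes once with the default universal-newline handling and once with newline="\n". In the default mode the lone carriage return ends the line early, leaving a stray ")" line that no longer starts with "!", which is presumably how it ends up being read as table content.

```python
import io

# Hypothetical excerpt modelled on the lines above; b"\r" stands for the ^M.
RAW = (
    b"!Sample_relation = Alternative to: GSM1530054 (gene-level analysis\r)\n"
    b"!Sample_series_id = GSE62617\n"
    b"!sample_table_begin\n"
)


def read_lines(raw, newline):
    """Decode raw bytes as text and split them into lines with the given newline mode."""
    with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=newline) as fh:
        return [line.rstrip("\n") for line in fh]


universal = read_lines(RAW, None)  # default mode: a bare "\r" also ends a line
strict = read_lines(RAW, "\n")     # only "\n" ends a line; "\r" stays in the text

print(universal)
print(strict)
```

With the default mode the excerpt comes back as four lines, the second being just ")". With newline="\n" it comes back as three lines and the carriage return remains inside the first one.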

I suggest a small modification to the GEOparse.utils.smart_open() function so that it can handle such datasets:

import gzip
from contextlib import contextmanager


@contextmanager
def smart_open(filepath, **open_kwargs):
    """Open file intelligently depending on the source and python version.

    Args:
        filepath (:obj:`str`): Path to the file.

    Yields:
        Context manager for file handle.

    """
    if "errors" not in open_kwargs:
        open_kwargs["errors"] = "ignore"
    if filepath.endswith(".gz"):
        open_kwargs["mode"] = "rt"
        fopen = gzip.open
    else:
        open_kwargs["mode"] = "r"
        fopen = open
    # Treat only "\n" as a line terminator so a stray "\r" stays inside its line.
    open_kwargs["newline"] = "\n"
    # The existing code branches on sys.version_info[0] < 3 at this point, but I do
    # not know why, because both branches call fopen with the same parameters.
    fh = fopen(filepath, **open_kwargs)
    try:
        yield fh
    finally:
        # Close exactly once and let any IOError from the caller propagate.
        fh.close()
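
As a quick check of the idea, the sketch below writes the problematic line to a plain file and to a gzip file, then reads both back. It assumes Python 3; open_text_lf is a hypothetical stand-in that mirrors the suggestion above, not the library's own function, and the file names and content are made up for illustration.

```python
import gzip
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_text_lf(filepath, **open_kwargs):
    """Hypothetical stand-in for the suggested smart_open (Python 3 only)."""
    open_kwargs.setdefault("errors", "ignore")
    open_kwargs["newline"] = "\n"  # only "\n" terminates a line
    if filepath.endswith(".gz"):
        open_kwargs["mode"] = "rt"
        fh = gzip.open(filepath, **open_kwargs)
    else:
        open_kwargs["mode"] = "r"
        fh = open(filepath, **open_kwargs)
    try:
        yield fh
    finally:
        fh.close()


# Made-up content: one metadata line with a stray carriage return, one column line.
CONTENT = b"!Sample_relation = (gene-level analysis\r)\n#ID_REF = \n"

with tempfile.TemporaryDirectory() as tmpdir:
    plain_path = os.path.join(tmpdir, "sample.txt")
    gz_path = os.path.join(tmpdir, "sample.txt.gz")
    with open(plain_path, "wb") as out:
        out.write(CONTENT)
    with gzip.open(gz_path, "wb") as out:
        out.write(CONTENT)

    with open_text_lf(plain_path) as fh:
        plain_lines = fh.readlines()
    with open_text_lf(gz_path) as fh:
        gz_lines = fh.readlines()

print(len(plain_lines), len(gz_lines))
```

Both files should be read as two lines, with the carriage return kept inside the first line instead of splitting it.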