I found that some GEO files contain carriage return characters in the metadata, which causes a GEOparse.GEOTypes.DataIncompatibilityException. To reproduce the error, load the "GPL10740" dataset as follows:
gpl = GEOparse.get_GEO(geo="GPL10740", silent=True, include_data=True, destdir=".")
(<class 'GEOparse.GEOTypes.DataIncompatibilityException'>, DataIncompatibilityException('\nData columns do not match columns description index in GSM1530106\nColumns in table are: )\nIndex in columns are: ID_REF, VALUE, DETECTION P-VALUE\n',), <traceback object at 0x7f1fee64be48>)
The columns variable taken from GEOparse.parse_columns(soft) is:
Index(['ID_REF', 'VALUE', 'DETECTION P-VALUE'], dtype='object')
whereas the table_data.columns variable taken from GEOparse.parse_table_data(soft) is:
Index([')'], dtype='object')
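This is due to the following metadata line, which contains a carriage return (shown as ^M) before the closing parenthesis. In Python's default universal-newlines mode a lone carriage return counts as a line ending, so the ")" is split onto a line of its own and apparently gets read as the table header: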
!Sample_relation = Alternative to: GSM1530054 (gene-level analysis^M)
!Sample_series_id = GSE62617
!Sample_series_id = GSE70707
#ID_REF =
#VALUE = RMA normalized signal intensity
#DETECTION P-VALUE =
!sample_table_begin
ID_REF VALUE DETECTION P-VALUE
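The splitting can be reproduced without GEOparse. The following is a minimal sketch, assuming Python 3 and a made-up two-line SOFT fragment written to a temporary file; `read_lines` is a hypothetical helper and not part of GEOparse. It reads the same bytes once with the default newline handling and once with `newline="\n"`.

```python
import os
import tempfile

# Made-up fragment mimicking the problematic metadata line: a lone "\r"
# sits just before the closing parenthesis.
RAW = (
    b"!Sample_relation = Alternative to: GSM1530054 (gene-level analysis\r)\n"
    b"!Sample_series_id = GSE62617\n"
)


def read_lines(path, **open_kwargs):
    """Hypothetical helper: return the lines of a text file as a list."""
    with open(path, "r", **open_kwargs) as fh:
        # Iterate over the handle instead of using str.splitlines(),
        # because splitlines() would itself split on "\r".
        return list(fh)


with tempfile.NamedTemporaryFile(suffix=".soft", delete=False) as tmp:
    tmp.write(RAW)
    path = tmp.name

try:
    # Default (universal newlines): the lone "\r" ends a line.
    default_lines = read_lines(path)
    # newline="\n": only "\n" ends a line, the "\r" stays in place.
    fixed_lines = read_lines(path, newline="\n")
finally:
    os.remove(path)

print(len(default_lines), default_lines[1].strip())
print(len(fixed_lines))
```

With the default settings the fragment comes back as three lines, the second of which is just `)`. With `newline="\n"` it comes back as two lines and the carriage return remains inside the first one.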
I suggest a small modification to the GEOparse.utils.smart_open() function so that it can handle such datasets: pass newline="\n" when opening the file, so that only "\n" ends a line.
import gzip
import sys
from contextlib import contextmanager


@contextmanager
def smart_open(filepath, **open_kwargs):
    """Open file intelligently depending on the source and python version.

    Args:
        filepath (:obj:`str`): Path to the file.

    Yields:
        Context manager for file handle.

    """
    if "errors" not in open_kwargs:
        open_kwargs["errors"] = "ignore"
    if filepath[-2:] == "gz":
        open_kwargs["mode"] = "rt"
        fopen = gzip.open
    else:
        open_kwargs["mode"] = "r"
        fopen = open
    # Split lines on "\n" only, so a stray carriage return does not start a new line.
    open_kwargs["newline"] = "\n"
    # I do not know why there is an "if" statement here, since both branches call fopen with the same parameters.
    if sys.version_info[0] < 3:
        fh = fopen(filepath, **open_kwargs)
    else:
        fh = fopen(filepath, **open_kwargs)
    try:
        yield fh
    except IOError:
        fh.close()
    finally:
        fh.close()
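As a quick check of the gzip branch, the sketch below writes a made-up compressed fragment and reads it back with `gzip.open(..., mode="rt", errors="ignore", newline="\n")`, which are the arguments the function above would pass for a ".gz" path. It assumes Python 3, and `strip_cr` is a hypothetical helper that is not part of GEOparse. The fragment deliberately uses Windows-style line endings to show a side effect: with `newline="\n"` each line keeps its trailing carriage return, so callers may want to strip it.

```python
import gzip
import os
import tempfile

# Made-up fragment: a stray "\r" inside the first line, plus "\r\n" line endings.
RAW = (
    b"!Sample_relation = Alternative to: GSM1530054 (gene-level analysis\r)\r\n"
    b"!sample_table_begin\r\n"
)


def strip_cr(line):
    """Hypothetical helper: drop the line ending, including a trailing carriage return."""
    return line.rstrip("\r\n")


fd, path = tempfile.mkstemp(suffix=".soft.gz")
os.close(fd)
try:
    with gzip.open(path, "wb") as fh:
        fh.write(RAW)
    # Text mode with newline="\n": only "\n" terminates a line.
    with gzip.open(path, mode="rt", errors="ignore", newline="\n") as fh:
        raw_lines = list(fh)
finally:
    os.remove(path)

clean_lines = [strip_cr(line) for line in raw_lines]
print(len(raw_lines), repr(raw_lines[1]), repr(clean_lines[1]))
```

The fragment is read as two lines, so the stray carriage return no longer splits the metadata line. Whether the trailing carriage returns matter depends on how the downstream parsers treat whitespace, which this sketch does not cover.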