guma44 / GEOparse

Python library to access Gene Expression Omnibus Database (GEO)
BSD 3-Clause "New" or "Revised" License

Suggestion for an improvement of the GEOparse.__parse_entry() function #75

Open abysslover opened 2 years ago

abysslover commented 2 years ago

When the entry_line parameter of the GEOparse.__parse_entry() function contains more than one equals sign, parsing with the current function ends in an exception (GEOparse.GEOTypes.DataIncompatibilityException), as follows:

gpl = GEOparse.get_GEO(geo="GPL6101", silent=True, include_data=True, destdir=".")

(<class 'GEOparse.GEOTypes.DataIncompatibilityException'>, DataIncompatibilityException('\nData columns do not match columns description index in GSM665713\nColumns in table are: ID_REF, VALUE, T0-S(0)-2=S(15).Detection Pval\nIndex in columns are: ID_REF, VALUE, T0-S(0)-2\n',), <traceback object at 0x7f57f4dd9688>)

The line causing the above exception is as follows:

#T0-S(0)-2=S(15).Detection Pval =

The columns variable returned by GEOparse.parse_columns(soft) looks as follows:

Index(['ID_REF', 'VALUE', 'T0-S(0)-2'], dtype='object')

Meanwhile, GEOparse.parse_table_data(soft) correctly parsed the SOFT data as follows:

Index(['ID_REF', 'VALUE', 'T0-S(0)-2=S(15).Detection Pval'], dtype='object')
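The truncated column name can be reproduced without GEOparse. The snippet below is a minimal sketch which assumes that the column-description line is split on its first equals sign, as the output above suggests; naive_split is a hypothetical helper written for this illustration, not a function of the library.

```python
def naive_split(entry_line):
    """Split a '#' column-description line on its first '=' (assumed behaviour)."""
    body = entry_line.strip()[1:]  # drop the leading '#'
    parts = [part.strip() for part in body.split("=", 1)]
    name = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return name, description


line = "#T0-S(0)-2=S(15).Detection Pval ="
name, description = naive_split(line)
print(name)         # T0-S(0)-2
print(description)  # S(15).Detection Pval =
```

Splitting on the first equals sign cuts the column name at 'T0-S(0)-2', so it no longer matches the header found in the data table.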

I therefore suggest modifying the GEOparse.__parse_entry() function as follows:

from re import split, sub  # regex helpers used in the function body

def __parse_entry(entry_line):
    """Parse the SOFT file entry name line that starts with '^', '!' or '#'.

    Args:
        entry_line (:obj:`str`): Line from SOFT file to be parsed.

    Returns:
        :obj:`2-tuple`: Type of entry, value of entry.

    """
    if entry_line.startswith("!"):
        entry_line = sub(r"!\w*?_", "", entry_line)
    else:
        entry_line = entry_line.strip()[1:]
    n_equal_sign = entry_line.count("=")
    try:
        if 1 == n_equal_sign:
            entry_type, entry_name = [i.strip() for i in entry_line.split("=", maxsplit=1)]
        else:
            entry_type, entry_name = [i.strip() for i in split(" = ?", entry_line, maxsplit=1)]
    except ValueError:
        if 1 == n_equal_sign:
            entry_type = [i.strip() for i in entry_line.split("=", maxsplit=1)][0]
        else:
            entry_type = [i.strip() for i in split(" = ?", entry_line, maxsplit=1)][0]
        entry_name = ""
    return entry_type, entry_name
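To check the suggestion in isolation, here is a standalone sketch of the same logic with the duplicated try/except branches folded into one path. The name parse_entry and the sample lines other than the GSM665713 column line are made up for this illustration; this is not the library's code.

```python
from re import split, sub


def parse_entry(entry_line):
    """Standalone copy of the proposed logic, condensed for testing."""
    if entry_line.startswith("!"):
        entry_line = sub(r"!\w*?_", "", entry_line)
    else:
        entry_line = entry_line.strip()[1:]
    if entry_line.count("=") == 1:
        parts = entry_line.split("=", maxsplit=1)
    else:
        # Zero or several '=': split only on an '=' that follows a space.
        parts = split(" = ?", entry_line, maxsplit=1)
    parts = [part.strip() for part in parts]
    entry_type = parts[0]
    entry_name = parts[1] if len(parts) > 1 else ""
    return entry_type, entry_name


print(parse_entry("#T0-S(0)-2=S(15).Detection Pval ="))
# ('T0-S(0)-2=S(15).Detection Pval', '')
print(parse_entry("^SAMPLE = GSM665713"))
# ('SAMPLE', 'GSM665713')
print(parse_entry("!Sample_description = ratio = a/b"))
# ('description', 'ratio = a/b')
```

One edge case to keep in mind: a line with several equals signs and no space before any of them (for example "#A=B=C") is not split at all by this logic and comes back as ('A=B=C', '').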