guma44 / GEOparse

Python library to access Gene Expression Omnibus Database (GEO)
BSD 3-Clause "New" or "Revised" License

Phenotype data does not use pandas dtype inference #74

Open hardingnj opened 2 years ago

hardingnj commented 2 years ago

Because the phenotype table is built without going through pandas' read_csv function, we lose its automatic detection of NaN values, so columns that are numeric end up with dtype object.

For example:

import GEOparse

geo = GEOparse.get_GEO("GSE112676")

geo.phenotype_data["characteristics_ch1.3.age_onset"]

gives

GSM3076582    72.69
GSM3076584    66.97
GSM3076586    73.73
GSM3076588       NA
GSM3076590       NA
              ...  
GSM3078502    74.88
GSM3078503    73.57
GSM3078505    71.29
GSM3078507    61.84
GSM3078510    74.49
Name: characteristics_ch1.3.age_onset, Length: 741, dtype: object

So the "NA" strings are not recognized as missing values, and the whole column is stored as object instead of float64.
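The effect can be reproduced without downloading anything. The sketch below is only an illustration, not GEOparse code: the sample IDs and values are copied from the output above, and the short column name `age_onset` is a stand-in chosen for this example.

```python
import io

import pandas as pd

# Stand-in data: a few values copied from the output above, stored as strings.
raw = {"GSM3076582": "72.69", "GSM3076584": "66.97", "GSM3076588": "NA"}

# Built directly from Python strings: pandas does no numeric inference here,
# so the series keeps a non-numeric dtype and "NA" stays a plain string.
direct = pd.Series(raw, name="age_onset")
print(direct.dtype)

# Parsed through read_csv: "NA" is one of the default missing-value markers,
# so the column is inferred as a float column with one missing entry.
csv_text = "sample,age_onset\n" + "\n".join(f"{k},{v}" for k, v in raw.items())
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0)["age_onset"]
print(parsed.dtype)
```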

My fix is something like this:

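from io import StringIO
import pandas as pd

# pheno is the phenotype DataFrame (geo.phenotype_data in the example above)
# Round-trip through CSV so that read_csv infers dtypes and parses "NA" as NaN
out = StringIO()
pheno.to_csv(out)
pheno = pd.read_csv(StringIO(out.getvalue()), index_col=0)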

I can put in a quick PR. It feels a little crude to do it this way, but I haven't been able to find a more elegant approach.
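One possible alternative that avoids the CSV round trip is to convert column by column with `pd.to_numeric`. This is a rough sketch rather than a proposed patch: the helper name `infer_numeric_columns` and the default list of NA strings are assumptions made for this example, and the NA list is narrower than the one read_csv uses by default.

```python
import pandas as pd


def infer_numeric_columns(df, na_strings=("NA", "NaN", "nan", "")):
    """Hypothetical helper: return a copy of df in which every column that is
    fully numeric, apart from NA markers, is converted to a numeric dtype."""
    out = df.copy()
    for col in out.columns:
        # Entries that are already missing or that match an NA marker string.
        is_na = out[col].isna() | out[col].isin(list(na_strings))
        # Blank out the NA markers, then try to parse everything else.
        converted = pd.to_numeric(out[col].where(~is_na), errors="coerce")
        # Convert only if every non-missing entry parsed as a number;
        # otherwise leave the column exactly as it was.
        if (converted.notna() == ~is_na).all():
            out[col] = converted
    return out


# Small made-up table for illustration only.
pheno = pd.DataFrame(
    {"age_onset": ["72.69", "NA", "66.97"], "sex": ["M", "F", "NA"]},
    index=["GSM1", "GSM2", "GSM3"],
)
fixed = infer_numeric_columns(pheno)
print(fixed.dtypes)
```

Columns with any genuinely non-numeric value are left untouched, including their "NA" strings, so this is more conservative than the read_csv round trip, which would also turn those strings into NaN.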

guma44 commented 2 years ago

Thanks for reporting. Let me think about how to do this. A PR would probably be a good idea, so that we can test it.