guma44 / GEOparse

Python library to access Gene Expression Omnibus Database (GEO)
BSD 3-Clause "New" or "Revised" License
DtypeWarning: Columns (7) have mixed types. #70

Open CholoTook opened 3 years ago

CholoTook commented 3 years ago

The following code is generating a warning for me:

import GEOparse
gpl = GEOparse.get_GEO('GPL17481')

The output is:

>>> import GEOparse
>>> gpl = GEOparse.get_GEO('GPL17481')
17-May-2021 13:32:21 DEBUG utils - Directory ./ already exists. Skipping.
17-May-2021 13:32:21 INFO GEOparse - Downloading http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL17481&form=text&view=full to ./GPL17481.txt
17-May-2021 13:32:23 DEBUG downloader - Total size: 0
17-May-2021 13:32:23 DEBUG downloader - md5: None
1.72MB [00:00,1.63MB/s]
10.3MB [00:01, 7.26MB/s]
17-May-2021 13:32:24 DEBUG downloader - Moving /tmp/tmp2lblbvso to /home/dbolser/Geromics/Dogome/Geromics/GPL17481.txt
17-May-2021 13:32:24 DEBUG downloader - Successfully downloaded http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL17481&form=text&view=full
17-May-2021 13:32:24 INFO GEOparse - Parsing ./GPL17481.txt: 
17-May-2021 13:32:24 DEBUG GEOparse - PLATFORM: GPL17481
/usr/bin/bpython3:1: DtypeWarning: Columns (7) have mixed types.Specify dtype option on import or set low_memory=False.
  #!/usr/bin/python3
>>> 

I get that this error is coming from pandas, but I'm not sure how to fix it.

guma44 commented 3 years ago

Hi, let me look at it. There is probably something strange in the GPL file. Maybe editing the file would do the trick. Assuming this is a one-off, that could be a good solution. Anyway, taking a look at the GPL file would shed some light on the real cause.

CholoTook commented 3 years ago

Could it be that the chromosome column starts out as an int, and then becomes a str?

!platform_table_begin
ID      CHROMOSOME      Position        SNP     Plus/Minus Strand       CanineHD_A.bpm.Address  SPOT_ID SNP_ID
BICF2G630100019 25      34549096        [A/G]   BOT     25732300        BICF2G630100019 
BICF2G630100032 25      34560607        [A/G]   BOT     18759386        BICF2G630100032 
BICF2G630100034 25      34561954        [A/G]   BOT     13789354        BICF2G630100034 
BICF2G630100043 25      34587072        [A/G]   BOT     32780356        BICF2G630100043 
BICF2G630100054 25      34604596        [T/C]   BOT     21757302        BICF2G630100054 
BICF2G630100063 25      34615165        [A/G]   BOT     51809461        BICF2G630100063 
BICF2G630100075 25      34638645        [A/C]   BOT     55806509        BICF2G630100075 
BICF2G63010009  X       95382735        [T/C]   BOT     41613463        BICF2G63010009  
BICF2G630100090 25      34688200        [T/C]   BOT     51730475        BICF2G630100090 
BICF2G630100094 25      34689509        [A/T]   BOT     53724487        BICF2G630100094 
BICF2G63010010  X       95373856        [A/G]   BOT     49675468        BICF2G63010010  

Pandas may guess that it's an int and then get confused... As I said, I'm not super familiar with pandas, but I suppose there is a way to let it know the datatype of each column. However, I don't know how GEOparse invokes Pandas.
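
For illustration, here is a minimal pandas-only sketch of declaring a column's type up front. It does not go through GEOparse; the three-column sample is a cut-down stand-in for the platform table above, and reading CHROMOSOME as a string is an assumption about what is wanted.

```python
import io

import pandas as pd

# A few rows modelled on the platform table above (tab-separated).
# This is a hand-made sample, not the real GPL17481 file.
sample = (
    "ID\tCHROMOSOME\tPosition\n"
    "BICF2G630100019\t25\t34549096\n"
    "BICF2G63010009\tX\t95382735\n"
    "BICF2G630100090\t25\t34688200\n"
)

# Declaring the dtype means pandas does not have to infer it for this column.
table = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"CHROMOSOME": str})

print(table.dtypes)
```

With the dtype given explicitly, every CHROMOSOME value is read as text, so "25" and "X" end up with the same type.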

guma44 commented 3 years ago

Indeed, that seems to be the problem. Currently, the package does not allow passing kwargs to pandas. However, if the code is part of a script and the mixed types affect its behaviour, you could convert the type after the data is read.
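
A minimal sketch of that after-the-fact conversion, assuming the parsed table is an ordinary pandas DataFrame (with GEOparse this would presumably be gpl.table). The small frame below is a hand-made stand-in, not downloaded data.

```python
import pandas as pd

# Stand-in for the parsed platform table. The CHROMOSOME column mixes
# ints and strings, which is the situation the warning describes.
table = pd.DataFrame(
    {
        "ID": ["BICF2G630100019", "BICF2G630100032", "BICF2G63010009"],
        "CHROMOSOME": [25, 25, "X"],
    }
)

# Normalise the column to strings so every value has the same type.
table["CHROMOSOME"] = table["CHROMOSOME"].astype(str)

print(table["CHROMOSOME"].tolist())
```

Converting to strings rather than integers keeps non-numeric labels such as "X" intact.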

CholoTook commented 3 years ago

Seems not to cause any problem TBH. It's just a bit of a weird-looking warning.

You could probably get away with the low_memory=False flag by default?
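
For reference, this is what the flag looks like in a direct pandas call. It is only a sketch: the sample is a tiny hand-made stand-in for the platform table and is far too small to trigger the warning by itself, and whether GEOparse could pass the flag through is not settled in this thread.

```python
import io

import pandas as pd

# Hand-made sample with a CHROMOSOME column that mixes numbers and letters.
sample = (
    "ID\tCHROMOSOME\tPosition\n"
    "BICF2G630100019\t25\t34549096\n"
    "BICF2G63010009\tX\t95382735\n"
    "BICF2G630100090\t25\t34688200\n"
)

# low_memory=False asks pandas to infer each column's type from the whole
# column at once instead of chunk by chunk, at the cost of more memory.
table = pd.read_csv(io.StringIO(sample), sep="\t", low_memory=False)

print(table["CHROMOSOME"].tolist())
```

Because the column contains "X", it cannot be parsed as an integer column, so all of its values come back as strings.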

Thanks for the help, Dan.