pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

read_csv C engine hangs on my server #7752

Closed JakeEhrlich closed 9 years ago

JakeEhrlich commented 10 years ago

so I have a web application where users need to be able to upload csv files. If I use the c engine the whole thing hangs indefinitely. I was able to narrow the hang down to read_csv.

code:

data = smart_unicode(data)  # convert to unicode properly
frames = {"default": pandas.read_csv(io.StringIO(data))}

where 'data' is a Unicode string output by Django's 'smart_unicode'. The same issue occurs if you use data = data.decode('utf-8'). The issue does not occur if I use the python engine however.

I am running an apache 2 server with django 1.6 and mod_wsgi. I have no clue what is causing this however.
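The engine comparison described above can be reproduced outside the web application. This is a minimal sketch, assuming Python 3 and a recent pandas rather than the 2.7 / 0.14.0.dev setup in the report; the `data` string is a made-up stand-in for the uploaded file.

```python
import io

import pandas as pd

# Made-up stand-in for the uploaded, already-decoded CSV text.
data = u"id,name\n1,Zo\u00eb\n2,Ana\n"

# Parse the same text once per engine, selecting the engine explicitly.
frame_c = pd.read_csv(io.StringIO(data), engine="c")
frame_py = pd.read_csv(io.StringIO(data), engine="python")

# On well-formed input both engines are expected to return the same table.
print(frame_c.values.tolist() == frame_py.values.tolist())
```

If only one of the two calls returns, that points at the engine rather than at Django or the upload handling.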

jreback commented 10 years ago

try specifying the encoding keyword, see here: http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files

e.g. encoding='utf-8'

JakeEhrlich commented 10 years ago

still hangs.

Also I should note that I don't have root on this server, but I can pretty quickly ask the sysadmin for any desired specs. Note that he may decline to allow me to post the information here in some cases.

jreback commented 10 years ago

well, you have to be able to decode it first. try with the csv module once you figure out the encoding. you have to specify the correct encoding.
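The "decode it first, then try the csv module" suggestion can be sketched as follows. This assumes Python 3; `rows_from_bytes` and the sample bytes are hypothetical names made up for illustration, not part of pandas or Django.

```python
import csv
import io


def rows_from_bytes(raw, encoding="utf-8"):
    """Decode raw bytes strictly, then parse them with the standard csv module.

    Hypothetical helper: raises UnicodeDecodeError if `raw` is not valid in
    the given encoding, which settles whether the upload really is utf-8.
    """
    text = raw.decode(encoding)  # strict error handling is the default
    return list(csv.reader(io.StringIO(text, newline="")))


# Made-up stand-in for the bytes received from the upload.
sample = u"id,name\n1,Zo\u00eb\n".encode("utf-8")
rows = rows_from_bytes(sample)
print(rows)
```

If this step succeeds, the encoding is not the problem and the decoded text can be handed to read_csv as before.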

JakeEhrlich commented 10 years ago

the encoding being used is utf-8 and I wrote:

data = data.decode('utf-8')
frames = {"default": pandas.read_csv(io.StringIO(data), encoding='utf-8')}

aside from having done this, I think it probably shouldn't matter, since StringIO outputs Unicode AND I pass a unicode string to StringIO to begin with. Why should the encoding still be specified?

jreback commented 10 years ago

well, I think you have a weird encoding. both read_csv engines (c and python) work well with the encoding option. You have to show that it IS indeed utf-8. If you cannot decode it then pandas won't help.

JakeEhrlich commented 10 years ago

That doesn't make sense. It works on my local machine and then doesn't work on my server, with no other variables in the code (but there are variables in what server is being used and other system things). I'm not changing the encoding or version of anything anywhere, yet just changing the system the code is run on makes it hang.

jreback commented 10 years ago

it's your server. are you sure the versions are identical? the files exactly the same, etc.?

JakeEhrlich commented 10 years ago

Ah, I stand corrected on the pandas versions being the same. My local version is '0.12.0' and my server version is '0.14.0.dev' as per pandas.__version__. Python is 2.7.7 in both places.

0.14.0.dev hangs on my server

jreback commented 10 years ago

ok, well, if you can provide a sample file that hangs, someone can take a look, otherwise pretty tricky. just replace any data with random (and still give a file that hangs).
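One way to "replace any data with random" while keeping the file's shape is sketched below. `scramble_csv` is a hypothetical helper, not a pandas function. It only swaps ASCII letters and digits, so delimiters, quoting, line lengths, and any non-ASCII characters stay exactly where they were; non-ASCII text would still need a manual look before sharing, and the scrambled file should be re-tested to confirm it still hangs.

```python
import random
import string


def scramble_csv(text, seed=0):
    """Swap ASCII letters and digits for random ones of the same kind.

    Hypothetical helper for building a shareable repro file. Delimiters,
    quotes, newlines, and non-ASCII characters are left untouched.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_letters:
            out.append(rng.choice(string.ascii_letters))
        else:
            out.append(ch)
    return "".join(out)


# Made-up sample; the result has the same length and the same commas/newlines.
print(scramble_csv(u"Student ID,Grade\n1,A+\n2,B-\n"))
```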

JakeEhrlich commented 10 years ago

with unicode in a file that hangs: http://pastebin.com/G0UTrnMK
without unicode in a file that hangs: http://pastebin.com/mV35bGtV

jreback commented 10 years ago

pls post pd.show_versions()

JakeEhrlich commented 10 years ago

can you explain what 'pd.show_versions()' is?

server:

INSTALLED VERSIONS

commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.14-1-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.0.dev
nose: 1.3.3
Cython: None
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.4.2
IPython: None
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.5
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

and it seems that show_versions is not possible to call on my local version. not sure why.

jreback commented 10 years ago

ahh, that was introduced in 0.13 I think. in 0.12 u can run pandas/utils/print_versions.py I think

it shows all of the installed versions and your platform info

which for a problem like this is a must

JakeEhrlich commented 10 years ago

there is no such script. I have a folder in the pandas folder called 'util' that contains clipboard, counter, __init__, py3compat, testing, compat, decorators, misc, and terminal scripts. There is also a 'version' script but it just has the following code in it:

version = '0.12.0'
short_version = '0.12.0'

there is also an info.py but it is just a big comment. I looked elsewhere for other possible items but could not find anything.

jreback commented 10 years ago

meant util

hmm thought it was in there

no worries (maybe just post whether the platform / locale are different, as u said it works on 0.12 right?)
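Where `pd.show_versions()` is not available, the platform and locale details asked for here can be gathered by hand with the standard library. This is a rough sketch; `environment_report` is a hypothetical helper whose keys loosely mirror the server output posted earlier, and it is not a pandas API.

```python
import locale
import platform
import sys


def environment_report():
    """Collect basic platform/locale details (hypothetical helper)."""
    report = {
        "python": platform.python_version(),
        "python-bits": 64 if sys.maxsize > 2**32 else 32,
        "OS": platform.system(),
        "OS-release": platform.release(),
        "machine": platform.machine(),
        "byteorder": sys.byteorder,
        "preferred-encoding": locale.getpreferredencoding(False),
    }
    try:
        import pandas
        report["pandas"] = pandas.__version__
    except ImportError:
        # pandas missing entirely is itself worth reporting.
        report["pandas"] = None
    return report


for key, value in environment_report().items():
    print("%s: %s" % (key, value))
```

Running it on both machines and diffing the two outputs makes version or locale mismatches easy to spot.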

JakeEhrlich commented 10 years ago

Linux Mint 16 (Petra I think it is called), worked fine on 0.12 with no hanging. I am also using https://github.com/teddziuba/django-sslserver to run my local web server and apache with mod_wsgi to run my production server, so there could be some drastic differences between how the threads and whatnot are set up.
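Since the hang only shows up under the production server, one way to take Apache and mod_wsgi out of the picture is to run the parse in a separate Python process with a timeout. This is a sketch under assumptions: it needs Python 3 (for `subprocess.run`), and `try_read_csv` and `CHILD_CODE` are hypothetical names made up for illustration.

```python
import subprocess
import sys

# Code run in the child process: argv[1] is the CSV path, argv[2] the engine.
CHILD_CODE = "import sys, pandas; pandas.read_csv(sys.argv[1], engine=sys.argv[2])"


def try_read_csv(path, engine="c", timeout=30.0):
    """Parse `path` in a child process; return 'ok', 'error', or 'timeout'.

    Hypothetical helper. A 'timeout' result means the child was still
    running when the time limit expired, i.e. a likely hang.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, path, engine],
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "error"
```

Calling it once with `engine="c"` and once with `engine="python"` from a plain shell on the server would show whether the hang depends on the web server environment at all.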

jreback commented 10 years ago

you might want to try with 0.14 or 0.14.1 (instead of a dev version) http://pandas.pydata.org/pandas-docs/stable/whatsnew.html

there is support for full line comments (which is a possible cause of the issue). depending on your dev version this might or might not be included (not sure what snapshot u r using); this is in 0.14.1

JakeEhrlich commented 10 years ago

there were no comments in the files I gave you that still hang. I'll see if my sysadmin can change the package in some way. I'll try some different versions and see what happens tomorrow.

jreback commented 10 years ago

not what I mean - there was a change in parsing to handle comments. it's possible unicode messes with this somehow, but definitely try with 0.14.0 (which doesn't have this change)

jreback commented 10 years ago

Both of your pasted files above read correctly for me on 0.14.1 (with and w/o engine=c|python)

The unicode one asked me to specify an encoding, it came up as chinese..... (I didn't write it down), I saved it as utf-8 and it worked just fine.

In [1]: !cat in.csv
Student ID,Grade,GPA,Major,Gender,Ethnicity,SAT,ACT,Math,High school,Class,Expected,TOEFL,Raw Score Pre,Pre %,Post Raw Score,Post %,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q25,Q26,Q27,Q28,Q29,Q30,Bad Data,Do not import,Unknown
1,A,0,Physics,M,Caucasian,2400,36,None,0,Pre-freshman,1950,0,0,0,0,0,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,adsf,adsf,asd
2,A+,0.01,Astronomy,F,Asian,2399,35,Pre-algebra,1,Freshman,1951,1,1,1,1,1,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,asdf,234,2342
3,A-,0.02,Engineering,male,Native American,2398,34,Trigonometry,Yes,Sophmore,1952,2,2,2,2,2,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,23242,d4,432
4,B,0.03,Civil Engineering,Male,Other,2397,33,Pre-calculus,No,Junior,1953,3,3,3,3,3,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,234,wer,23r4
5,B+,0.04,Biology,Female,African American,2396,32,Calculus,TRUE,Senior,1954,4,4,4,4,4,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,234,asdf23,234
In [2]: read_csv('in.csv')
Out[2]: 
   Student ID Grade   GPA              Major  Gender         Ethnicity   SAT  ACT          Math High school         Class  Expected  TOEFL  Raw Score Pre  Pre %  Post Raw Score  Post % Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 ... Q8.1 Q9.1 Q10.1 Q11.1 Q12.1 Q13.1 Q14.1 Q15.1 Q16.1 Q17.1 Q18.1 Q19.1 Q20.1 Q21.1  \
0           1     A  0.00            Physics       M         Caucasian  2400   36          None           0  Pre-freshman      1950      0              0      0               0       0  A  A  A  A  A  A  A  A ...    A    A     A     A     A     A     A     A     A     A     A     A     A     A   
1           2    A+  0.01          Astronomy       F             Asian  2399   35   Pre-algebra           1      Freshman      1951      1              1      1               1       1  B  B  B  B  B  B  B  B ...    B    B     B     B     B     B     B     B     B     B     B     B     B     B   
2           3    A-  0.02        Engineering    male   Native American  2398   34  Trigonometry         Yes      Sophmore      1952      2              2      2               2       2  C  C  C  C  C  C  C  C ...    C    C     C     C     C     C     C     C     C     C     C     C     C     C   
3           4     B  0.03  Civil Engineering    Male             Other  2397   33  Pre-calculus          No        Junior      1953      3              3      3               3       3  D  D  D  D  D  D  D  D ...    D    D     D     D     D     D     D     D     D     D     D     D     D     D   
4           5    B+  0.04            Biology  Female  African American  2396   32      Calculus        TRUE        Senior      1954      4              4      4               4       4  E  E  E  E  E  E  E  E ...    E    E     E     E     E     E     E     E     E     E     E     E     E     E   

  Q22.1 Q23.1 Q25.1 Q26.1 Q27.1 Q28.1 Q29.1 Q30.1 Bad Data Do not import Unknown  
0     A     A     A     A     A     A     A     A     adsf          adsf     asd  
1     B     B     B     B     B     B     B     B     asdf           234    2342  
2     C     C     C     C     C     C     C     C    23242            d4     432  
3     D     D     D     D     D     D     D     D      234           wer    23r4  
4     E     E     E     E     E     E     E     E      234        asdf23     234  

[5 rows x 79 columns]
JakeEhrlich commented 10 years ago

I'll see if changing to 0.14.1 fixes the issue

jreback commented 9 years ago

closing as not repro