Closed JakeEhrlich closed 9 years ago
try specifying the encoding keyword, e.g. encoding='utf-8'
see here: http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
still hangs.
Also I should note that I don't have root on this server but I can pretty quickly ask the sysadmin for any desired specs. note that he may decline to allow me to post the information here in some cases.
well, you have to be able to decode it first. try with the csv module once you figure out the encoding. you have to specify the correct encoding.
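For reference, the "decode first, then try the csv module" suggestion above could look roughly like the sketch below. It is written for Python 3 (the thread itself is on Python 2.7), and the helper name parse_csv_bytes and the sample data are made up for illustration; they are not from the reporter's application.

```python
import csv
import io


def parse_csv_bytes(raw, encoding="utf-8"):
    """Decode raw bytes with an explicit encoding, then parse with the csv module."""
    # Strict decoding: raises UnicodeDecodeError if the bytes are not in `encoding`.
    text = raw.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can handle them.
    reader = csv.reader(io.StringIO(text, newline=""))
    return list(reader)


# Hypothetical sample, not the file from this thread.
sample = "Student ID,Major\n1,Physics\n2,Astronom\u00eda\n".encode("utf-8")
rows = parse_csv_bytes(sample)
print(rows[0])
print(len(rows))
```

If this step already fails or stalls, the problem is in the input or the environment rather than in pandas.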
the encoding being used is utf-8 and I wrote:
data = data.decode('utf-8')
frames = {"default": pandas.read_csv(io.StringIO(data), encoding='utf-8')}
Aside from having done this, I think it probably shouldn't matter, since StringIO outputs Unicode AND I pass a unicode string to StringIO to begin with. Why should the encoding still be specified?
well, I think you have a weird encoding. both read_csv engines (c and python) work well with the encoding option. You have to show that it IS indeed utf-8. If you cannot decode it then pandas won't help.
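One way to show that the bytes really are utf-8 is a strict decode that reports the first offending byte. This is only a sketch in Python 3; utf8_problem is a hypothetical helper name and the sample strings are invented.

```python
def utf8_problem(raw):
    """Return None if `raw` is valid UTF-8, otherwise describe the first bad byte."""
    try:
        raw.decode("utf-8")  # decoding is strict by default
    except UnicodeDecodeError as exc:
        return "invalid byte 0x%02x at offset %d" % (raw[exc.start], exc.start)
    return None


good = "caf\u00e9,1\n".encode("utf-8")
bad = "caf\u00e9,1\n".encode("latin-1")  # a lone 0xe9 byte is not valid UTF-8

print(utf8_problem(good))
print(utf8_problem(bad))
```

A None result for the uploaded file would rule out the "weird encoding" theory.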
That doesn't make sense. It works on my local machine and then doesn't work on my server, with no other variables in the code (but there are variables in what server is being used and other system things). I'm not changing the encoding or version of anything anywhere, yet just changing the system the code is run on makes it hang.
it's your server. are you sure the versions are identical, the files exactly the same, etc.?
Ah, I stand corrected on the pandas versions being the same. My local version is '0.12.0' and my server version is '0.14.0.dev' as per pandas.version. Python is 2.7.7 in both places.
0.14.0.dev hangs on my server
ok, well, if you can provide a sample file that hangs, someone can take a look, otherwise pretty tricky. just replace any data with random (and still give a file that hangs).
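A rough sketch of how the "replace any data with random" step might be done while keeping the header and the shape of each cell. It is Python 3, standard library only; scramble_csv and scramble_char are hypothetical names, and the scrambled file would still need to be re-tested to confirm it hangs.

```python
import csv
import io
import random
import string


def scramble_char(ch, rng):
    """Swap a character for a random one of the same broad kind."""
    if ord(ch) > 127 and ch.isalnum():
        return "\u00e9"  # keep a non-ASCII character so any encoding trigger survives
    if ch.isdigit():
        return rng.choice(string.digits)
    if ch.isalpha():
        return rng.choice(string.ascii_letters)
    return ch  # punctuation and whitespace are left alone


def scramble_csv(text, seed=0):
    """Return CSV text with the header kept and every data cell randomized."""
    rng = random.Random(seed)
    rows = list(csv.reader(io.StringIO(text, newline="")))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for index, row in enumerate(rows):
        if index == 0:
            writer.writerow(row)  # column names stay as they are
        else:
            writer.writerow(
                ["".join(scramble_char(c, rng) for c in cell) for cell in row]
            )
    return out.getvalue()


# Hypothetical input, not the reporter's data.
original = "Student ID,Major\n12,Physics\n34,Biology\n"
scrambled = scramble_csv(original)
print(scrambled)
```

Cell lengths, digits versus letters, and the column count are preserved, which gives the scrambled file a fair chance of still triggering the same parser path.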
with unicode in a file that hangs: http://pastebin.com/G0UTrnMK without unicode in a file that hangs: http://pastebin.com/mV35bGtV
pls post pd.show_versions()
can you explain what 'pd.show_versions()' is?
server:
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Linux
OS-release: 3.14-1-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.0.dev
nose: 1.3.3
Cython: None
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.4.2
IPython: None
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.5
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None
and it seems that show_versions is not possible to call on my local version. not sure why.
ahh that was introduced in 0.13 I think in 0.12 u can run pandas/utils/print_versions.py I think
it shows all of the installed versions and your platform info
which for a problem like this is a must
there is no such script. I have a folder in the pandas folder called 'util' that contains clipboard, counter, init, py3compat, testing, compat, decorators, misc, and terminal scripts. There is also a 'version' script but it just has the following code in it:
version = '0.12.0'
short_version = '0.12.0'
there is also an info.py but it is just a big comment. I looked elsewhere for other possible items but could not find anything.
meant util
hmm thought it was in there
no worries (maybe just post whether the platform / locale are different), as u said it works on 0.12 right?
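On an install without pd.show_versions(), the platform and locale part of that report can be approximated with the standard library. This is a sketch only: environment_report is a hypothetical helper, the keys merely mimic the labels in the server output above, and it does not list package versions.

```python
import locale
import os
import platform
import struct
import sys


def environment_report():
    """Collect basic interpreter, platform and locale details."""
    return {
        "python": platform.python_version(),
        "python-bits": struct.calcsize("P") * 8,  # pointer size in bits
        "OS": platform.system(),
        "OS-release": platform.release(),
        "machine": platform.machine(),
        "byteorder": sys.byteorder,
        "LC_ALL": os.environ.get("LC_ALL"),
        "LANG": os.environ.get("LANG"),
        "preferred-encoding": locale.getpreferredencoding(False),
    }


report = environment_report()
for key, value in report.items():
    print("%s: %s" % (key, value))
```

Running it on both machines and comparing the output side by side is usually enough to spot a locale or architecture difference. Under mod_wsgi it would have to run inside the application itself, since the web server process may see different environment variables than a login shell.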
Linux Mint 16 (Petra I think it is called), worked fine on 0.12 with no hanging. I am also using https://github.com/teddziuba/django-sslserver to run my local web server and apache with mod_wsgi to run my production server so there could be some drastic differences between how the threads and whatnot are set up.
you might want to try with 0.14 or 0.14.1 (instead of a dev version) http://pandas.pydata.org/pandas-docs/stable/whatsnew.html
that is support for full line comments (which is a possible cause of the issue). depending on your dev version this might or might not be included (not sure what snapshot u r using), this is in 0.14.1
there were no comments in the files I gave you that still hang. I'll see if my sysadmin can change the package in someway. I'll try some different versions and see what happens tomorrow.
not what I mean - there was a change in parsing to handle comments. it's possible unicode messes with this somehow, but definitely try with 0.14.0 (which doesn't have this change)
Both of your pasted files above read correctly for me on 0.14.1 (with and w/o engine=c|python).
The unicode one asked me to specify an encoding, it came up as chinese..... (I didn't write it down), I saved it as utf-8 and it worked just fine.
In [1]: !cat in.csv
Student ID,Grade,GPA,Major,Gender,Ethnicity,SAT,ACT,Math,High school,Class,Expected,TOEFL,Raw Score Pre,Pre %,Post Raw Score,Post %,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q24,Q25,Q26,Q27,Q28,Q29,Q30,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20,Q21,Q22,Q23,Q25,Q26,Q27,Q28,Q29,Q30,Bad Data,Do not import,Unknown
1,A,0,Physics,M,Caucasian,2400,36,None,0,Pre-freshman,1950,0,0,0,0,0,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,A,adsf,adsf,asd
2,A+,0.01,Astronomy,F,Asian,2399,35,Pre-algebra,1,Freshman,1951,1,1,1,1,1,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,B,asdf,234,2342
3,A-,0.02,Engineering,male,Native American,2398,34,Trigonometry,Yes,Sophmore,1952,2,2,2,2,2,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,23242,d4,432
4,B,0.03,Civil Engineering,Male,Other,2397,33,Pre-calculus,No,Junior,1953,3,3,3,3,3,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,D,234,wer,23r4
5,B+,0.04,Biology,Female,African American,2396,32,Calculus,TRUE,Senior,1954,4,4,4,4,4,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,234,asdf23,234
In [2]: read_csv('in.csv')
Out[2]:
Student ID Grade GPA Major Gender Ethnicity SAT ACT Math High school Class Expected TOEFL Raw Score Pre Pre % Post Raw Score Post % Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 ... Q8.1 Q9.1 Q10.1 Q11.1 Q12.1 Q13.1 Q14.1 Q15.1 Q16.1 Q17.1 Q18.1 Q19.1 Q20.1 Q21.1 \
0 1 A 0.00 Physics M Caucasian 2400 36 None 0 Pre-freshman 1950 0 0 0 0 0 A A A A A A A A ... A A A A A A A A A A A A A A
1 2 A+ 0.01 Astronomy F Asian 2399 35 Pre-algebra 1 Freshman 1951 1 1 1 1 1 B B B B B B B B ... B B B B B B B B B B B B B B
2 3 A- 0.02 Engineering male Native American 2398 34 Trigonometry Yes Sophmore 1952 2 2 2 2 2 C C C C C C C C ... C C C C C C C C C C C C C C
3 4 B 0.03 Civil Engineering Male Other 2397 33 Pre-calculus No Junior 1953 3 3 3 3 3 D D D D D D D D ... D D D D D D D D D D D D D D
4 5 B+ 0.04 Biology Female African American 2396 32 Calculus TRUE Senior 1954 4 4 4 4 4 E E E E E E E E ... E E E E E E E E E E E E E E
Q22.1 Q23.1 Q25.1 Q26.1 Q27.1 Q28.1 Q29.1 Q30.1 Bad Data Do not import Unknown
0 A A A A A A A A adsf adsf asd
1 B B B B B B B B asdf 234 2342
2 C C C C C C C C 23242 d4 432
3 D D D D D D D D 234 wer 23r4
4 E E E E E E E E 234 asdf23 234
[5 rows x 79 columns]
I'll see if changing to 0.14.1 fixes the issue
closing as not repro
So I have a web application where users need to be able to upload csv files. If I use the c engine the whole thing hangs indefinitely. I was able to trace the hang to read_csv.
code:
data = smart_unicode(data)  # convert to unicode properly
frames = {"default": pandas.read_csv(io.StringIO(data))}
where 'data' is a Unicode string output by Django's 'smart_unicode'. The same issue occurs if you use data = data.decode('utf-8'). The issue does not occur if I use the python engine however.
I am running an apache 2 server with django 1.6 and mod_wsgi. I have no clue what is causing this however.
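For anyone trying to reproduce this, a minimal comparison of the two engines on an in-memory unicode string could look like the sketch below. It assumes Python 3 and a current pandas rather than the 0.12 / 0.14.0.dev builds discussed here, and the sample data is invented; it only shows how the engine keyword of read_csv is passed and is not a fix for the hang.

```python
import io

import pandas as pd

# Hypothetical stand-in for the decoded upload; not the reporter's file.
data = u"Student ID,Major\n1,Physics\n2,Astronom\u00eda\n"

# Same text parsed once with each engine; StringIO is rebuilt for each call.
frame_c = pd.read_csv(io.StringIO(data), engine="c")
frame_py = pd.read_csv(io.StringIO(data), engine="python")

print(frame_py)
print(frame_py.equals(frame_c))
```

If the c engine call stalls under mod_wsgi but not in a plain interpreter on the same machine, that would point at the server setup rather than at the data.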