pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Unable to read some files from remote drive #12020

Closed kpillman closed 8 years ago

kpillman commented 8 years ago

md5sum:
bb39ad5a080b647ecf245ae126a8eb93 remote_drive/mir-count.csv
bb39ad5a080b647ecf245ae126a8eb93 local_drive/mir-count.csv

In Python:

import pandas as pd

print pd.__version__  # Shows 0.17.1

LOCAL_FILE = "local_drive/mir-count.csv"
REMOTE_FILE = "remote_drive/mir-count.csv"

pd.read_csv(LOCAL_FILE)  # File reads without errors.

for l in open(mir_expression_file, 'r'):
    print l  # Prints lines of file, no errors.

pd.read_csv(REMOTE_FILE)  # Exception stack trace:

Traceback (most recent call last):
  File "/home/kpillman/localwork/bioinformatics/scripts/project_specific/conn_circrna/circrna_count_mir_targets.py", line 22, in <module>
    mir_expression_df = pd.read_csv(mir_expression_file, index_col=0)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, self.options)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 518, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5030)
ValueError: No columns to parse from file

Notes: Not every file on the mounted drive fails to be read. Many do fail, though, and which files fail and which do not is reproducible.

I am using cifs. The /etc/fstab line is:
//sacgf.ersa.edu.au/sacgf /data/sacgf cifs uid=kpillman,gid=kpillman,credentials=/home/kpillman/.smbcredentials,_netdev 0 0

jreback commented 8 years ago

If the file is read correctly locally and local path == remote path (at least in your description) then not sure why this would be a pandas issue

kpillman commented 8 years ago

Sorry, I used the 'greater than' and 'less than' symbols and consequently the distinguishing part of the path was formatted to be invisible.

jreback commented 8 years ago

Not really sure what your issue is, as your example is not reproducible.

davmlaw commented 8 years ago

Loading a file off a local disk is fine. Loading the same file from a remote disk (mounted in Linux using CIFS) causes Pandas to throw an error (No columns to parse from file) or, worst of all, to SILENTLY TRUNCATE DATA.

No Unix tool shows this behavior (you can diff/cat/md5sum the file all day), nor does any other program (e.g. opening the CSV in Open Office). It's only with Pandas.

Yes, it's hard to reproduce due to perhaps needing a CIFS filesystem under heavy load, but don't you think silently truncating user data is sufficiently bad to investigate?

If you have any questions, or would like us to run debug code etc., we are happy to.

jreback commented 8 years ago

that's hardly proof

you are welcome to investigate

but we would need a reproduction in order to test

kpillman commented 8 years ago

I understand that you think it will be too difficult to investigate without a reproducible example.

Are you also saying that you don't believe we have sufficient evidence to prove this is a pandas problem?

If you are not yet convinced that this is a pandas problem, what can we do at this end to confirm or disprove this to you?

Even if you choose not to investigate this, for the sake of the community it would be best to confirm whether this is a pandas problem or not.

jreback commented 8 years ago

You have not proved it's a pandas problem, since it works on the local drive. You would have to debug it and see exactly what is going on.

Try things like reading with nrows=, e.g. read parts of the file, then use skiprows to read other parts of the file. You can also pass it an open file handle (rather than an actual filename).

I suppose it's possible it's something pandas is doing, but I would say it's highly unlikely, as this is the first report like this I have seen in the last 4 years.
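
The nrows/skiprows and file-handle suggestions above could be combined roughly as follows. This is only a sketch written against a current pandas and Python 3, not the 0.17.1 / Python 2.7 setup in the report; the helper name read_in_slices and the chunk_rows parameter are made up for illustration, and it assumes a single header row.

```python
import pandas as pd


def read_in_slices(path, chunk_rows=1000):
    """Read a CSV in consecutive slices of at most chunk_rows data rows.

    Hypothetical debugging helper: each slice is read through an open
    file handle, using skiprows to skip the rows already read and nrows
    to cap the slice size. Assumes line 0 is the only header row.
    """
    frames = []
    start = 0  # number of data rows already read
    while True:
        with open(path, "r", newline="") as fh:
            try:
                part = pd.read_csv(
                    fh,
                    header=0,
                    # skip data lines 1..start, keep the header on line 0
                    skiprows=list(range(1, start + 1)),
                    nrows=chunk_rows,
                )
            except pd.errors.EmptyDataError:
                # nothing parseable at all (e.g. an empty file)
                break
        if part.empty:
            break
        frames.append(part)
        start += len(part)
    return frames
```

Running this against both the local and the remote copy and comparing the slice lengths would show at which row the two reads start to disagree, if they do.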

kpillman commented 8 years ago

Thanks for the suggestions, they are helpful. I had not thought of using a file handle.

When testing the file handle method, I ran across what seems like a solution to the 'ValueError: No columns to parse from file' problem, but I don't understand what this tells us about why it failed in the first place, or why some files fail and some pass on the remote system (while all pass on a local one).

Long story short, this works: df = pd.read_csv(REMOTE_FILE, engine='python')

While this produces the ValueError: df = pd.read_csv(REMOTE_FILE)

Do you know why certain remote files seem to require engine='python' to be read?

jreback commented 8 years ago

The default is to read with engine='c', which is the high-performance C engine. engine='python' is a Python-based reader (quite a bit slower, but it exists for compatibility).

I would suspect it's either a buffering problem, or the file is terminating early (e.g. sending EOF).
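
One rough way to look for an early EOF, independent of pandas, is to compare the number of bytes actually returned by read() with the size the filesystem reports. The function names and the block size below are illustrative assumptions, and the check only exercises Python's own file reads, not the C parser's read path, so a clean result does not rule out a problem there.

```python
import os


def count_bytes_read(path, block_size=1 << 20):
    """Return how many bytes read() actually delivers before EOF."""
    total = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:  # empty bytes object means EOF
                break
            total += len(block)
    return total


def reads_whole_file(path, block_size=1 << 20):
    """True if the bytes delivered match the size reported by stat()."""
    return count_bytes_read(path, block_size) == os.path.getsize(path)
```

Varying block_size (small versus large) on the remote path may help show whether short reads depend on the request size.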

kpillman commented 8 years ago

Thanks again for the ideas.

By the way, the way I figured this out was that when I used a file handle, the error message was helpful: pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

I found that using engine='python' fixed the problem using either the file handle or file path. Is it feasible to change the error message raised when read_csv is used with a file path so that it includes the engine='python' suggestion?
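
Until the message changes, a caller could apply the same workaround in code. The sketch below assumes a current pandas, where the parser errors are subclasses of ValueError (as the ValueError in the original traceback also is); read_csv_with_fallback is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Try the C engine first, then retry with the Python engine.

    Hypothetical wrapper: it only helps when the C engine raises.
    It cannot detect a read that succeeds but returns truncated data.
    """
    try:
        return pd.read_csv(path, engine="c", **kwargs)
    except ValueError as exc:
        print(f"C engine failed ({exc}); retrying with engine='python'")
        return pd.read_csv(path, engine="python", **kwargs)
```

The retry costs a second pass over the file and the Python engine is slower, so this is a stopgap rather than a fix for the underlying read failure.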

jreback commented 8 years ago

How exactly are you using a file handle? It should be:

with open(...) as fh:
    result = pd.read_csv(fh)

kpillman commented 8 years ago

I was doing it like this:

fh1 = open(REMOTE_FILE, 'r')
df = pd.read_csv(fh1, engine='python')

But your format gets the same result.

kpillman commented 8 years ago

For the record, using engine='python' also seems to have fixed the problem of some files being silently truncated while being read.
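
Since plain line-by-line reads of the same file worked, a line count can serve as a cross-check against silent truncation. This is a sketch under stated assumptions: one record per line, a single header row, and no quoted newlines inside fields. The helper names are invented for illustration.

```python
import pandas as pd


def count_data_lines(path):
    """Count non-blank lines, minus one for the header row."""
    with open(path, "rb") as fh:
        n = sum(1 for line in fh if line.strip())
    return max(n - 1, 0)


def read_csv_checked(path, **kwargs):
    """Read a CSV and raise if the row count disagrees with the line count."""
    df = pd.read_csv(path, **kwargs)
    expected = count_data_lines(path)
    if len(df) != expected:
        raise ValueError(
            f"{path}: parsed {len(df)} rows but counted {expected} data lines"
        )
    return df
```

For example, read_csv_checked(REMOTE_FILE) and read_csv_checked(REMOTE_FILE, engine='python') could be run side by side to see whether only the default engine loses rows.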