Closed kpillman closed 8 years ago
If the file is read correctly locally and the local path == the remote path (at least in your description), then I'm not sure why this would be a pandas issue.
Sorry, I used the 'greater than' and 'less than' symbols and consequently the distinguishing part of the path was formatted to be invisible.
Not really sure what your issue is, as your example is not reproducible.
Loading a file off a local disk is fine. Loading the same file from a remote disk (mounted in Linux using CIFS) causes pandas to throw an error ("No columns to parse from file") or, worst of all, to SILENTLY TRUNCATE DATA.
No Unix tools experience this behavior (you can diff/cat/md5sum it all day), nor does any other program (e.g. opening the CSV in OpenOffice). It happens only with pandas.
Yes, it's hard to reproduce due to perhaps needing a CIFS filesystem under heavy load, but don't you think silently truncating user data is sufficiently bad to investigate?
If you have any questions, or would like us to run debug code etc we are happy to.
that's hardly proof
you are welcome to investigate
but we would need a reproduction in order to test
I understand that you think it will be too difficult to investigate without a reproducible example.
Are you also saying that you don't believe we have sufficient evidence to prove this is a pandas problem?
If you are not yet convinced that this is a pandas problem, what can we do at our end to confirm or disprove this for you?
Even if you choose not to investigate this, for the sake of the community it would be best to confirm whether this is a pandas problem or not.
You have not proved it's a pandas problem, since it works on the local drive. You would have to debug it and see exactly what is going on.
Try things like reading with nrows= to read parts of the file, then use skiprows= to read other parts of the file. You can also pass an open file handle (rather than an actual filename).
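To make those suggestions concrete, here is a minimal sketch; the in-memory buffer and the CSV contents are my own stand-ins for the remote file, not data from this issue:

```python
import io

import pandas as pd

# Hypothetical CSV standing in for the remote file.
csv_text = "gene,count\nmir-1,10\nmir-2,20\nmir-3,30\nmir-4,40\n"

# Read only the first two data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Read the remaining rows by skipping the two already read
# (line 0 is the header, so skip lines 1 and 2).
tail = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 3))

# Passing an open file handle instead of a filename also works.
with io.StringIO(csv_text) as fh:
    full = pd.read_csv(fh)

print(len(head), len(tail), len(full))  # 2 2 4
```

Reading the file in pieces like this can narrow down which region of the remote file triggers the failure.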
I suppose it's possible it's something pandas is doing, but I would say it's highly unlikely, as this is the first report like this I have seen in the last 4 years.
Thanks for the suggestions, they are helpful. I had not thought of using a file handle.
When testing the file-handle method, I ran across what seems like a solution to the 'ValueError: No columns to parse from file' problem, but I don't understand what this tells us about why it failed in the first place, or why some files fail and some pass on the remote system (while all pass on a local one).
Long story short, this works:
df = pd.read_csv(REMOTE_FILE, engine='python')
While this produces the ValueError:
df = pd.read_csv(REMOTE_FILE)
Do you know why certain remote files seem to require engine='python' to be read?
The default is to read with engine='c', which is the high-performance C engine. engine='python' is a Python-based reader (quite a bit slower, but it exists for compat).
I would suspect it's either a buffering problem, or the file is terminating early (e.g. sending EOF).
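For reference, the two engines are selected with the engine= keyword and should produce identical frames on well-formed input; this is a sketch with made-up data, not the files from this issue:

```python
import io

import pandas as pd

csv_text = "a,b\n1,2\n3,4\n"

# Default high-performance C engine.
df_c = pd.read_csv(io.StringIO(csv_text), engine='c')

# Slower pure-Python engine, more tolerant of odd sources.
df_py = pd.read_csv(io.StringIO(csv_text), engine='python')

print(df_c.equals(df_py))  # True
```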
Thanks again for the ideas.
By the way, the way I figured this out was that when I used a file handle, the error message was helpful:
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
I found that using engine='python' fixed the problem with either the file handle or the file path. Would it be feasible to change the error message shown when read_csv is used with a file path to include the engine='python' suggestion?
How exactly are you using a file handle? It should be:
with open(...) as fh:
result = pd.read_csv(fh)
I was doing it like this:
fh1 = open(REMOTE_FILE, 'r')
df = pd.read_csv(fh1, engine='python')
But your form gives the same result.
For the record, using engine='python' also seems to have fixed the problem of some files being silently truncated while being read.
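A defensive pattern one could adopt here (my own sketch, not something proposed in this thread) is to try the default C engine and fall back to engine='python' when it fails. Note the exception class: in pandas 0.17 it was pandas.parser.CParserError, while modern pandas raises pd.errors.ParserError (a ValueError subclass). The helper name read_csv_with_fallback is hypothetical:

```python
import io

import pandas as pd


def read_csv_with_fallback(path_or_buf, **kwargs):
    """Try the fast C engine first; retry with the Python engine on failure."""
    try:
        return pd.read_csv(path_or_buf, **kwargs)
    except (ValueError, pd.errors.ParserError):
        # Rewind if we were handed an open file-like object.
        if hasattr(path_or_buf, 'seek'):
            path_or_buf.seek(0)
        return pd.read_csv(path_or_buf, engine='python', **kwargs)


df = read_csv_with_fallback(io.StringIO("a,b\n1,2\n"))
print(df.shape)  # (1, 2)
```

This keeps the fast path for well-behaved files while still reading the problematic remote ones.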
md5sum:
bb39ad5a080b647ecf245ae126a8eb93  remote_drive/mir-count.csv
bb39ad5a080b647ecf245ae126a8eb93  local_drive/mir-count.csv
In python:
import pandas as pd
print pd.__version__  # Shows 0.17.1
LOCAL_FILE = "local_drive/mir-count.csv"
REMOTE_FILE = "remote_drive/mir-count.csv"
pd.read_csv(LOCAL_FILE)  # File reads without errors.
for l in open(mir_expression_file, 'r'): print l  # Prints lines of file, no errors.
pd.read_csv(REMOTE_FILE)  # Exception stack trace:
Traceback (most recent call last):
  File "/home/kpillman/localwork/bioinformatics/scripts/project_specific/conn_circrna/circrna_count_mir_targets.py", line 22, in <module>
    mir_expression_df = pd.read_csv(mir_expression_file, index_col=0)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, self.options)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 518, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5030)
ValueError: No columns to parse from file
Notes: Not every file on the mounted drive fails to be read. Many do fail though, and it is reproducible which files will fail and which will not.
I am using cifs. /etc/fstab line:
//sacgf.ersa.edu.au/sacgf /data/sacgf cifs uid=kpillman,gid=kpillman,credentials=/home/kpillman/.smbcredentials,_netdev 0 0