moj-analytical-services / pydbtools

Python version of dbtools
https://moj-analytical-services.github.io/pydbtools/

FileNotFoundError when using s3fs >= 0.3.0 #10

Closed 5 years ago by isichei

isichei commented 5 years ago

s3fs versions >= 0.3.0 are currently causing issues (see the related pandas and s3fs issues).

Note: to reproduce each error, restart your kernel before each run.

Error 1:

import pydbtools as pydb

# Works
df1 = pydb.read_sql("SELECT * FROM db.table limit 10")

# Errors (same traceback as issues referenced above)
df2 = pydb.read_sql("SELECT * FROM db.table limit 10")

Error 2:

from gluejobutils import s3
import pandas as pd

no_file = "s3://alpha-everyone/does_not_exist.csv"
is_file = "s3://alpha-everyone/iris.csv"

pd.read_csv(no_file) # Same error as above

Error 3:

from gluejobutils import s3
import pandas as pd

no_file = "s3://alpha-everyone/does_not_exist.csv"
is_file = "s3://alpha-everyone/iris.csv"

df1 = pd.read_csv(is_file) # works as expected
s3.copy_s3_object(is_file, no_file)

df2 = pd.read_csv(no_file) # same error

The issue seems to be that s3fs caches the bucket's object listing. The first call works because there is no cache yet; the second call to the same bucket hits the stale cached listing, which does not contain the file (it was created after the listing was cached).
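As a quick sanity check of the caching theory (a sketch, untested; it assumes fsspec's instance caching means s3fs.S3FileSystem() returns the same object pandas uses internally, and that anon=False matches what pandas passes), invalidating the cache between the two reads should make the Error 3 repro pass:

import pandas as pd
import s3fs
from gluejobutils import s3

no_file = "s3://alpha-everyone/does_not_exist.csv"
is_file = "s3://alpha-everyone/iris.csv"

df1 = pd.read_csv(is_file)          # populates the s3fs directory-listing cache
s3.copy_s3_object(is_file, no_file)

fs = s3fs.S3FileSystem(anon=False)  # should be the same cached instance pandas holds
fs.invalidate_cache()               # drop the stale listing

df2 = pd.read_csv(no_file)          # should now find the newly created file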

I think the interim solution for now is to get pydbtools to set up its own s3fs.S3FileSystem and pass that to pandas.read_csv, so that we can force s3fs to clear its cache (i.e. copy this).
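A minimal sketch of that interim fix (hypothetical helper name; assumes s3fs 0.3.x's invalidate_cache() and open(), and that the file object can be handed straight to pandas):

import pandas as pd
import s3fs

def read_csv_no_cache(s3_path, **kwargs):
    # Hypothetical pydbtools-internal helper: create/fetch an
    # S3FileSystem, clear its directory-listing cache so objects
    # written after the last listing are visible, then hand an open
    # file object to pandas instead of letting pandas call s3fs itself.
    fs = s3fs.S3FileSystem()
    fs.invalidate_cache()
    with fs.open(s3_path, "rb") as f:
        return pd.read_csv(f, **kwargs)

pydb.read_sql would then call read_csv_no_cache on the Athena result path rather than passing the s3:// URL straight to pandas.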

Currently using s3fs 0.3.4, pandas 0.25, pydbtools 1.0.2, gluejobutils 3.0.0.

isichei commented 5 years ago

Also using fsspec 0.3.6.