Currently, s3fs versions >= 0.3.0 are causing issues (see the related pandas and s3fs issues).

Note: to reproduce each error, restart your kernel before each run.
Error 1:

```python
import pydbtools as pydb

# Works
df1 = pydb.read_sql("SELECT * FROM db.table limit 10")

# Errors (same traceback as the issues referenced above)
df2 = pydb.read_sql("SELECT * FROM db.table limit 10")
```
Error 2:

```python
from gluejobutils import s3
import pandas as pd

no_file = "s3://alpha-everyone/does_not_exist.csv"
is_file = "s3://alpha-everyone/iris.csv"

pd.read_csv(no_file)  # Same error as above
```
Error 3:

```python
from gluejobutils import s3
import pandas as pd

no_file = "s3://alpha-everyone/does_not_exist.csv"
is_file = "s3://alpha-everyone/iris.csv"

df1 = pd.read_csv(is_file)  # Works as expected
s3.copy_s3_object(is_file, no_file)
df2 = pd.read_csv(no_file)  # Same error
```
The issue seems to be that s3fs caches the list of objects in the bucket. The first call works because there is no cache yet; a second call to the same bucket hits the cache, which does not contain the file (it was created after the cache was populated).

I think the interim solution is for pydbtools to set up its own s3fs.S3FileSystem and pass that to pandas.read_csv, so that we can force s3fs to clear its cache (i.e. copy this).
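A minimal sketch of that workaround follows. To be clear, this is not pydbtools' actual implementation: the helper name `read_csv_no_cache` is made up here, and it assumes `S3FileSystem.invalidate_cache()` is available on the s3fs release in use (it is on the fsspec-based 0.3.x line).

```python
def read_csv_no_cache(path):
    """Read a CSV from S3, forcing s3fs to drop its cached listings first.

    Hypothetical helper illustrating the proposed fix: clear the s3fs
    listing cache so recently created objects are visible, then hand
    the open file to pandas instead of letting it build its own
    (cached) filesystem.
    """
    import pandas as pd
    import s3fs

    fs = s3fs.S3FileSystem()
    fs.invalidate_cache()  # discard cached directory/object listings
    with fs.open(path, "rb") as f:
        return pd.read_csv(f)
```

With a helper like this, the Error 3 reproduction above should succeed, since the second read no longer consults the stale listing cache.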
Currently using s3fs 0.3.4, pandas 0.25., pydbtools 1.0.2, gluejobutils 3.0.0