pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

read_csv s3 file parameter memory_map default is True? #30555

Open autocyz opened 4 years ago

autocyz commented 4 years ago

Code Sample

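import time

import pandas as pd

# code 1: memory_map left at its default
def get_pid():
    info = pd.read_csv('s3://vision.algo.data/cyz/tmp/test/11.csv', header=None, names=['pid'])
    print(f'pid num : {len(info)}')

# code 2: memory_map=False passed explicitly (use instead of code 1, not together with it)
def get_pid():
    info = pd.read_csv('s3://vision.algo.data/cyz/tmp/test/11.csv', header=None, names=['pid'], memory_map=False)
    print(f'pid num : {len(info)}')

# Re-read the file every 4 seconds
while True:
    get_pid()
    time.sleep(4)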

Problem description

When I use code 1 and the S3 file changes, info does not change. Code 2 does pick up the change. The only difference between the two is the memory_map parameter, but the documented default for memory_map is already False, so I am confused.
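One way to confirm the default that pandas actually declares is to inspect the function signature. The sketch below is only illustrative: `default_of` is a hypothetical helper name, and the pandas check runs only if pandas is importable in the current environment.

```python
import inspect


def default_of(func, param_name):
    """Return the default value declared for one parameter of func."""
    return inspect.signature(func).parameters[param_name].default


# Optional check against the installed pandas (skipped if pandas is missing).
try:
    import pandas as pd
except ImportError:
    pd = None

if pd is not None:
    print("read_csv memory_map default:", default_of(pd.read_csv, "memory_map"))
```

If this reports False, code 1 and code 2 pass the same value to read_csv, and the difference in behavior has to come from somewhere else.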

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-1098-aws
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 42.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.10.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : 0.4.0
scipy : 1.3.3
sqlalchemy : 1.3.11
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Liam3851 commented 4 years ago

@autocyz How quickly did you read the file after you changed it? S3 does not guarantee read-after-update consistency (see https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel). If you let less than 2-3 minutes go by between your update and running get_pid() you might have gotten a stale copy.

autocyz commented 4 years ago

@Liam3851 With code 1 the change on S3 is never picked up, no matter how long I wait, but with code 2 it is. I don't think this is about S3 update consistency; I think it is about the s3fs in-memory cache.

TomAugspurger commented 4 years ago

Do you think it's an s3fs issue then? s3fs caches file listings, and the S3FileSystem instances themselves are cached. You could try S3FileSystem.clear_instance_cache().
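The suggestion above can be tried with a small helper that clears the cached filesystem instances before each read. This is a sketch under assumptions, not a confirmed fix: `clear_s3fs_caches` and `get_pid_fresh` are hypothetical names, and `clear_instance_cache` is assumed to be the classmethod that fsspec-based s3fs releases inherit from fsspec's `AbstractFileSystem`. Nothing in this thread confirms that it resolves the stale reads.

```python
import importlib


def clear_s3fs_caches():
    """Drop cached S3FileSystem instances, if s3fs is installed.

    Returns True when the cache was cleared, False when s3fs is unavailable.
    """
    try:
        s3fs = importlib.import_module("s3fs")
    except ImportError:
        return False
    # Assumption: classmethod inherited from fsspec.AbstractFileSystem.
    # A new instance created afterwards starts with an empty listings cache.
    s3fs.S3FileSystem.clear_instance_cache()
    return True


def get_pid_fresh(path):
    """Read the CSV at path after clearing s3fs caches; return the row count."""
    import pandas as pd  # imported lazily so the helper above works without pandas

    clear_s3fs_caches()
    info = pd.read_csv(path, header=None, names=["pid"])
    return len(info)
```

Calling `get_pid_fresh('s3://vision.algo.data/cyz/tmp/test/11.csv')` in the polling loop in place of `get_pid()` would show whether the stale result comes from s3fs caching rather than from memory_map.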

autocyz commented 4 years ago

The default memory_map in pd.read_csv does not seem to take effect. If I explicitly set memory_map to False, it works.