pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Merge "cross" will assert on systems with FIPS enforcement #48024

Closed szelenka closed 2 years ago

szelenka commented 2 years ago

Reproducible Example

import ssl
import pandas as pd

# if you have a FIPS capable SSL library
ssl.FIPS_mode_set(1)

df = pd.merge(
  pd.DataFrame([...]),
  pd.DataFrame([...]),
  how='cross'
)

Issue Description

When performing a cross merge on a system with FIPS mode enabled, pandas raises a ValueError because it tries to use a non-secure hashing algorithm (md5) that FIPS disables:

>       cross_col = f"_cross_{hashlib.md5().hexdigest()}"
E       ValueError: [digital envelope routines] disabled for fips

https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/merge.py#L1295-L1314

The only workaround is to disable FIPS mode, which is insecure.

Expected Behavior

pandas should be able to perform a cross merge on systems operating in FIPS mode. md5 is a forbidden algorithm in FIPS environments, so the easiest fix is to replace the md5 hash with sha256 (or stronger).
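
A minimal sketch of that suggestion, assuming the hash is only there to build an unlikely-to-collide temporary column name. The helper name `make_cross_col` is hypothetical and not part of pandas:

```python
import hashlib


def make_cross_col() -> str:
    """Build a temporary column name without touching md5.

    Hypothetical helper for illustration; not pandas' actual code.
    sha256 is a FIPS-approved algorithm, so this call is expected to
    succeed even when FIPS mode is enforced.
    """
    # No data is fed to the hash, so the digest is the same on every call.
    return f"_cross_{hashlib.sha256().hexdigest()}"


cross_col = make_cross_col()
```

Because nothing is hashed, the resulting name is a fixed string, just as it is with the md5 version shown in the traceback; only the algorithm changes.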

Installed Versions

python -c 'import pandas as pd; pd.show_versions()'

INSTALLED VERSIONS
------------------
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.104-linuxkit
Version : #1 SMP Wed Mar 9 19:05:23 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 63.4.2
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.7.1
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 1.4.40
tables : None
tabulate : 0.8.10
xarray : None
xlrd : None
xlwt : None
zstandard : None

szelenka commented 2 years ago

You could add something similar to this: https://github.com/dask/dask/blob/49d11a4e6cd9f2296f73c250fb636b679becd871/dask/base.py#L911-L916

That code explicitly sets usedforsecurity=False, which fits here because the hash is only used to name an ephemeral column for joining the two dataframes.
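
Roughly, the idea could look like the sketch below. It is not the dask code itself; the helper name `md5_not_for_security` is hypothetical. One assumption worth flagging: the `usedforsecurity` keyword only exists in `hashlib` from Python 3.9 onward, so the sketch falls back to a plain call on older interpreters, where the FIPS error would presumably still occur.

```python
import hashlib


def md5_not_for_security(data: bytes = b""):
    """Return an md5 hash object flagged as non-security use.

    Hypothetical helper for illustration. `usedforsecurity` was added to
    the hashlib constructors in Python 3.9; older interpreters raise
    TypeError for the unknown keyword, so fall back to the plain call.
    """
    try:
        return hashlib.md5(data, usedforsecurity=False)
    except TypeError:
        # Python < 3.9: no way to mark the hash as non-security use.
        return hashlib.md5(data)


# Same column-name pattern as in the traceback, built via the helper.
cross_col = f"_cross_{md5_not_for_security().hexdigest()}"
```

The digest is unchanged by the flag, so the generated column name stays the same as before; the flag only tells the underlying crypto library that md5 is not being relied on for security.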

mroeschke commented 2 years ago

Thanks for the report. Makes sense to add the usedforsecurity=False like dask did.

There's another md5 usage here too: https://github.com/pandas-dev/pandas/blob/6db95e79297a4dcefff53ca151324949b0c956c3/pandas/tests/io/pytables/test_store.py#L65

Pull requests welcome!