pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

pd.Series.reindex is not thread safe. #25870

Closed allComputableThings closed 5 years ago

allComputableThings commented 5 years ago

Code Sample, a copy-pastable example if possible

import traceback
import pandas as pd
import numpy as np
from multiprocessing.pool import ThreadPool

def f(arg):
    s,idx = arg
    try:
        # s.loc[idx]   # No problem
        s.reindex(idx) # Fails
    except Exception:
        traceback.print_exc()
    return None

def gen_args(n=10000):
    a = np.arange(0, 3000000)
    for i in range(n):  # range runs on both Python 2 and Python 3
        if i%1000 == 0:
            # print "?",i
            s = pd.Series(data=a, index=a)
            f((s,a)) # <<< LOOK. IT WORKS HERE!!!
        yield s, np.arange(0,1000)

# for arg in gen_args():
#     f(arg)   # Works just fine

t = ThreadPool(4)
for result in t.imap(f, gen_args(), chunksize=1):
    pass

Problem description

pd.Series.reindex fails in a multi-threaded application.

This is a little surprising since I'm not asking for any writes.

The error also seems bogus: 'cannot reindex from a duplicate axis' ... the series does not have any duplicate axis and I was able to call s.reindex(idx) in the main thread before the same failed in the pool's thread.

  File "<ipython-input-8-4121235a46fa>", line 6, in f
    s.reindex(idx).values # Fails
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 2681, in reindex
    return super(Series, self).reindex(index=index, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 3023, in reindex
    fill_value, copy).__finalize__(self)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 3041, in _reindex_axes
    copy=copy, allow_dups=False)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 3145, in _reindex_with_indexers
    copy=copy)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 4139, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 2944, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
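
The claim that the index has no duplicates can be checked directly in the main thread before any worker thread is involved. This is a minimal sketch, assuming a smaller array than the one in the report to keep it quick; `Index.is_unique` is the property that reports whether the index labels are unique.

```python
import numpy as np
import pandas as pd

# Smaller than the 3,000,000 elements in the report (an assumption for speed).
a = np.arange(0, 3000)
s = pd.Series(data=a, index=a)

# True when no index label is repeated.
unique_before = s.index.is_unique

# Single-threaded reindex against a subset of the existing labels.
reindexed = s.reindex(np.arange(0, 1000))
```

If `unique_before` is true and the single-threaded call succeeds, a later "duplicate axis" error inside the pool points at the threading, not at the data.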

Expected Output

Program should output nothing.

Output of pd.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.15.candidate.1
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 18.1
setuptools: 40.6.2
Cython: 0.29.1
numpy: 1.16.1
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.5.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```

jreback commented 5 years ago

virtually no pandas functions are thread-safe, because .copy() is not; see https://github.com/pandas-dev/pandas/issues/2728
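
A common workaround when a shared Series must be used from several threads is to serialize access with a lock. The sketch below assumes that approach; `safe_reindex` and `_reindex_lock` are hypothetical names, not pandas APIs, and the array sizes are smaller than in the report.

```python
import threading
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd

# Hypothetical module-level lock; not part of pandas.
_reindex_lock = threading.Lock()


def safe_reindex(s, idx):
    """Reindex while holding the lock so only one thread touches `s` at a time."""
    with _reindex_lock:
        return s.reindex(idx)


a = np.arange(0, 30000)
s = pd.Series(data=a, index=a)
idx = np.arange(0, 1000)

pool = ThreadPool(4)
try:
    # Every task shares the same Series, as in the report.
    results = pool.map(lambda _: safe_reindex(s, idx), range(100))
finally:
    pool.close()
    pool.join()
```

The lock removes the parallelism for the guarded call, so this trades speed for safety. Giving each worker its own Series built in the main thread is another option.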

allComputableThings commented 5 years ago

Not very satisfactory - especially for non-mutating operations.

Since the bug you referenced is still open, could we keep this one open?

jreback commented 5 years ago

Since the bug you referenced is still open, could we keep this one open?

so we will have 1 more issue, what's the purpose? this is a duplicate issue

allComputableThings commented 5 years ago
jreback commented 5 years ago

you are welcome to submit a PR if you want to provide a test

this is a duplicate of an unfixed issue

we have 2900 issues - would welcome help doing things here - sure, reporting bugs is great, but pandas is all volunteer for anything else

allComputableThings commented 5 years ago

Would love to.

If I had any idea why reading from an object that no-one is writing to would not be safe, I'd think about fixing it.
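
One plausible explanation, not confirmed in this thread, is that some operations that look read-only populate internal caches on first use, so concurrent "reads" can still race on a write. The toy class below illustrates the pattern and a lock-based guard; `LazyIndex` is a hypothetical stand-in and is not pandas code.

```python
import threading


class LazyIndex:
    """Toy index that builds its lookup table on first read (hypothetical)."""

    def __init__(self, labels):
        self._labels = list(labels)
        self._table = None  # populated lazily by the first "read"
        self._lock = threading.Lock()

    def _build_table(self):
        # Map each label to its last position; duplicates collapse to one key.
        table = {}
        for pos, label in enumerate(self._labels):
            table[label] = pos
        return table

    def is_unique(self):
        # Looks like a pure read, but the first call writes self._table.
        # The lock keeps two threads from building or observing the table
        # at the same time.
        with self._lock:
            if self._table is None:
                self._table = self._build_table()
            return len(self._table) == len(self._labels)


idx = LazyIndex(range(1000))
results = []


def worker():
    results.append(idx.is_unique())


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, a thread could in principle see a table that another thread is still filling in and report a wrong answer, which is the kind of failure a spurious "duplicate axis" error would be consistent with.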