pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: add axis argument to .drop_duplicates and .duplicated #11250

Open 121onto opened 8 years ago

121onto commented 8 years ago

Here is sample code that finds duplicate columns in a DataFrame based on their values (useful for cleaning data):

def duplicate_columns(frame):
    # group columns by dtype so only same-typed columns are compared
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")

        vs = list(dcols.values())   # list() so the views are indexable
        ks = list(dcols.keys())
        lvs = len(vs)

        # pairwise comparison within the dtype group
        for i in range(lvs):
            for j in range(i + 1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break

    return dups
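
For example, to drop the redundant columns once they are identified (the column names in the comment are just illustrative):

dups = duplicate_columns(frame)     # e.g. ['objects1', 'floats1']
frame = frame.drop(dups, axis=1)    # keep one copy of each duplicated column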

I've seen others suggest something like df.T.drop_duplicates().T. However, transposing is a bad idea when working with large DataFrames: it copies all of the data, and a mixed-dtype frame is upcast to object along the way.

I would add a pull request, but I'm not sure I even know what that means.

121onto commented 8 years ago

This one is more memory efficient:

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        # materialize one column at a time instead of converting the
        # whole dtype group to a dict of lists up front
        for i in range(lcs):
            iv = vs.iloc[:, i].tolist()
            for j in range(i + 1, lcs):
                jv = vs.iloc[:, j].tolist()
                if iv == jv:
                    dups.append(cs[i])
                    break

    return dups

jreback commented 8 years ago

@121onto what is the purpose of this issue?

you would need to benchmark this (including with mixed data), pass tests, etc.

121onto commented 8 years ago

Thanks @jreback. Here is modified code with a test:

from pandas import DataFrame
# array_equivalent treats NaNs in matching positions as equal; note it is an
# internal pandas API (it later moved to pandas.core.dtypes.missing)
from pandas.core.common import array_equivalent
from pandas.util.testing import assert_equal
import numpy as np

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():

        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)

        for i in range(lcs):
            ia = vs.iloc[:, i].values
            for j in range(i + 1, lcs):
                ja = vs.iloc[:, j].values
                # array-level comparison that also matches NaN positions
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break

    return dups

def test_duplicate_columns():
    data = {
        'objects1': ['A', 'B', 'C', 'C', 'B', 'A'],
        'objects2': ['A', 'B', 'C', 'B', 'B', 'A'],
        'objects_dup': ['A', 'B', 'C', 'C', 'B', 'A'],
        'bools1': [True, False, True, True, False, False],
        'bools2': [True, False, True, True, True, False],
        'bools_dup': [True, False, True, True, False, False],
        'floats1': [1., 2., 3., 3., 2., 1.],
        'floats2': [1., 3., 3., 5., 2., 1.],
        'floats_dup': [1., 2., 3., 3., 2., 1.],
        'floats_withnan1': [1., 2., 3., 3., np.nan, 1.],
        'floats_withnan2': [1., np.nan, 3., 3., np.nan, 1.],
        'floats_withnan3': [1., 2., 3., 3., 3., 1.],
        'floats_withnan_dup': [1., 2., 3., 3., np.nan, 1.],
        'integers1': [1, 2, 3, 3, 2, 1],
        'integers2': [1, 2, 2, 2, 2, 1],
        'integers_dup': [1, 2, 3, 3, 2, 1],
    }

    frame = DataFrame(data)
    frame = frame.sort_index(axis=1)

    dups = duplicate_columns(frame)
    expected = [
        'objects1',
        'bools1',
        'floats1',
        'floats_withnan1',
        'integers1',
    ]

    assert_equal(sorted(dups), sorted(expected))

jreback commented 8 years ago

so what are the timings / memory figures?

121onto commented 8 years ago

A timing comparison with the .T approach:

import pandas as pd  # pd.concat is used below

def transpose_duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []

    for t, v in groups.items():
        # transpose the dtype group and reuse the row-wise duplicated()
        tframe = frame[v].T
        dups.extend(list(tframe.loc[tframe.duplicated(), :].index))

    return dups

def profile_duplicate_columns(func, size=1):
    # same test data as in test_duplicate_columns above
    data = {
        'objects1': ['A', 'B', 'C', 'C', 'B', 'A'],
        'objects2': ['A', 'B', 'C', 'B', 'B', 'A'],
        'objects_dup': ['A', 'B', 'C', 'C', 'B', 'A'],
        'bools1': [True, False, True, True, False, False],
        'bools2': [True, False, True, True, True, False],
        'bools_dup': [True, False, True, True, False, False],
        'floats1': [1., 2., 3., 3., 2., 1.],
        'floats2': [1., 3., 3., 5., 2., 1.],
        'floats_dup': [1., 2., 3., 3., 2., 1.],
        'floats_withnan1': [1., 2., 3., 3., np.nan, 1.],
        'floats_withnan2': [1., np.nan, 3., 3., np.nan, 1.],
        'floats_withnan3': [1., 2., 3., 3., 3., 1.],
        'floats_withnan_dup': [1., 2., 3., 3., np.nan, 1.],
        'integers1': [1, 2, 3, 3, 2, 1],
        'integers2': [1, 2, 2, 2, 2, 1],
        'integers_dup': [1, 2, 3, 3, 2, 1],
    }

    frame = DataFrame(data)
    frame = frame.sort_index(axis=1)

    # double the row count `size` times
    for i in range(size):
        frame = pd.concat([frame, frame]).reset_index(drop=True)

    return func(frame)

Using timeit:

In [71]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=3)
10 loops, best of 3: 46.1 ms per loop

In [72]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=4)
10 loops, best of 3: 80.2 ms per loop

In [73]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=5)
10 loops, best of 3: 149 ms per loop

In [74]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=6)
1 loops, best of 3: 336 ms per loop

In [75]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=7)
1 loops, best of 3: 1.19 s per loop

In [76]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=8)
1 loops, best of 3: 6.35 s per loop

In [81]: %timeit profile_duplicate_columns(duplicate_columns, size=3)
10 loops, best of 3: 21.3 ms per loop

In [82]: %timeit profile_duplicate_columns(duplicate_columns, size=4)
10 loops, best of 3: 23.8 ms per loop

In [83]: %timeit profile_duplicate_columns(duplicate_columns, size=5)
10 loops, best of 3: 26.5 ms per loop

In [84]: %timeit profile_duplicate_columns(duplicate_columns, size=6)
10 loops, best of 3: 28.8 ms per loop

In [85]: %timeit profile_duplicate_columns(duplicate_columns, size=7)
10 loops, best of 3: 31.5 ms per loop

In [86]: %timeit profile_duplicate_columns(duplicate_columns, size=8)
10 loops, best of 3: 34.1 ms per loop

In [87]: %timeit profile_duplicate_columns(duplicate_columns, size=20)
1 loops, best of 3: 4.75 s per loop

121onto commented 8 years ago

Not sure how to profile memory usage but would be open to suggestions.
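
(One option would be the third-party memory_profiler package: its %memit IPython magic reports peak memory for a statement, analogous to %timeit:)

In [1]: %load_ext memory_profiler

In [2]: %memit profile_duplicate_columns(duplicate_columns, size=8)

In [3]: %memit profile_duplicate_columns(transpose_duplicate_columns, size=8)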

jreback commented 8 years ago

still not clear what you are actually proposing

what you are timing and what .drop_duplicates does are not the same thing

pls give a simple example

121onto commented 8 years ago

Ah, perhaps my naming convention is not the best. Maybe I should call it redundant_columns?

The functions are the same except each implements a distinct convention for picking out redundant columns: given a data frame with two identical columns 'first' and 'second', duplicate_columns will return 'first' while transpose_duplicate_columns will return 'second'.

duplicate_columns solves a practical problem. Suppose my client hands me a data set that was created by joining several tables. They may be sloppy about joining, in which case the data will contain redundant columns with potentially different labels. duplicate_columns solves this problem by returning a list of the redundant columns.

For example:

def build_frame():
    data = {
        'objects1': ['A', 'B', 'C', 'C', 'B', 'A'],
        'objects2': ['A', 'B', 'C', 'B', 'B', 'A'],
        'objects_dup1': ['A', 'B', 'C', 'C', 'B', 'A'],
        'objects_dup2': ['A', 'B', 'C', 'C', 'B', 'A'],
    }

    frame = DataFrame(data)
    return frame.sort_index(axis=1)

frame = build_frame()

frame has several redundant columns ('objects1', 'objects_dup1', and 'objects_dup2' are identical). duplicate_columns finds these, returning a list of redundant columns and treating the last instance as non-redundant. transpose_duplicate_columns does the same thing, except it treats the first instance as non-redundant:

In [10]: frame = build_frame()

In [11]: frame
Out[11]:
  objects1 objects2 objects_dup1 objects_dup2
0        A        A            A            A
1        B        B            B            B
2        C        C            C            C
3        C        B            C            C
4        B        B            B            B
5        A        A            A            A

In [13]: duplicate_columns(frame)
Out[13]: ['objects1', 'objects_dup1']

In [14]: transpose_duplicate_columns(frame)
Out[14]: ['objects_dup1', 'objects_dup2']

jreback commented 8 years ago

@121onto

so you are taking advantage of segregated dtypes, and using array_equivalent, which is a quick way of determining equality, whereas .T.duplicated() needs to factorize things first.

Ok, so this would be ok as an axis=1 parameter for .duplicated() (and equivalently for .drop_duplicates()). Note that care must be taken with processing of the keep parameter. Further, this should return a boolean Series (like the current .duplicated()). Also, you don't need to group by dtypes; simply using array_equivalent will work.
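
Concretely, the proposed (not yet implemented) API would look something like:

# hypothetical once an axis argument exists; neither call works today
dup_mask = df.duplicated(axis=1, keep='first')       # boolean Series over the columns
deduped = df.drop_duplicates(axis=1, keep='first')   # drops later duplicate columns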

Further, this would need some asv perf tests.
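
A minimal asv benchmark sketch along those lines (class and method names are illustrative, following the conventions of the pandas asv_bench suite):

import numpy as np
import pandas as pd

class DuplicatedAxis1:
    def setup(self):
        # wide frame with no duplicate columns: the common case to protect
        self.df = pd.DataFrame(np.random.randn(10000, 100))

    def time_duplicated_axis1(self):
        self.df.duplicated(axis=1)  # hypothetical axis argument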

pls submit a pull-request

bolau commented 8 years ago

Do you think it makes sense to compute and compare column sums before doing array_equivalent for data frames with many numeric columns?
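
Something like this sketch, restricted to the numeric columns (the array_equivalent import path here is the later internal location, and all names are illustrative):

import numpy as np
from pandas.core.dtypes.missing import array_equivalent  # internal pandas API

def duplicate_columns_prefiltered(frame):
    # bucket numeric columns by their cheap-to-compute sum, so that the
    # quadratic array_equivalent pass only runs within each bucket
    numeric = frame.select_dtypes(include=[np.number])
    buckets = numeric.columns.to_series().groupby(numeric.sum()).groups
    dups = []
    for _, cols in buckets.items():
        cols = list(cols)
        for i in range(len(cols)):
            for j in range(i + 1, len(cols)):
                if array_equivalent(numeric[cols[i]].values,
                                    numeric[cols[j]].values):
                    dups.append(cols[i])
                    break
    return dups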

ddofer commented 6 years ago

May I ask if there's any chance of this being implemented? (This thread is actually one of the top results when googling this issue / removing columns by duplicated values.) Thanks!

jreback commented 6 years ago

@ddofer this would take a community pull request. It wouldn't be very difficult, as you can start with the code above. The tricky parts would be a) writing the benchmarks (done somewhat above), and b) ensuring it only minimally affects the common case (no dups).

TKlerx commented 6 years ago

Hi all,

any chance of resurrecting this issue? What do you mean by b)? Since array_equivalent is marked as deprecated, will there be a supported way to detect duplicate columns? For me, this is a very common use case.
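
(For what it's worth, the function seems to have moved rather than disappeared; in recent pandas it is importable as an internal API that can change without notice:)

from pandas.core.dtypes.missing import array_equivalent  # internal, may change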

jreback commented 6 years ago

this is an open issue - would consider a pull request to implement

bking124 commented 2 months ago

Hi all,

I recently stumbled upon this issue, as this sort of functionality would have been very useful for a recent problem of mine. Is there still interest in this enhancement? To me, it seems like it could be a fairly simple addition: just iterate over rows instead of columns in .duplicated() (still using factorize()). But perhaps array_equivalent, as originally suggested, may be faster; I am not very familiar with that function.
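
A rough sketch of that row-wise factorize idea (the function name is a placeholder, only keep='first' semantics are shown, and a non-empty frame is assumed):

import numpy as np
import pandas as pd

def duplicated_columns(frame):
    # factorize each row so that mixed dtypes become integer codes; NaN is
    # assigned the sentinel -1, so missing values in matching positions
    # compare equal (like array_equivalent)
    codes = np.vstack([pd.factorize(frame.iloc[i])[0]
                       for i in range(len(frame))])
    # a column is a duplicate if its full code vector appeared earlier
    _, first_idx = np.unique(codes, axis=1, return_index=True)
    mask = np.ones(frame.shape[1], dtype=bool)
    mask[first_idx] = False  # first occurrences are not duplicates
    return pd.Series(mask, index=frame.columns)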

I have never contributed to pandas, but I would be willing to learn the contribution process and open a PR if there is still interest. Several similar issues have been raised over the years, so at least a few people have wanted this column-deduplication functionality.