121onto opened this issue 8 years ago
This one is more memory efficient:
def duplicate_columns(frame):
    # Group column labels by dtype so only same-typed columns are compared.
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            iv = vs.iloc[:, i].tolist()
            for j in range(i + 1, lcs):
                jv = vs.iloc[:, j].tolist()
                # plain list equality; note this treats NaN != NaN
                if iv == jv:
                    dups.append(cs[i])
                    break
    return dups
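For example, given a small hypothetical frame where 'a' and 'b' hold identical values:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [1, 2, 4]})
    print(duplicate_columns(df))  # ['a'], since 'a' duplicates 'b'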
@121onto what is the purpose of this issue?
you would need to benchmark this, with potentially mixed data, pass tests, etc.
Thanks @jreback. Here is the modified code with a test:
from __future__ import print_function

import numpy as np
from pandas import DataFrame
from pandas.core.dtypes.missing import array_equivalent  # pandas.core.common in older versions


def duplicate_columns(frame):
    # Group column labels by dtype so only same-typed columns are compared.
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            ia = vs.iloc[:, i].values
            for j in range(i + 1, lcs):
                ja = vs.iloc[:, j].values
                # array_equivalent treats NaNs in matching positions as equal
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break
    return dups
def test_duplicate_columns():
    data = {
        'objects1': ['A', 'B', 'C', 'C', 'B', 'A'],
        'objects2': ['A', 'B', 'C', 'B', 'B', 'A'],
        'objects_dup': ['A', 'B', 'C', 'C', 'B', 'A'],
        'bools1': [True, False, True, True, False, False],
        'bools2': [True, False, True, True, True, False],
        'bools_dup': [True, False, True, True, False, False],
        'floats1': [1., 2., 3., 3., 2., 1.],
        'floats2': [1., 3., 3., 5., 2., 1.],
        'floats_dup': [1., 2., 3., 3., 2., 1.],
        'floats_withnan1': [1., 2., 3., 3., np.nan, 1.],
        'floats_withnan2': [1., np.nan, 3., 3., np.nan, 1.],
        'floats_withnan3': [1., 2., 3., 3., 3., 1.],
        'floats_withnan_dup': [1., 2., 3., 3., np.nan, 1.],
        'integers1': [1, 2, 3, 3, 2, 1],
        'integers2': [1, 2, 2, 2, 2, 1],
        'integers_dup': [1, 2, 3, 3, 2, 1],
    }
    frame = DataFrame(data)
    frame = frame.sort_index(axis=1)  # frame.sort(axis=1) in older pandas
    dups = duplicate_columns(frame)
    expected = [
        'objects1',
        'bools1',
        'floats1',
        'floats_withnan1',
        'integers1',
    ]
    assert sorted(dups) == sorted(expected)
so what are the timings / memory figures?
A timing comparison against a .T-based implementation:
def transpose_duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        # rows of the transposed block are the original columns
        tframe = frame[v].T
        dups.extend(list(tframe.loc[tframe.duplicated(), :].index))
    return dups
import pandas as pd


def profile_duplicate_columns(func, size=1):
    data = {
        'objects1': ['A', 'B', 'C', 'C', 'B', 'A'],
        'objects2': ['A', 'B', 'C', 'B', 'B', 'A'],
        'objects_dup': ['A', 'B', 'C', 'C', 'B', 'A'],
        'bools1': [True, False, True, True, False, False],
        'bools2': [True, False, True, True, True, False],
        'bools_dup': [True, False, True, True, False, False],
        'floats1': [1., 2., 3., 3., 2., 1.],
        'floats2': [1., 3., 3., 5., 2., 1.],
        'floats_dup': [1., 2., 3., 3., 2., 1.],
        'floats_withnan1': [1., 2., 3., 3., np.nan, 1.],
        'floats_withnan2': [1., np.nan, 3., 3., np.nan, 1.],
        'floats_withnan3': [1., 2., 3., 3., 3., 1.],
        'floats_withnan_dup': [1., 2., 3., 3., np.nan, 1.],
        'integers1': [1, 2, 3, 3, 2, 1],
        'integers2': [1, 2, 2, 2, 2, 1],
        'integers_dup': [1, 2, 3, 3, 2, 1],
    }
    frame = DataFrame(data)
    frame = frame.sort_index(axis=1)
    # double the row count `size` times before handing the frame to func
    for i in range(size):
        frame = pd.concat([frame, frame]).reset_index(drop=True)
    return func(frame)
Using timeit:
In [71]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=3)
10 loops, best of 3: 46.1 ms per loop
In [72]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=4)
10 loops, best of 3: 80.2 ms per loop
In [73]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=5)
10 loops, best of 3: 149 ms per loop
In [74]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=6)
1 loops, best of 3: 336 ms per loop
In [75]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=7)
1 loops, best of 3: 1.19 s per loop
In [76]: %timeit profile_duplicate_columns(transpose_duplicate_columns, size=8)
1 loops, best of 3: 6.35 s per loop
In [81]: %timeit profile_duplicate_columns(duplicate_columns, size=3)
10 loops, best of 3: 21.3 ms per loop
In [82]: %timeit profile_duplicate_columns(duplicate_columns, size=4)
10 loops, best of 3: 23.8 ms per loop
In [83]: %timeit profile_duplicate_columns(duplicate_columns, size=5)
10 loops, best of 3: 26.5 ms per loop
In [84]: %timeit profile_duplicate_columns(duplicate_columns, size=6)
10 loops, best of 3: 28.8 ms per loop
In [85]: %timeit profile_duplicate_columns(duplicate_columns, size=7)
10 loops, best of 3: 31.5 ms per loop
In [86]: %timeit profile_duplicate_columns(duplicate_columns, size=8)
10 loops, best of 3: 34.1 ms per loop
In [87]: %timeit profile_duplicate_columns(duplicate_columns, size=20)
1 loops, best of 3: 4.75 s per loop
Not sure how to profile memory usage but would be open to suggestions.
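One option (assuming the third-party memory_profiler package is installed) is its IPython %memit magic, run against the same harness as the timings above:

    %load_ext memory_profiler
    %memit profile_duplicate_columns(duplicate_columns, size=8)
    %memit profile_duplicate_columns(transpose_duplicate_columns, size=8)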
still not clear what you are actually proposing
what you are timing and what .drop_duplicates() does are not the same thing
pls give a simple example
Ah, perhaps my naming convention is not the best. Maybe I should call it redundant_columns?
The functions are the same except each implements a distinct convention for picking out redundant columns: given a data frame with two identical columns 'first' and 'second', duplicate_columns will return 'first' while transpose_duplicate_columns will return 'second'.
duplicate_columns solves a practical problem. Suppose my client hands me a data set that was created by joining several tables. They may be sloppy about joining, in which case the data will contain redundant columns with potentially different labels. duplicate_columns solves this problem by returning a list of the redundant columns.
For example:
def build_frame():
    data = {
        'objects1': ['A', 'B', 'C', 'C', 'B', 'A'],
        'objects2': ['A', 'B', 'C', 'B', 'B', 'A'],
        'objects_dup1': ['A', 'B', 'C', 'C', 'B', 'A'],
        'objects_dup2': ['A', 'B', 'C', 'C', 'B', 'A'],
    }
    frame = DataFrame(data)
    return frame.sort_index(axis=1)  # frame.sort(axis=1) in older pandas
frame = build_frame()
frame has several redundant column pairs ('objects1', 'objects_dup1', and 'objects_dup2' are identical). duplicate_columns finds these pairs and returns a list of redundant columns, treating the last instance as non-redundant. transpose_duplicate_columns does the same thing, except it treats the first instance as non-redundant:
In [10]: frame = build_frame()
In [11]: frame
Out[11]:
objects1 objects2 objects_dup1 objects_dup2
0 A A A A
1 B B B B
2 C C C C
3 C B C C
4 B B B B
5 A A A A
In [13]: duplicate_columns(frame)
Out[13]: ['objects1', 'objects_dup1']
In [14]: transpose_duplicate_columns(frame)
Out[14]: ['objects_dup1', 'objects_dup2']
@121onto
so you are taking advantage of segregated dtypes, and using array_equivalent, which is a quick way of determining equality, whereas .T.duplicated() needs to factorize things first.
Ok, so this would be ok as an axis=1 parameter for .duplicated() (and equivalently for .drop_duplicates()). Note that care must be taken with processing of the keep parameter. Further, this should return a boolean Series (like the current .duplicated()). Also, you don't need to groupby by dtypes; simply using array_equivalent will work.
Further, this would need some asv perf tests.
pls submit a pull-request
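A rough sketch of the API shape described above (the helper name and details are illustrative, not pandas API): a column-wise duplicated() that returns a boolean Series indexed by column label and honors keep='first'/'last':

    import numpy as np
    import pandas as pd
    from pandas.core.dtypes.missing import array_equivalent  # pandas internal

    def duplicated_columns(frame, keep='first'):
        cols = frame.columns
        n = len(cols)
        dup = np.zeros(n, dtype=bool)
        # scan direction decides which occurrence survives; keep=False omitted
        order = range(n) if keep == 'first' else range(n - 1, -1, -1)
        kept = []  # positions of columns retained so far
        for i in order:
            a = frame.iloc[:, i].values
            for j in kept:
                b = frame.iloc[:, j].values
                if a.dtype == b.dtype and array_equivalent(a, b):
                    dup[i] = True
                    break
            else:
                kept.append(i)
        return pd.Series(dup, index=cols)

frame.loc[:, ~duplicated_columns(frame)] would then drop the redundant columns, mirroring what an axis=1 .drop_duplicates() could do.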
Do you think it makes sense to compute and compare column sums before doing array_equivalent for data frames with many numeric columns?
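For what it's worth, a sketch of that prefilter idea (illustrative only): compare cheap NaN-ignoring column sums first, and fall back to the exact comparison only when the sums agree:

    import numpy as np
    from pandas.core.dtypes.missing import array_equivalent  # pandas internal

    def numeric_duplicate_columns(frame):
        num = frame.select_dtypes(include=[np.number])
        sums = num.sum(skipna=True)  # cheap per-column fingerprint
        cols, dups = num.columns, []
        for i in range(len(cols)):
            for j in range(i + 1, len(cols)):
                # equal sums are necessary but not sufficient, so confirm exactly
                if sums.iloc[i] == sums.iloc[j] and array_equivalent(
                        num.iloc[:, i].values, num.iloc[:, j].values):
                    dups.append(cols[i])
                    break
        return dups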
May I ask if there's any chance of this being implemented? (This thread is actually one of the top results when googling this issue / removing columns by duplicated values.) Thanks!
@ddofer this would take a community pull request. It would not be very difficult, as you can start with the code above. The tricky parts would be a) writing the benchmarks (done somewhat above), and b) assuring it only minimally affects the common case (no dups).
Hi all,
is there any chance of resurrecting this issue? What do you mean by b)? Since array_equivalent is marked as deprecated, will there be any implementation of duplicate columns? For me, this is a very common use case.
this is an open issue - would consider a pull request to implement
Hi all,
I recently stumbled upon this issue, as this sort of functionality would have been very useful for a recent problem of mine. Is there still interest in this enhancement? To me, it seems that it could be a fairly simple addition: just iterate over rows instead of columns in .duplicated() (still using factorize()). But perhaps array_equivalent as originally suggested may be faster; I am not very familiar with that function.
I have never contributed to pandas, but I would be willing to learn the contribution process and put together a PR if there is still interest. I see that several other similar issues have been raised over the years, so at least a few people have wanted this column-deduplication functionality.
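For concreteness, a rough sketch of the factorize()-based idea (illustrative, not pandas API): fingerprint each column by its dtype, factorized codes, and uniques, so that identical columns, NaNs included, collide on the same key:

    import pandas as pd

    def duplicated_columns_factorize(frame):
        seen = {}
        dups = []
        for col in frame.columns:
            codes, uniques = pd.factorize(frame[col])  # NaN maps to the -1 sentinel
            key = (frame[col].dtype, tuple(codes), tuple(uniques))
            if key in seen:
                dups.append(col)  # keeps the first occurrence, flags later ones
            else:
                seen[key] = col
        return dups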
Here is sample code that finds duplicate columns in a DataFrame based on their values (useful for cleaning data):
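A minimal sketch of one such approach (the helper name is illustrative; pd.util.hash_pandas_object is a public pandas utility): fingerprint each column by its row hashes, then confirm candidates with Series.equals, which treats aligned NaNs as equal:

    from collections import defaultdict

    import pandas as pd

    def duplicate_columns_by_hash(frame):
        buckets = defaultdict(list)  # hash key -> columns seen with that key
        dups = []
        for col in frame.columns:
            # uint64 per-row hashes; cheap relative to exhaustive pairwise comparison
            key = tuple(pd.util.hash_pandas_object(frame[col], index=False))
            if any(frame[col].equals(frame[other]) for other in buckets[key]):
                dups.append(col)  # keeps the first occurrence, flags later ones
            else:
                buckets[key].append(col)
        return dups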
I've seen others suggest something like df.T.drop_duplicates().T. However, transposing is a bad idea when working with large DataFrames. I would add a pull request, but I'm not sure I even know what that means.