jekwatt / idiomatic_pandas

Tips and tricks for the most common data handling task with pandas.
0 stars 0 forks source link

Comparing Value Difference Between 2 CSV Files using pandas #22

Open jekwatt opened 2 years ago

jekwatt commented 2 years ago

To find rows with discrepancies:

Method 1: isin function/method

result1 = df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
print(result1)

Method 2: merge

# indicator parameter will insert a new field ("_merge")
result2 = df1.merge(df3, indicator=True, how="outer").loc[lambda v: v["_merge"] != "both"]
print(result2)

result3 = df1.merge(df3, indicator=True, how="outer")
result3[test_merge._merge != "both"]
jekwatt commented 2 years ago

Answers from James:

from pandas import DataFrame, date_range, MultiIndex, merge
from numpy.random import default_rng
from string import ascii_lowercase

rng = default_rng(0)

idx = date_range('2000-01-01', periods=100)
lft = DataFrame({
'number': rng.integers(3, size=len(idx)),
'letter': rng.choice([*ascii_lowercase[:3]], size=len(idx)),
})
rgt = DataFrame({
'number': rng.integers(3, size=len(idx)),
'letter': rng.choice([*ascii_lowercase[:3]], size=len(idx)),
})

cols = lft.columns.intersection(rgt.columns)
diff_idx = (lft[cols] != rgt[cols]).all(axis='columns')[lambda s: s].index
same_idx = (lft[cols] == rgt[cols]).all(axis='columns')[lambda s: s].index
same = merge(
lft[cols].loc[same_idx].set_axis(MultiIndex.from_product([['lft'], lft.columns]), axis='columns'),
rgt[cols].loc[same_idx].set_axis(MultiIndex.from_product([['rgt'], rgt.columns]), axis='columns'),
left_index=True, right_index=True,
)
diff = merge(
lft[cols].loc[diff_idx].set_axis(MultiIndex.from_product([['lft'], lft.columns]), axis='columns'),
rgt[cols].loc[diff_idx].set_axis(MultiIndex.from_product([['rgt'], rgt.columns]), axis='columns'),
left_index=True, right_index=True,
)

print(
same,
diff,
sep='\n',
)

lft and rgt are DataFrames with some commonly named columns, where the row values, aligned on a common index (a time index in this case,) show row-level differences.

I want to see what's the same, what's different.