ENH: Improve Code Quality in pandas/core/reshape Module

Summary

Refactor the pandas/core/reshape module to improve code quality by reducing duplication, replacing hard-coded values, and simplifying complex conditionals.

Problem Description

The pandas/core/reshape module implements key reshaping functions (pivot, melt, and unstack) used in data manipulation workflows. A review of pivot.py and melt.py reveals a couple of areas where code quality could be improved:

Nested Conditionals:

In melt.py, nested conditionals add complexity, making the code harder to read and maintain.
Suggestion: Refactor these conditionals into smaller, more modular functions.

Hard-Coded Values:

In pivot.py, hard-coded strings (e.g., "All" for margins) reduce flexibility.
Suggestion: Replace hard-coded values with constants for maintainability.

Relevant File

melt.py
pivot.py

Proposed Solution

Refactor Nested Conditionals in melt.py

Nested Conditional in ensure_list_vars()

Before:

def ensure_list_vars(arg_vars, variable: str, columns) -> list:
if arg_vars is not None:
if not is_list_like(arg_vars):
    return [arg_vars]
elif isinstance(columns, MultiIndex) and not isinstance(arg_vars, list):
    raise ValueError(
        f"{variable} must be a list of tuples when columns are a MultiIndex"
    )
else:
    return list(arg_vars)
else:
return []

After:


def ensure_list_vars(arg_vars, variable: str, columns) -> list:
if arg_vars is None:
return []

if not is_list_like(arg_vars): return [arg_vars]

if isinstance(columns, MultiIndex) and not isinstance(arg_vars, list): raise ValueError( f"{variable} must be a list of tuples when columns are a MultiIndex" )

return list(arg_vars)

Nested Conditional in melt() for id_vars:

Before:

if id_vars or value_vars:
if col_level is not None:
level = frame.columns.get_level_values(col_level)
else:
level = frame.columns
labels = id_vars + value_vars
idx = level.get_indexer_for(labels)
missing = idx == -1
if missing.any():
missing_labels = [
    lab for lab, not_found in zip(labels, missing) if not_found
]
raise KeyError(
    "The following id_vars or value_vars are not present in "
    f"the DataFrame: {missing_labels}"
)
if value_vars_was_not_none:
frame = frame.iloc[:, algos.unique(idx)]
else:
frame = frame.copy(deep=False)
else:
frame = frame.copy(deep=False)

After:


def validate_and_get_level(frame, id_vars, value_vars, col_level):
level = frame.columns.get_level_values(col_level) if col_level is not None else frame.columns
labels = id_vars + value_vars
idx = level.get_indexer_for(labels)
missing = idx == -1
if missing.any():
missing_labels = [lab for lab, not_found in zip(labels, missing) if not_found]
raise KeyError(
    "The following id_vars or value_vars are not present in "
    f"the DataFrame: {missing_labels}"
)
return idx

if id_vars or value_vars: idx = validate_and_get_level(frame, id_vars, value_vars, col_level) if value_vars_was_not_none: frame = frame.iloc[:, algos.unique(idx)] else: frame = frame.copy(deep=False)

Nested Conditionals for Setting var_name in melt():

Before:

if var_name is None:
if isinstance(frame.columns, MultiIndex):
if len(frame.columns.names) == len(set(frame.columns.names)):
    var_name = frame.columns.names
else:
    var_name = [f"variable_{i}" for i in range(len(frame.columns.names))]
else:
var_name = [
    frame.columns.name if frame.columns.name is not None else "variable"
]
elif is_list_like(var_name):
if isinstance(frame.columns, MultiIndex):
if is_iterator(var_name):
    var_name = list(var_name)
if len(var_name) > len(frame.columns):
    raise ValueError(
        f"{var_name=} has {len(var_name)} items, "
        f"but the dataframe columns only have {len(frame.columns)} levels."
    )
else:
raise ValueError(f"{var_name=} must be a scalar.")
else:
var_name = [var_name]

After:


def determine_var_name(frame, var_name):
if var_name is None:
return _default_var_name(frame)
if is_list_like(var_name):
_validate_list_var_name(var_name, frame)
return list(var_name)
return [var_name]

def _default_varname(frame): if isinstance(frame.columns, MultiIndex): if len(frame.columns.names) == len(set(frame.columns.names)): return frame.columns.names return [f"variable{i}" for i in range(len(frame.columns.names))] return [frame.columns.name or "variable"]

def _validate_list_var_name(var_name, frame): if isinstance(frame.columns, MultiIndex): if is_iterator(var_name): var_name = list(var_name) if len(var_name) > len(frame.columns): raise ValueError( f"{var_name=} has {len(var_name)} items, " f"but the dataframe columns only have {len(frame.columns)} levels." ) else: raise ValueError(f"{var_name=} must be a scalar.")

var_name = determine_var_name(frame, var_name)

Benefits:
- Improves readability: Simplifies the main function, making the logic clearer and easier to follow.
- Makes the logic easier to test and maintain: Enables independent testing of each helper function, ensuring robust behavior.
- Separation of concerns: Each helper function is now responsible for a single, well-defined task, aligning with the principle of single responsibility.

Replace Hard-Coded Values in pivot.py

Before:

# Hard-coded string for margins
margins_name: Hashable = "All"

After:

# Define a constant for the hard-coded value
MARGIN_NAME = "All"

# Use the constant in the code
margins_name: Hashable = MARGIN_NAME:

Benefits:
- Makes the code more readable and maintainable.
- Centralizes the value so it can be reused or modified easily.

Testing

Unit Testing Helper Functions: Write focused tests for each new helper function to validate their behavior under expected, edge, and erroneous inputs. For example:

Ensure validate_and_get_level() correctly identifies missing variables and raises KeyError.
Test determine_var_name() with var_name=None, scalar inputs, and multi-level columns.

Regression Testing Parent Functions: Run all pre-existing tests for the parent functions (e.g., melt()) to confirm they maintain their functionality after the refactor.

Edge Cases: Include additional tests for edge scenarios, such as:

Empty id_vars or value_vars.
DataFrames with unusual column configurations like MultiIndex or missing names.

Labels

ENH
Code Quality

Compliance with Contributing Guide

Focus: The issue is specific and addresses code quality improvements without scope creep.
Clarity: Includes actionable suggestions and a clear implementation path.

pandas-dev / pandas