pepkit / peppy

Project metadata manager for PEPs in Python
https://pep.databio.org/peppy
BSD 2-Clause "Simplified" License

`Project.from_pandas()` creates duplicate `sample_name` column in sample table #424

Closed nleroy917 closed 1 year ago

nleroy917 commented 1 year ago

Issue

When instantiating a peppy.Project() object using the from_pandas method, a duplicate sample_name column is created in the resultant pandas.DataFrame object.

Expected Behavior

I would expect only one sample_name column to be present when the table is instantiated.

Steps to reproduce

  1. Create fresh virtual env:

    python -m venv .venv && source .venv/bin/activate
    pip install --upgrade pip && pip install peppy
  2. Create the sample table file:

    cat sample_table.csv
    sample_name,protocol,file
    frog_1,anySampleType,data/frog1_data.txt
    frog_2,anySampleType,data/frog2_data.txt
  3. Load peppy/pandas and create new Project() object using from_pandas:

    import peppy
    import pandas as pd

    df = pd.read_csv("sample_table.csv")
    p = peppy.Project().from_pandas(df)
    p.sample_table.to_csv("sample_table_processed.csv")
  4. Observe duplicate sample_name column:

    cat sample_table_processed.csv
    sample_name,sample_name,protocol,file
    frog_1,frog_1,anySampleType,data/frog1_data.txt
    frog_2,frog_2,anySampleType,data/frog2_data.txt

khoroshevskyi commented 1 year ago

After investigating peppy: this is not a problem with the Project.from_pandas() function, it is how pandas works. The duplicated sample_name you mentioned is not actually a duplication. One sample_name is the name of the index (row names), and the second is the actual column.

To solve this, just pass index=False when writing the file: p.sample_table.to_csv("sample_table_processed.csv", index=False)
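
For anyone who wants to see the effect without peppy installed, here is a minimal pandas-only sketch. It assumes the sample table behaves like a DataFrame whose index is named sample_name while sample_name is also kept as a regular column, as described above; the `set_index(..., drop=False)` call is only a stand-in for illustration and is not necessarily how peppy builds the table internally.

```python
import io

import pandas as pd

# Same content as sample_table.csv from the issue, inlined for convenience.
csv_text = """sample_name,protocol,file
frog_1,anySampleType,data/frog1_data.txt
frog_2,anySampleType,data/frog2_data.txt
"""

df = pd.read_csv(io.StringIO(csv_text))

# Assumption: mimic the sample table by using sample_name as the index
# while also keeping it as an ordinary column.
table = df.set_index("sample_name", drop=False)

# Default behaviour: the index is written as the first CSV column,
# labelled with the index name, followed by all regular columns.
with_index = table.to_csv()

# index=False skips the index, so only the regular columns are written.
without_index = table.to_csv(index=False)

print(with_index.splitlines()[0])     # header row including the index
print(without_index.splitlines()[0])  # header row without the index
```

The first header contains sample_name twice (index name plus column), the second only once, which matches the behaviour reported in this issue.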