hivdb / covid-drdb-payload

A relational database of SARS-CoV-2 resistance data.
Creative Commons Attribution Share Alike 4.0 International

[BUG] Some CSVs have `control_iso_nam` column instead of `control_iso_name` in tables/ref_isolate_pairs #1043

Closed by256 closed 4 months ago

by256 commented 4 months ago

Describe the error

Some of the CSV files in tables/ref_isolate_pairs have a typo in one of the column names.

Instead of control_iso_name, they have control_iso_nam.

To locate the error

The error is present in seven files, which are listed in the script output below.

You can reproduce this with the following script:

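from pathlib import Path

import pandas as pd

# Column names every file in tables/ref_isolate_pairs is expected to have.
EXPECTED_COLUMNS = {"ref_name", "control_iso_name", "iso_name"}

isolate_pairs_path = Path("covid-drdb-payload/tables/ref_isolate_pairs")
# Only look at CSV files so stray non-CSV entries are not parsed.
for path in isolate_pairs_path.glob("*.csv"):
    # nrows=0 reads just the header row, which is all this check needs.
    pairs_df = pd.read_csv(path, nrows=0)
    if set(pairs_df.columns) != EXPECTED_COLUMNS:
        print(f"{path.stem}: {list(pairs_df.columns)}")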

which should output:

uraki22-pair: ['ref_name', 'control_iso_nam', 'iso_name']
uriu22-pair: ['ref_name', 'control_iso_nam', 'iso_name']
uriu23-pair: ['ref_name', 'control_iso_nam', 'iso_name']
ueno22-pair: ['ref_name', 'control_iso_nam', 'iso_name']
uriu23b-pair: ['ref_name', 'control_iso_nam', 'iso_name']
uriu21-pair: ['ref_name', 'control_iso_nam', 'iso_name']
turner21-pair: ['ref_name', 'control_iso_nam', 'iso_name']

Expected behavior

The columns should be named control_iso_name instead of control_iso_nam.
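One possible way to apply the rename is sketched below. It is not the maintainers' fix. It assumes the files are UTF-8 and that the typo only occurs as a whole column name in the first line. The helper names `fix_header_line` and `fix_file` are hypothetical. Only the header line is rewritten, so data rows and line endings are left untouched.

```python
import csv
import io
from pathlib import Path

TYPO = "control_iso_nam"      # misspelled column name reported in this issue
CORRECT = "control_iso_name"  # expected column name


def fix_header_line(header_line: str) -> str:
    """Return the header line with the misspelled column renamed.

    Only an exact column match is replaced; a header without the typo
    is returned unchanged.
    """
    columns = next(csv.reader([header_line]), [])
    if TYPO not in columns:
        return header_line
    fixed = [CORRECT if col == TYPO else col for col in columns]
    out = io.StringIO()
    csv.writer(out).writerow(fixed)
    # csv.writer appends its own line terminator; drop it here.
    return out.getvalue().rstrip("\r\n")


def fix_file(path: Path) -> bool:
    """Rewrite the header of one CSV in place; return True if it changed."""
    # newline="" keeps the file's original line endings as-is.
    with path.open(encoding="utf-8", newline="") as f:
        text = f.read()
    header, sep, rest = text.partition("\n")
    eol = "\r" if header.endswith("\r") else ""
    new_header = fix_header_line(header.rstrip("\r")) + eol
    if new_header == header:
        return False
    with path.open("w", encoding="utf-8", newline="") as f:
        f.write(new_header + sep + rest)
    return True
```

To use it, call `fix_file(path)` for each `path` in `Path("covid-drdb-payload/tables/ref_isolate_pairs").glob("*.csv")`, then re-run the reproduction script above to confirm that nothing is printed.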

KaimingTao commented 4 months ago

@by256 I've been using

for file in *.csv; do
    head -n 1 "$file"
done | sort | uniq > unique_patterns.txt

to find and fix the files.
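The shell loop above lists the distinct header lines but not which file each one came from. A rough Python equivalent that keeps the file names is sketched here. The function name `group_by_header` is hypothetical, and the sketch assumes UTF-8 files in a flat directory.

```python
from collections import defaultdict
from pathlib import Path


def group_by_header(directory):
    """Map each distinct header line to the CSV file names that use it."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(encoding="utf-8") as f:
            # Read only the first line and strip its line ending.
            header = f.readline().rstrip("\r\n")
        groups[header].append(path.name)
    return dict(groups)
```

Calling `group_by_header("covid-drdb-payload/tables/ref_isolate_pairs")` and printing the result should show one entry per header pattern, so any group whose key contains `control_iso_nam` without the trailing `e` is the set of files to fix.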