malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
13 stars 23 forks source link

Make the released sequence QC summary stats available #547

Closed leehart closed 4 weeks ago

leehart commented 1 month ago

@alimanfoo @ahernank Regarding testing of the sequence QC summary stats data, which contain columns that contain contig identifiers, e.g. mean_cov_2L, which not only contain an uppercase letter, controversially, but also have a column order which is not the same as fixture.config['CONTIGS'].

I'm aware that insertion order in dictionary keys is guaranteed since Python 3.7, but I'm wondering whether the intention here was to actually assert that the order remained expected, rather than merely the set of columns.

def validate_metadata(df, expected_columns):
    # Check column names.
    expected_column_names = list(expected_columns.keys())
    assert df.columns.to_list() == expected_column_names

I was tempted to convert dicts like expected_columns to OrderedDict just to be explicit, but they would all need to be redefined as an array of tuples.

Are we aware of any reason that we need to check the specific order of columns in our sample metadata files? Would it be safe to change this test to compare sets rather than lists, and allow different column orders?

leehart commented 1 month ago

Conveniently, sorted(fixture.config['CONTIGS']) provides the matching order for both Ag3 and Af1, so I'll use that and keep everything else as it is.