Not something that needs any discussion now, but might be worth talking at some point about how the population definitions are stored. Currently they're stored as either a YAML or CSV file that maps population IDs like "ANG_1_coluzzii_2009" to sets of sample IDs. Some thoughts:
Re the population IDs...
Suggest sticking with standard two-letter country codes (ISO 3166-1 alpha-2) like "AO" rather than custom country codes like "ANG".
Also it might be convenient to use some kind of site name or abbreviation, rather than a number, just to make it easier to remember. So, e.g., "ANG_1_coluzzii_2009" might become "AO_luanda_2009_coluzzii".
It would be useful to have a human-readable label for each population, and for this to be included in the population definitions somehow. These can then be used in tables and labelling plots etc. E.g., the label for "ANG_1_coluzzii_2009" would probably be something like "Angola, Luanda, 2009, An. coluzzii".
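To make the ID and label suggestions concrete, here is a minimal sketch that derives both from the same four fields. The `Population` class, its field names and the `COUNTRY_NAMES` lookup are hypothetical, made up for illustration; nothing like this exists in the current code as far as I know.

```python
from dataclasses import dataclass

# Hypothetical lookup from two-letter country code to display name.
# Only the countries mentioned in this thread are listed.
COUNTRY_NAMES = {"AO": "Angola", "BF": "Burkina Faso"}


@dataclass(frozen=True)
class Population:
    country: str  # two-letter country code, e.g. "AO"
    site: str     # collection site name, e.g. "Luanda"
    year: int     # collection year
    taxon: str    # species epithet, e.g. "coluzzii"

    @property
    def pop_id(self) -> str:
        # Lower-case the site and replace spaces so the ID stays a single token.
        site = self.site.lower().replace(" ", "-")
        return f"{self.country}_{site}_{self.year}_{self.taxon}"

    @property
    def label(self) -> str:
        # Human-readable form for tables and plot labels.
        country_name = COUNTRY_NAMES[self.country]
        return f"{country_name}, {self.site}, {self.year}, An. {self.taxon}"


pop = Population(country="AO", site="Luanda", year=2009, taxon="coluzzii")
print(pop.pop_id)  # AO_luanda_2009_coluzzii
print(pop.label)   # Angola, Luanda, 2009, An. coluzzii
```

Deriving both strings from one record would keep the ID and the label from drifting apart, though storing the label explicitly in the definitions file would work just as well.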
I wonder if it would be more convenient to store the queries that selected the samples, rather than the sample IDs. E.g., in the population definitions file, rather than listing sample IDs explicitly, give the sample set ID and the query that selected the samples.
E.g., population_definitions.yml could be something like:
BF_bana_2012_coluzzii:
  label: Burkina Faso, Bana, 2012, An. coluzzii
  samples:
    - sample_set: AG1000G-BF-A
      query: location = "Bana" and species_aim = "coluzzii"
BF_pala_2012_coluzzii:
  label: Burkina Faso, Pala, 2012, An. coluzzii
  samples:
    - sample_set: AG1000G-BF-A
      query: location = "Pala" and species_aim = "coluzzii"
# etc.
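A rough sketch of how definitions like these could be resolved back into sample IDs. Everything here is an assumption for illustration: the definitions are written as a plain dict mirroring the YAML (so no YAML parser is needed), the metadata rows and sample IDs are made up, and the query grammar is limited to `field = "value"` clauses joined by `and`, as in the example. A real implementation would more likely hand the query string to whatever query engine the metadata table already supports.

```python
import re

# Plain-dict mirror of one entry from the proposed definitions file.
POPULATIONS = {
    "BF_bana_2012_coluzzii": {
        "label": "Burkina Faso, Bana, 2012, An. coluzzii",
        "samples": [
            {
                "sample_set": "AG1000G-BF-A",
                "query": 'location = "Bana" and species_aim = "coluzzii"',
            },
        ],
    },
}

# Made-up sample metadata rows, for illustration only.
METADATA = [
    {"sample_id": "S1", "sample_set": "AG1000G-BF-A", "location": "Bana", "species_aim": "coluzzii"},
    {"sample_id": "S2", "sample_set": "AG1000G-BF-A", "location": "Bana", "species_aim": "gambiae"},
    {"sample_id": "S3", "sample_set": "AG1000G-BF-A", "location": "Pala", "species_aim": "coluzzii"},
    {"sample_id": "S4", "sample_set": "AG1000G-BF-B", "location": "Bana", "species_aim": "coluzzii"},
]

# One clause of the assumed grammar: field = "value"
CLAUSE = re.compile(r'\s*(\w+)\s*=\s*"([^"]*)"\s*')


def parse_query(query):
    """Turn 'a = "x" and b = "y"' into {"a": "x", "b": "y"}."""
    conditions = {}
    # Naive split: values that themselves contain " and " are not supported.
    for clause in re.split(r"\s+and\s+", query.strip()):
        match = CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        conditions[match.group(1)] = match.group(2)
    return conditions


def resolve_samples(definition, metadata):
    """Return the sample IDs selected by every selector in a definition."""
    sample_ids = []
    for selector in definition["samples"]:
        conditions = parse_query(selector["query"])
        for row in metadata:
            # Each query applies only within its own sample set.
            if row["sample_set"] != selector["sample_set"]:
                continue
            if all(row.get(field) == value for field, value in conditions.items()):
                sample_ids.append(row["sample_id"])
    return sample_ids


bana = resolve_samples(POPULATIONS["BF_bana_2012_coluzzii"], METADATA)
print(bana)  # ['S1']
```

Because `samples` is a list, a population spanning several sample sets would just add more selectors, each with its own query.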