malariagen / ag1000g-phase3-data-paper

new crosses meta data has fewer samples than crosses genotypes #40

Closed cclarkson closed 3 years ago

cclarkson commented 3 years ago

The crosses genotypes have data from 699 samples, as does the (old) crosses metadata at vo_agam_release/v3/metadata/general/AG1000G-X/samples.meta.csv.

The new crosses meta, however, has only 519 rows.

Also, in the text we talk about 15 crosses, five of which are new. If I run df.cross_id.unique() on the new metadata, I get 24 named crosses?
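For reference, the two checks described above (row count against the genotype samples, and the number of distinct crosses) can be sketched roughly as follows. This is only an illustration on toy data: the `sample_id` column name, the helper `summarise_metadata`, and the toy values are assumptions, and only `cross_id` comes from the metadata discussed here.

```python
import csv
import io


def summarise_metadata(metadata_csv, genotype_sample_ids):
    """Compare a metadata table against the sample IDs present in the genotypes.

    `metadata_csv` is CSV text with (assumed) columns `sample_id` and `cross_id`.
    """
    rows = list(csv.DictReader(io.StringIO(metadata_csv)))
    meta_ids = {row["sample_id"] for row in rows}
    cross_ids = sorted({row["cross_id"] for row in rows})
    # Samples that have genotypes but no metadata row.
    missing = sorted(set(genotype_sample_ids) - meta_ids)
    return {
        "n_rows": len(rows),
        "n_crosses": len(cross_ids),
        "cross_ids": cross_ids,
        "missing_from_metadata": missing,
    }


# Toy data only; these are not real sample or cross identifiers.
toy_metadata = "sample_id,cross_id\ns1,cross_A\ns2,cross_A\ns3,cross_B\n"
toy_genotype_ids = ["s1", "s2", "s3", "s4"]

summary = summarise_metadata(toy_metadata, toy_genotype_ids)
print(summary)
```

On the real files, `n_rows` would be compared with the 699 genotyped samples and `n_crosses` with the 15 crosses mentioned in the text.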

@hardingnj, any ideas what has happened here?

hardingnj commented 3 years ago

Thanks Chris.

We start with 809 (total derived samples). Then following sequencing and sequence QC we end up with 699.

In the crosses file that @jonbrenas curated (crosses.tsv), we have 568 samples, of which 519 are present in the 699. The remainder are samples where we do not know the pedigree.
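To make the accounting explicit: if the counts above hold, 699 - 519 = 180 genotyped samples are absent from crosses.tsv, and 568 - 519 = 49 curated samples are not among the 699. A minimal sketch of that reconciliation is below; the function names and the toy IDs are hypothetical, and only the three counts come from this thread.

```python
def reconcile(curated_ids, genotyped_ids):
    """Split sample IDs into those in both sets and those in only one."""
    curated, genotyped = set(curated_ids), set(genotyped_ids)
    return {
        "in_both": curated & genotyped,
        "curated_only": curated - genotyped,
        "genotyped_only": genotyped - curated,
    }


def remainder_counts(n_curated, n_genotyped, n_overlap):
    """Derive the sizes of the two non-overlapping groups from the totals."""
    return {
        "curated_only": n_curated - n_overlap,
        "genotyped_only": n_genotyped - n_overlap,
    }


# Counts quoted in the thread: 568 curated, 699 genotyped, 519 shared.
counts = remainder_counts(n_curated=568, n_genotyped=699, n_overlap=519)
print(counts)

# Toy IDs only, to show the set logic on actual identifiers.
groups = reconcile(["s1", "s2", "s3"], ["s2", "s3", "s4", "s5"])
print({name: sorted(ids) for name, ids in groups.items()})
```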

Then in cross.samples.meta.txt we have 11 crosses. I think that this file is the one I should use, but it needs the other 4 crosses adding.

So I guess there are 2 options to fix this: a) NH to subset the crosses.tsv file to those 15. b) @jonbrenas to add the 4 additional crosses to cross.samples.meta.txt and @hardingnj to point at this file instead.
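Option (a) could look something like the sketch below, assuming the table has a `cross_id` column and that the list of 15 cross IDs is supplied separately. The helper `subset_crosses`, the `sample_id` column, and the toy IDs are placeholders, not the real identifiers.

```python
def subset_crosses(rows, wanted_cross_ids):
    """Keep only rows whose cross_id is wanted; also report wanted IDs not found."""
    wanted = set(wanted_cross_ids)
    kept = [row for row in rows if row["cross_id"] in wanted]
    found = {row["cross_id"] for row in kept}
    return kept, sorted(wanted - found)


# Toy rows standing in for crosses.tsv; not real data.
toy_rows = [
    {"sample_id": "s1", "cross_id": "cross_A"},
    {"sample_id": "s2", "cross_id": "cross_B"},
    {"sample_id": "s3", "cross_id": "cross_C"},
]

kept, not_found = subset_crosses(toy_rows, ["cross_A", "cross_C", "cross_D"])
print([row["sample_id"] for row in kept])
print(not_found)
```

Reporting the wanted IDs that match no rows gives a quick check that all 15 crosses are actually present after subsetting.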