Closed sanjaynagi closed 9 months ago
This is the most pressing issue. @ChabbyTMD, @eddUG, I think we should try to get this sorted this month.
Until we get this done, we can't really start on the other bits (PCA, allele frequencies, filtering on SNP QCs). Those will be relatively straightforward. Once they are done, the bulk of the work is complete, and I think it's good to get there in good time, since we have about 4 months until PAMCA.
Hi Sanjay, please remind me, is this the dataset we used in the very beginning of this workflow development, but we only had a couple of samples?
Yeah, that's it, Trevor!
Update - following today's meeting, we now have a script which uses ffq to download the Sanger AgamDao run (1020 samples total). The samples have VBS sample names; we just need to figure out how to connect these VBS sample IDs to the Ag1000G sample IDs in the metadata. I have asked Alistair.
Update - @ChabbyTMD, we can match the samples to Ag1000G data! They are actually GAARD samples (GAARD is a Donnelly group project); I had forgotten this.
We can access the samples in Ag1000G by using the "3.2" release, with something like:
import malariagen_data
ag3 = malariagen_data.Ag3(pre=True)
# retrieve metadata
df_samples = ag3.sample_metadata(sample_sets="3.2")
# retrieve snp calls
ds_snps = ag3.snp_calls(region="2RL", sample_sets="3.2")
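For connecting the downloaded VBS sample IDs to the metadata, a minimal sketch with pandas might look like the following. The helper `match_sample_ids`, the `vbs_id` column, and the toy data are all hypothetical, and it is an assumption (worth checking against the real `df_samples`) that the VBS IDs appear in a metadata column such as `partner_sample_id`.

```python
import pandas as pd


def match_sample_ids(df_download, df_metadata, download_col, metadata_col):
    """Left-join downloaded samples onto metadata and flag unmatched rows.

    Hypothetical helper: the column names are passed in because the real
    ones are not confirmed in this thread.
    """
    merged = df_download.merge(
        df_metadata,
        how="left",
        left_on=download_col,
        right_on=metadata_col,
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    merged["matched"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


# Toy stand-ins (made-up IDs) for the ffq download list and for df_samples.
df_download = pd.DataFrame({"vbs_id": ["VBS00001", "VBS00002", "VBS00003"]})
df_metadata = pd.DataFrame(
    {
        "sample_id": ["AX0001-C", "AX0002-C"],
        "partner_sample_id": ["VBS00001", "VBS00003"],  # assumed join column
    }
)

matched = match_sample_ids(df_download, df_metadata, "vbs_id", "partner_sample_id")
print(matched[["vbs_id", "sample_id", "matched"]])
```

Any rows with `matched == False` are downloaded samples with no metadata entry under the assumed column, which would be the ones to follow up on.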
Closing this issue. It would be good to have, but the allele frequencies look right, and I have more important things to do.