Download previous AgamDao data and evaluate genotyping accuracy

sanjaynagi / AmpSeeker

A snakemake workflow for amplicon sequencing

https://sanjaynagi.github.io/AmpSeeker/


Download previous AgamDao data and evaluate genotyping accuracy #22

Closed sanjaynagi closed 9 months ago

sanjaynagi commented 1 year ago

Need to download a larger subset of the data (at least ~100-200 samples?)
And find the corresponding Ag1000g sample IDs so we can match them up
Then evaluate genotype concordance
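
For reference, the concordance step could look roughly like the sketch below. It is only an illustration, not the workflow's actual code: the function name `genotype_concordance`, the dictionary layout, and the example sample and site labels are all made up here. It assumes unphased diploid calls, with negative alleles marking missing data.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of comparable genotype calls that agree between two call sets.

    Each argument maps (sample_id, site) -> genotype tuple such as (0, 1).
    A negative allele marks a missing call; such pairs are skipped.
    Returns (concordance, number_of_calls_compared).
    """
    n_compared = 0
    n_match = 0
    for key, gt_a in calls_a.items():
        gt_b = calls_b.get(key)
        if gt_b is None:
            continue  # site or sample absent from the second call set
        if min(gt_a) < 0 or min(gt_b) < 0:
            continue  # missing call in either data set
        n_compared += 1
        # Sort alleles so that unphased 0/1 and 1/0 count as the same genotype.
        if sorted(gt_a) == sorted(gt_b):
            n_match += 1
    if n_compared == 0:
        return float("nan"), 0
    return n_match / n_compared, n_compared


# Hypothetical toy data: amplicon calls vs. whole-genome calls.
amplicon = {
    ("s1", "2L:100"): (0, 1),
    ("s1", "2L:200"): (1, 1),
    ("s2", "2L:100"): (0, 0),
    ("s2", "2L:200"): (-1, -1),  # missing in the amplicon data
}
wgs = {
    ("s1", "2L:100"): (1, 0),
    ("s1", "2L:200"): (0, 1),
    ("s2", "2L:100"): (0, 0),
    ("s2", "2L:200"): (0, 0),
}
concordance, n_compared = genotype_concordance(amplicon, wgs)
print(f"{n_compared} calls compared, concordance = {concordance:.3f}")
```

Reporting the number of calls compared alongside the fraction helps show how much of the data the missing calls removed.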

sanjaynagi commented 1 year ago

This is the most pressing issue, @ChabbyTMD, @eddUG. I think we should try to get this sorted this month.

Until we get this done, we can't really start on the other bits (PCA, allele frequencies, filtering on SNP QC metrics). Those will be relatively straightforward. Then the bulk of the work is complete, and I think it's good to get that done in good time since we have about 4 months until PAMCA.

ChabbyTMD commented 1 year ago

Hi Sanjay, please remind me: is this the dataset we used at the very beginning of this workflow's development, when we only had a couple of samples?

sanjaynagi commented 1 year ago

Yeah, that's it Trevor!

sanjaynagi commented 1 year ago

Update - following today's meeting we now have a script which uses ffq to download the Sanger AgamDao run (1020 samples total). The samples have VBS sample names, so we just need to figure out how to connect these VBS sample IDs to Ag1000g sample IDs in the metadata. I have asked Alistair.
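
A rough sketch of what an ffq-based download helper might look like, not the actual script from the meeting. It assumes `ffq --ftp <accession>` prints a JSON list of file records with `filetype` and `url` fields; the function names, the accession, and the URLs in the example are placeholders.

```python
import json
import subprocess


def fetch_ftp_records(accession):
    """Run `ffq --ftp` for one accession and return its raw JSON output.

    Needs ffq installed and network access; not called in the example below.
    """
    result = subprocess.run(
        ["ffq", "--ftp", accession],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def fastq_urls(ffq_json_text):
    """Pick FASTQ URLs out of ffq's JSON (assumed to be a list of file records)."""
    records = json.loads(ffq_json_text)
    return [rec["url"] for rec in records if rec.get("filetype") == "fastq"]


# Placeholder JSON in the assumed shape, so the parsing runs offline.
example_json = json.dumps([
    {"accession": "RUN0000001", "filetype": "fastq",
     "url": "ftp://example.org/RUN0000001_1.fastq.gz"},
    {"accession": "RUN0000001", "filetype": "fastq",
     "url": "ftp://example.org/RUN0000001_2.fastq.gz"},
    {"accession": "RUN0000001", "filetype": "bam",
     "url": "ftp://example.org/RUN0000001.bam"},
])
urls = fastq_urls(example_json)
print(urls)
```

Keeping the parsing separate from the ffq call means it can be checked without hitting the network.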

sanjaynagi commented 1 year ago

Update - @ChabbyTMD We can match the samples to Ag1000g data! They are actually GAARD samples, which is a Donnelly group project - I had forgotten this.

We can access the samples in Ag1000g by using the "3.2" release, with something like:

import malariagen_data
ag3 = malariagen_data.Ag3(pre=True)

# retrieve metadata 
df_samples = ag3.sample_metadata(sample_sets="3.2")

# retrieve snp calls
ds_snps = ag3.snp_calls(region="2RL", sample_sets="3.2")
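
Once the metadata is loaded, linking the amplicon sample names to Ag1000g sample IDs could be sketched as below. This is a hedged illustration only: which metadata column actually holds the VBS names is an assumption (`partner_sample_id` is used as a guess), the helper `match_sample_ids` is hypothetical, and all IDs in the example are invented. The rows are plain dicts, as you would get from `df_samples.to_dict("records")`.

```python
def match_sample_ids(amplicon_ids, metadata_rows, key="partner_sample_id"):
    """Map amplicon sample names to Ag1000g sample IDs.

    `key` is the metadata column assumed to hold the amplicon (VBS) names.
    Returns (matched, unmatched): a dict name -> sample_id, and a list of
    names with no metadata entry.
    """
    lookup = {row[key]: row["sample_id"] for row in metadata_rows}
    matched = {sid: lookup[sid] for sid in amplicon_ids if sid in lookup}
    unmatched = [sid for sid in amplicon_ids if sid not in lookup]
    return matched, unmatched


# Invented example rows standing in for the real sample metadata.
metadata_rows = [
    {"sample_id": "XX0001-C", "partner_sample_id": "VBS00001"},
    {"sample_id": "XX0002-C", "partner_sample_id": "VBS00002"},
]
amplicon_ids = ["VBS00001", "VBS00002", "VBS99999"]

matched, unmatched = match_sample_ids(amplicon_ids, metadata_rows)
print(matched)
print("no metadata match:", unmatched)
```

Listing the unmatched names explicitly makes it easy to spot samples that are missing from the release before computing concordance.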

sanjaynagi commented 9 months ago

Closing this issue. It would be good to do, but the allele frequencies look right and I have more important things to do.