Open dileksahin opened 9 months ago
Hi!
Thanks for sharing all the details - the main issue I see is that the reference populations only have 4,845 sites, while the individuals being assigned have 1,460,482 sites. The same sites/SNPs need to be used for the reference and assigned individuals.
Did you use that ANGSD command once and then split the Beagle file to get 2 reference and assignment files? If so, that should result in two files with the same number of SNPs. It seems odd that the reference populations only have four-thousand some SNPs given the parameters you specify, unless the ${all_population} variable is set too high for 60 individuals.
Do the pop_like_LOO.txt file from the reference populations have actual values (not nan)? If those are also nan then that's indicative of something else also being an issue
Cheers, matt
On Sat, Dec 9, 2023 at 1:31 PM dileksahin @.***> wrote:
Hi, I obtain a .pop_like.txt with all nan values after assignment and couldn't figure out what would be the problem.
I use a reference (breeding) population: 60 individuals from 12 populations 4845 sites in the beagle file Assignment (nonbreeding) population: 117 individuals from the same 12 populations, 1460482 sites in the beagle file
Here's my workflow: Reference and assignment populations genotype likelihoods using
-GL 1 -doMajorMinor 1 -doMaf 1 -doGlf 2 -doGeno 5 -doCounts 1 -dumpCounts 4 -doPost 1 -postCutoff 0.85 -minMapQ 10 -minQ 20 -minInd ${all_population} -SNP_pval 1e-12 -minMaf 0.05
Reference pop allele frequencies with ${breeding_beagle} and ${breeding_IDs} --get_reference_af --ne_obs --loo
Assignment with ${nonbreeding_beagle} ${breeding_allele_freq_pop_af.npy} --get_pop_like
The .pop_like.txt is filled with nan
I checked breeding and non breeding beagle and breeding allele frequency npy files, none of them seemingly has a problem. Any idea what nan's would be related to?
— Reply to this email directly, view it on GitHub https://github.com/mgdesaix/WGSassign/issues/5, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABURHJGDPCHJ5SNMTVGTSUDYITDD3AVCNFSM6AAAAABAOAZBAWVHI2DSMVQWIX3LMV43ASLTON2WKOZSGAZTIMBUGIZDGOI . You are receiving this because you are subscribed to this thread.Message ID: @.***>
-- Matt DeSaix, PhD Bird Genoscape Project https://www.birdgenoscape.org/ Department of Biology Colorado State University
Phone: (703) 431-0423 Website: https://mgdesaix.rbind.io
Thanks for the help!
Yes, the separate beagle files were the issue, the pop_like_LOO.txt
file had values. I did not think of running ANGSD once and splitting the beagle file but ran ANGSD separately on reference and assignment datasets. Seems like splitting beagle file according to individuals (columns) is a big challenge. As I had so many individuals, I instead ran the assignment dataset using angsd -sites
. Then the assignment worked. However, because angsd -sites
outputs counts.gz
file in the format of chrs_pos rather than chrs, I had issue with allele_counts_beagle.py
returning ValueError: invalid literal for int() with base 10.
Hi Dilek,
Great, I'm glad that worked out!
I agree - working with beagle files can be a little annoying and unfortunately no one, as far as I know, has written a handy little package for manipulating them yet (I've been meaning to when I get a chance). But for future reference, the "angsd -sites" option doesn't always work out and can annoyingly be off by several SNPs if you use 2 separate data sets. So, to get around that, if I already have all the individuals I plan to use for WGSassign as reference and assignment individuals, I specify the BAM list into ANGSD such that the reference individuals are all first and the assignment individuals come after...then for example, if I have 100 reference individuals and 60 assignment individuals, I can readily split the beagle file for the 2 sets:
zcat full.beagle.gz | cut -f1-303 | gzip -c > reference.beagle.gz
zcat full.beagle.gz | cut -f1-3,304- | gzip -c > assignment.beagle.gz
Another option to avoid going back to the "-sites" option, which can be slow too, is to use the following code to provide a list of the "chrom_pos" you want to subset to and provide with the beagle (I have a more in-depth explanation of that here https://github.com/mgdesaix/amre-adaptation/tree/main/02_PopulationGenetics):
awk 'NR==FNR{c[$1]++;next};c[$1]' ${ld_snps} <(zcat ${input}.beagle.gz) | gzip > ${output}.beagle.gz
Thank you for letting me know about the allele_counts_beagle.py
issue, I'll need to look into that
Cheers, matt
I'm also getting some -inf values for a handful of individuals in my pop_like.txt results. For all those individuals, all but one population is -inf, and that population happens to be the one with fewest reference individuals. Should I presume that the likelihood for the populations with -inf is 0, and therefore the valid likelihood value is the correct assignment?
Thanks for the awesome pipeline!
Hi Laura, thanks for reaching out! Interesting...is there anything else different about the individuals with -inf values being used for assignment compared to the rest of the individuals? Were these all produced in ANGSD? How many markers are being used?
I wouldn't presume those values are 0. Using just the reference individuals, can you successfully perform the leave-one-out assignment with --loo
?
Glad you're enjoying the software!
Hi, I obtain a
.pop_like.txt
with all nan values after assignment and couldn't figure out what would be the problem.I use a reference (breeding) population: 60 individuals from 12 populations 4845 sites in the beagle file Assignment (nonbreeding) population: 117 individuals from the same 12 populations, 1460482 sites in the beagle file
Here's my workflow: Reference and assignment populations genotype likelihoods using
-GL 1 -doMajorMinor 1 -doMaf 1 -doGlf 2 -doGeno 5 -doCounts 1 -dumpCounts 4 -doPost 1 -postCutoff 0.85 -minMapQ 10 -minQ 20 -minInd ${all_population} -SNP_pval 1e-12 -minMaf 0.05
Reference pop allele frequencies with
${breeding_beagle}
and${breeding_IDs}
--get_reference_af --ne_obs --loo
Assignment with
${nonbreeding_beagle} ${breeding_allele_freq_pop_af.npy} --get_pop_like
The
.pop_like.txt
is filled with nanI checked breeding and non breeding beagle and breeding allele frequency npy files, none of them seemingly has a problem. Any idea what nan's would be related to?