mgdesaix / WGSassign

Population assignment from genotype likelihoods for low-coverage whole-genome sequencing data
GNU General Public License v3.0
5 stars 0 forks source link

NaN in pop_like.txt after assignment #5

Open dileksahin opened 9 months ago

dileksahin commented 9 months ago

Hi, I obtain a .pop_like.txt with all nan values after assignment and couldn't figure out what would be the problem.

I use a reference (breeding) population: 60 individuals from 12 populations 4845 sites in the beagle file Assignment (nonbreeding) population: 117 individuals from the same 12 populations, 1460482 sites in the beagle file

Here's my workflow: Reference and assignment populations genotype likelihoods using

-GL 1 -doMajorMinor 1 -doMaf 1 -doGlf 2 -doGeno 5 -doCounts 1 -dumpCounts 4 -doPost 1 -postCutoff 0.85 -minMapQ 10 -minQ 20 -minInd ${all_population} -SNP_pval 1e-12 -minMaf 0.05

Reference pop allele frequencies with ${breeding_beagle} and ${breeding_IDs} --get_reference_af --ne_obs --loo

Assignment with ${nonbreeding_beagle} ${breeding_allele_freq_pop_af.npy} --get_pop_like

The .pop_like.txt is filled with nan

I checked breeding and non breeding beagle and breeding allele frequency npy files, none of them seemingly has a problem. Any idea what nan's would be related to?

mgdesaix commented 9 months ago

Hi!

Thanks for sharing all the details - the main issue I see is that the reference populations only have 4,845 sites, while the individuals being assigned have 1,460,482 sites. The same sites/SNPs need to be used for the reference and assigned individuals.

Did you use that ANGSD command once and then split the Beagle file to get 2 reference and assignment files? If so, that should result in two files with the same number of SNPs. It seems odd that the reference populations only have four-thousand some SNPs given the parameters you specify, unless the ${all_population} variable is set too high for 60 individuals.

Do the pop_like_LOO.txt file from the reference populations have actual values (not nan)? If those are also nan then that's indicative of something else also being an issue

Cheers, matt

On Sat, Dec 9, 2023 at 1:31 PM dileksahin @.***> wrote:

Hi, I obtain a .pop_like.txt with all nan values after assignment and couldn't figure out what would be the problem.

I use a reference (breeding) population: 60 individuals from 12 populations 4845 sites in the beagle file Assignment (nonbreeding) population: 117 individuals from the same 12 populations, 1460482 sites in the beagle file

Here's my workflow: Reference and assignment populations genotype likelihoods using

-GL 1 -doMajorMinor 1 -doMaf 1 -doGlf 2 -doGeno 5 -doCounts 1 -dumpCounts 4 -doPost 1 -postCutoff 0.85 -minMapQ 10 -minQ 20 -minInd ${all_population} -SNP_pval 1e-12 -minMaf 0.05

Reference pop allele frequencies with ${breeding_beagle} and ${breeding_IDs} --get_reference_af --ne_obs --loo

Assignment with ${nonbreeding_beagle} ${breeding_allele_freq_pop_af.npy} --get_pop_like

The .pop_like.txt is filled with nan

I checked breeding and non breeding beagle and breeding allele frequency npy files, none of them seemingly has a problem. Any idea what nan's would be related to?

— Reply to this email directly, view it on GitHub https://github.com/mgdesaix/WGSassign/issues/5, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABURHJGDPCHJ5SNMTVGTSUDYITDD3AVCNFSM6AAAAABAOAZBAWVHI2DSMVQWIX3LMV43ASLTON2WKOZSGAZTIMBUGIZDGOI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Matt DeSaix, PhD Bird Genoscape Project https://www.birdgenoscape.org/ Department of Biology Colorado State University

Phone: (703) 431-0423 Website: https://mgdesaix.rbind.io

dileksahin commented 9 months ago

Thanks for the help! Yes, the separate beagle files were the issue, the pop_like_LOO.txt file had values. I did not think of running ANGSD once and splitting the beagle file but ran ANGSD separately on reference and assignment datasets. Seems like splitting beagle file according to individuals (columns) is a big challenge. As I had so many individuals, I instead ran the assignment dataset using angsd -sites. Then the assignment worked. However, because angsd -sites outputs counts.gz file in the format of chrs_pos rather than chrs, I had issue with allele_counts_beagle.py returning ValueError: invalid literal for int() with base 10.

mgdesaix commented 9 months ago

Hi Dilek,

Great, I'm glad that worked out!

I agree - working with beagle files can be a little annoying and unfortunately no one, as far as I know, has written a handy little package for manipulating them yet (I've been meaning to when I get a chance). But for future reference, the "angsd -sites" option doesn't always work out and can annoyingly be off by several SNPs if you use 2 separate data sets. So, to get around that, if I already have all the individuals I plan to use for WGSassign as reference and assignment individuals, I specify the BAM list into ANGSD such that the reference individuals are all first and the assignment individuals come after...then for example, if I have 100 reference individuals and 60 assignment individuals, I can readily split the beagle file for the 2 sets:

zcat full.beagle.gz | cut -f1-303 | gzip -c > reference.beagle.gz
zcat full.beagle.gz | cut -f1-3,304- | gzip -c > assignment.beagle.gz

Another option to avoid going back to the "-sites" option, which can be slow too, is to use the following code to provide a list of the "chrom_pos" you want to subset to and provide with the beagle (I have a more in-depth explanation of that here https://github.com/mgdesaix/amre-adaptation/tree/main/02_PopulationGenetics):

awk 'NR==FNR{c[$1]++;next};c[$1]' ${ld_snps} <(zcat ${input}.beagle.gz) | gzip > ${output}.beagle.gz

Thank you for letting me know about the allele_counts_beagle.py issue, I'll need to look into that

Cheers, matt

laurahspencer commented 3 weeks ago

I'm also getting some -inf values for a handful of individuals in my pop_like.txt results. For all those individuals, all but one population is -inf, and that population happens to be the one with fewest reference individuals. Should I presume that the likelihood for the populations with -inf is 0, and therefore the valid likelihood value is the correct assignment?

Thanks for the awesome pipeline!

mgdesaix commented 3 weeks ago

Hi Laura, thanks for reaching out! Interesting...is there anything else different about the individuals with -inf values being used for assignment compared to the rest of the individuals? Were these all produced in ANGSD? How many markers are being used?

I wouldn't presume those values are 0. Using just the reference individuals, can you successfully perform the leave-one-out assignment with --loo ?

Glad you're enjoying the software!