simonhmartin / genomics_general

General tools for genomic analyses.

filterGenotypes.py, popgenWindows.py, popsFile "nan" #92

Open yangwukaidi opened 1 year ago

yangwukaidi commented 1 year ago

Hi Simon, thanks for the useful scripts! I'm having some issues running them.

First of all, a problem occurred while filtering the geno file. This is the command I used:

python filterGenotypes.py --threads 2 -i output.geno.gz -o output-1.geno.gz --minAlleles 2 --minCalls 10 --thinDist 1000

This is the output:

30000 lines read | 2 pods queued | 0 pods filtered | 0 pods sorted | 0 pods written | 0 good lines written.
50000 lines read | 4 pods queued | 2 pods filtered | 2 pods sorted | 2 pods written | 0 good lines written.
70000 lines read | 6 pods queued | 4 pods filtered | 4 pods sorted | 4 pods written | 0 good lines written.
90000 lines read | 8 pods queued | 6 pods filtered | 6 pods sorted | 6 pods written | 0 good lines written.
110000 lines read | 10 pods queued | 8 pods filtered | 8 pods sorted | 8 pods written | 0 good lines written.

Then I ran the calculation on the filtered file:

python popgenWindows.py -w 200000 -m 20000 -g output-1.geno.gz -o output-12.csv.gz -f phased -T 1 -p pop1 -p pop2 --popsFile Dxy-name.txt

This is the output:

started worker 0
Writing final results...
0 windows were tested. 0 results were written.
Done.

After that, I abandoned the filtering step and went straight to calculating genetic parameters. The command is as follows:

python popgenWindows.py -w 50000 -m 5000 -g output.geno.gz -o output-23.csv.gz -f phased -T 1 -p pop2 -p pop3 --popsFile Dxy-name.txt

Some of the output is as follows:

13014 windows queued, 13014 results received, 0 results written.
14230 windows queued, 14230 results received, 0 results written.
15439 windows queued, 15439 results received, 0 results written.
16530 windows queued, 16530 results received, 0 results written.
Writing final results...
17343 windows queued, 17343 results received, 0 results written.
17343 windows were tested. 0 results were written.
Done.

So, I lowered -m:

python popgenWindows.py -w 50000 -m 500 -g output.geno.gz -o output-12.csv.gz -f phased -T 1 -p pop1 -p pop2 --popsFile Dxy-name.txt

Some of the output is as follows:

16769 windows queued, 16768 results received, 9120 results written.
16893 windows queued, 16892 results received, 9176 results written.
17026 windows queued, 17025 results received, 9234 results written.
17148 windows queued, 17147 results received, 9293 results written.
17284 windows queued, 17283 results received, 9351 results written.
Writing final results...
17343 windows queued, 17343 results received, 9377 results written.
17343 windows were tested. 9377 results were written.
Done.

However, my results file still has problems, as follows:

scaffold | start | end | mid | sites | pi_pop1 | pi_pop2 | dxy_pop1_pop2 | Fst_pop1_pop2
LG01 | 100001 | 150000 | 123397 | 829 | nan | nan | nan | nan
LG01 | 250001 | 300000 | 264204 | 1523 | nan | nan | nan | nan
LG01 | 350001 | 400000 | 374815 | 513 | nan | nan | nan | nan
LG01 | 700001 | 750000 | 723642 | 915 | nan | nan | nan | nan
LG01 | 750001 | 800000 | 771308 | 727 | nan | nan | nan | nan
LG01 | 800001 | 850000 | 824155 | 812 | nan | nan | nan | nan
LG01 | 850001 | 900000 | 880109 | 578 | nan | nan | nan | nan
LG01 | 900001 | 950000 | 925854 | 785 | nan | nan | nan | nan
LG01 | 950001 | 1000000 | 978705 | 901 | nan | nan | nan | nan
LG01 | 1000001 | 1050000 | 1021747 | 2847 | nan | nan | nan | nan
LG01 | 1050001 | 1100000 | 1079445 | 1036 | nan | nan | nan | nan
LG01 | 1100001 | 1150000 | 1125246 | 631 | nan | nan | nan | nan
LG01 | 1250001 | 1300000 | 1273721 | 543 | nan | nan | nan | nan

Finally, I tried lowering -m again, but the results file was the same as above.

To solve the above problems, can you give some suggestions? Thanks! Wish you a happy life!

simonhmartin commented 1 year ago

Hi,

For me to diagnose the problem, you would need to send an example of your input geno file. For example the first 1,000 lines.
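One way to pull out such an excerpt is sketched below in Python. This is only an illustration, not part of the genomics_general scripts: the helper name `head_gz` and the output filename in the usage comment are made up for this example, and it assumes the geno file is gzip-compressed text, as the `.geno.gz` name suggests.

```python
import gzip
from itertools import islice


def head_gz(src, dst, n=1000):
    """Copy the first n lines of a gzipped text file into a new gzipped file.

    Hypothetical helper for sharing a small excerpt of a large geno file.
    Returns the number of lines actually written (fewer than n if the
    input is shorter).
    """
    written = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        # islice stops after n lines, so the whole file is never read.
        for line in islice(fin, n):
            fout.write(line)
            written += 1
    return written


# Example usage (the output filename is an arbitrary choice):
# head_gz("output.geno.gz", "output.first1000.geno.gz", n=1000)
```

Because the header is the first line of the file, it is carried over into the excerpt along with the first rows of genotypes.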

Some points:

Your initial attempt to filter to only variant sites does not make sense. My scripts require invariant sites to compute pi and dxy.

Also, your initial attempt to thin the data seems like a strange thing to do when calculating pi and dxy. Why reduce the amount of data?
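To illustrate why invariant sites matter here, a toy Python sketch follows. It is not the code used in popgenWindows.py. It assumes per-site pi is the proportion of haplotype pairs that differ at a site, and that window pi is the average over all sites in the window. The function names and the example alleles are invented for the illustration.

```python
from itertools import combinations


def site_diversity(alleles):
    """Proportion of haplotype pairs that carry different alleles at one site."""
    pairs = list(combinations(alleles, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


def window_pi(sites):
    """Average per-site diversity over every site in the window."""
    return sum(site_diversity(s) for s in sites) / len(sites)


# Toy window: 4 haplotypes, 10 sites, of which only 2 are variable.
variant_sites = [list("AATT"), list("CCCG")]
invariant_sites = [list("AAAA")] * 8

pi_all_sites = window_pi(variant_sites + invariant_sites)
pi_variant_only = window_pi(variant_sites)

print(f"pi with invariant sites:  {pi_all_sites:.4f}")
print(f"pi with variant sites only: {pi_variant_only:.4f}")
```

In this toy case the numerator (summed pairwise differences) is identical in both calculations; only the number of sites in the denominator changes, so dropping invariant sites inflates the per-site estimate. Thinning has a related cost: it leaves fewer sites per window to average over.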