ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

Pixy estimations are wrong #71

Closed lisagrigoreva closed 1 year ago

lisagrigoreva commented 1 year ago

Hello! I tried to use pixy with VCF file that includes nonvariable alleles and compared estimates with my in-house python script (based on the same PI formula) and vcftools. In my subset, I have 27 samples and I estimated PI per site. However, pixy gave an estimate NA whereas in-house script and vcftools gave me results for this site. Could it be something wrong with the python versions or your script itself? I'm using python 3.8.13

Here is my in-house script

#Read chromosomes one by one 
seqv=pd.read_table("~/chr5_1k.txt")
#Transpose matrix to have haplotypes in diagonal and 
test_transpose = np.transpose(seqv)
#Calculate number of differences in each row 
differences_list=[]
comparisons=0
#Iterate over each column 
for f in range(0, len(test_transpose.columns)):
    differences=0
    for i in range(len(test_transpose[f])):
        if str(test_transpose[f][i]) != '-':

            for j in range(i,len(test_transpose[0])):
                if str(test_transpose[f][j]) != '-':
                    if test_transpose[f][j]!= test_transpose[f][i]:
                        differences+= 1 
    differences_list.append(differences)

#Calculate number of nucleotides (will be turned to number of comparisons)
comparisons_list=[]
for f in range(0, len(test_transpose.columns)):
    comparisons=0
    for i in range(len(test_transpose[f])):
        if str(test_transpose[f][i])!='-':
            comparisons+=1
    comparisons_list.append(comparisons)

#Calculate combinations (n over k)
combinations=[]
for i in comparisons_list:
    combine=comb(i,2)
    combinations.append(combine)

#Merge with difference and combinations list 
difference=pd.DataFrame(differences_list, columns = ['differences']) #Add differences 
#To the differense list add position and number of combinations
difference['comb'] = combinations #Add combinations

##Implement window size
window_size=1
sequence_length=len(difference)
current_window=0

pi_raw_list=[]
pi_per_site_list=[]
start_win=[]
end_wind=[]

while current_window<=sequence_length-window_size:
    numerator=list(difference['differences'][current_window:current_window+window_size])
    denominator=list(difference['comb'][current_window:current_window+window_size])
    missing_data=0
    for i in range (0, len(numerator)):
        if numerator[i]==0 and denominator[i]==0:
            missing_data+=1
    mean_numerator=sum(numerator)/(window_size-missing_data)
    mean_denominator=sum(denominator)/(window_size-missing_data)
    pi_raw=mean_numerator/mean_denominator
    pi_per_site=pi_raw/(window_size-missing_data)
    pi_raw_list.append(pi_raw)
    pi_per_site_list.append(pi_per_site)
    start_win.append(current_window)
    end_wind.append(current_window+window_size)
    current_window+=window_size

zipped = list(zip(start_win, end_wind, pi_raw_list, pi_per_site_list))#Make one dataframe
df = pd.DataFrame(zipped, columns=['start', 'stop', 'pi_wind', 'pi_per_site'])
df
ksamuk commented 1 year ago

Hi there, as far as we know, our estimates are correct, but there are many reasons why an NA can be returned. If you can send me a subset of your VCF (that reproduces the NA) along with your populations file, I can have a look at what the cause of the difference is.

ksamuk commented 1 year ago

I'm going to close this issue as I haven't heard back, please feel free to reopen it if necessary.