soo-h / DASweepDetect

Detection and classification of selective sweeps using a domain-adaptive model
MIT License

Question about STEP2 runtime #3

Open Yuzhongzian opened 1 month ago

Yuzhongzian commented 1 month ago

In STEP2, I ran the sample commands one by one for each chromosome; however, after a week of running it has not finished, no errors have been reported, and it is still running. Is this normal speed? The real-data chromosome has 800,000 loci, N*2 is 134, and the sample number is 145. The rest are default parameters. Looking forward to your answer. Thank you, and wish you a good day.

soo-h commented 1 month ago

It seems like you're referring to STEP3 rather than STEP2, based on your description. If that's the case, you're likely calculating features for the real genome, which corresponds to STEP3. The total runtime depends on the size of your genome and the number of CPUs being used. If you're running the process on 16 CPU cores (the default), the calculations typically should not take longer than one week.

To verify that the process is running correctly, you can check the real_feature folder for files matching the pattern *chr*.npy. These files indicate that feature computation for the corresponding chromosome has completed successfully. I hope this helps you confirm whether the program is progressing as expected.
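
For example, a minimal Python sketch to list the finished chromosomes (real_feature is the output folder mentioned above; adjust the path if yours differs):

import glob

# each *chr*.npy file marks a chromosome whose feature computation is complete
for path in sorted(glob.glob('real_feature/*chr*.npy')):
    print(path)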

Yuzhongzian commented 1 month ago

Thanks for your reply. I am very sure that the problem lies in STEP2, because I finished the feature engineering for real genomic data (STEP3) very quickly. However, the command 'python3.8 DASD.py calc_domain -i simu_data -o simu_feature' in STEP2 has not produced any results even after half a month; it still shows that it is running, and no error is reported.

soo-h commented 4 weeks ago

It is indeed unusual for the process to run for nearly half a month. I suggest you first check the rep parameter setting and ensure that all CPUs are being used for feature computation. For STEP2, the runtime largely depends on the number of repetitions per simulation category (the rep parameter in configure.txt) and the number of CPUs utilized. If your rep value is set to 2500 and you are using 16 CPUs (the default setting), the feature computation should typically finish within a day.

In addition, you can inspect the files generated in the simu_feature folder to determine whether the program is running as expected. If, for example, your simulated data category is named chSel, check whether the folder contains a chSel.npy file (indicating that feature computation for chSel is complete) or multiple files named like chSel_*.npy (indicating that part of the features for chSel have been computed). If you observe the latter, it would help if you could share the files generated in the simu_feature folder and their timestamps so we can further verify whether the program is progressing.
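
For instance, a small Python sketch to list sizes and modification times of the generated files (assuming the simu_feature folder above):

import glob
import os
import time

# growing sizes and fresh timestamps suggest the computation is still progressing
for path in sorted(glob.glob('simu_feature/*')):
    st = os.stat(path)
    print(path, st.st_size, time.ctime(st.st_mtime))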

Yuzhongzian commented 4 weeks ago

Thanks again for your reply. My rep setting is 2500, --core is 32, and after a day of computation there is still no progress and no errors; the file size is always 46 bytes. The following are the generated files (no .npy files are generated; the vcf and simu_data files are the previous input files):

[screenshot of the generated files]

soo-h commented 4 weeks ago

Normally, the size of the intermediate files should increase progressively during computation. However, most of your intermediate files still remain at size 0 after a day. I suspect this issue stems from the simu_data files, as the program filters out population genomic simulation data in which the number of segregating sites is less than 250. You can check whether the number of segregating sites (segsites) is consistently below 250 across simulation repetitions using the following command:

grep 'segsite' file_name | awk '{if ($NF > 250) count++} END {print count}'

file_name is your simulation data; the command counts the repetitions with more than 250 segregating sites. If the printed count is small, i.e. most repetitions fall below 250, you may need to adjust the simulation parameters.
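
The same check in Python, a sketch assuming ms/discoal-style output where each repetition has a 'segsites: N' line:

import re

# count repetitions above and below the 250-segsite filter threshold
above = below = 0
with open('file_name') as fh:  # file_name: your simulation data
    for line in fh:
        match = re.match(r'segsites:\s*(\d+)', line)
        if match:
            if int(match.group(1)) > 250:
                above += 1
            else:
                below += 1
print('kept:', above, 'filtered out:', below)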

Yuzhongzian commented 3 weeks ago

Thank you for your patient reply!

I found that some files in my simu_data folder have segsites: 0, and some files show the discoal error: "trajectory too bigly. step= 500000000 freq = 0.005626. killing myself gently." At the same time, I found that the files in my simu_config folder are very different from your example files; I don't know whether that is the cause of the above problems. The following is my simu_config file:

rep 2500
hap 100
len 100000
Pt 2.52511e-08 2.90284e-06
Pr 7.532528434782607e-07 2.259758530434782e-06
ws
Pu 0.0005982817348574893 0
Pa 0.134 13.4
Pxx 0-0.475 0.525-1
en 0.141165435444 0 1.0
en 0.19614891854250002 0 524.6650128646128
en 0.272548102343 0 261.38879062010415
en 0.37870323083999996 0 231.9062642666145
en 0.526207126555 0 199.68026224813192
en 0.731164519305 0 289.2320440865463
en 1.015948620665 0 272.0882074433997
en 1.41165435444 0 189.9748561659159
en 1.961489185425 0 319.00228206810317
en 2.72548102343 0 364.93329057087874
en 3.7870323084 0 346.4296370267983
en 5.26207126555 0 366.3856725183859
en 7.31164519305 0 374.9726480629556
en 10.15948620665 0 394.3652306519253
en 14.116543544399999 0 468.5992208284258
en 19.61489185425 0 598.9680497593309
en 27.2548102343 0 598.9138238495566
en 37.870323084 0 766.7316693193319
en 52.6207126555 0 676.3985552562185
en 73.1164519305 0 944.3527826098504
en 101.59486206650001 0 1033.3115385313936
en 141.16543544400002 0 1297.3475260755156
en 196.1489185425 0 601.2499907529431
en 272.54810234300004 0 1563.95968947924
en 378.70323084 0 929.4815347290862
en 526.2071265550001 0 1030.5115507366606
en 731.164519305 0 810.8704017787153
en 1015.948620665 0 769.8952057723433
en 10159.48620665 0 925.8769231770707

Mine is low-depth WGS data from 145 pigs; the Ne value is 67 (N = 2*67), m is 3.6e-9, and the rest are set according to Parameter_Inference.md. There is nothing wrong with the parameters, but the results are very different. The mutation.avg.rate file generated by Relate is as follows:

0 2.07381e-06
35.7143 2.90284e-06
49.6248 2.54298e-06
68.9535 2.12655e-06
95.8106 1.70534e-06
133.128 1.30799e-06
184.981 9.88658e-07
257.031 8.19262e-07
357.143 6.87773e-07
496.248 5.26239e-07
689.535 3.91655e-07
958.106 2.9208e-07
1331.28 2.21488e-07
1849.81 1.63355e-07
2570.31 1.25917e-07
3571.43 9.61068e-08
4962.48 8.05514e-08
6895.35 6.77001e-08
9581.06 5.49536e-08
13312.8 4.60424e-08
18498.1 4.27424e-08
25703.1 2.52511e-08
35714.3 3.55306e-08
49624.8 -nan
68953.5 -nan
95810.6 -nan
133128 -nan
184981 -nan
257031 -nan
357143 -nan
3.57143e+06 -nan

The pop_size.coal is as follows:

0

0 35.7143 49.6248 68.9535 95.8106 133.128 184.981 257.031 357.143 496.248 689.535 958.106 1331.28 1849.81 2570.31 3571.43 4962.48 6895.35 9581.06 13312.8 18498.1 25703.1 35714.3 49624.8 68953.5 95810.6 133128 184981 257031 357143 3.57143e+06

0 0 0.00568931 1.08437e-05 2.17657e-05 2.45328e-05 2.84921e-05 1.96704e-05 2.09098e-05 2.99477e-05 1.78347e-05 1.559e-05 1.64227e-05 1.55282e-05 1.51726e-05 1.44265e-05 1.21411e-05 9.49852e-06 9.49938e-06 7.42021e-06 8.41118e-06 6.02456e-06 5.5059e-06 4.38534e-06 9.46247e-06 3.63776e-06 6.12095e-06 5.52086e-06 7.0163e-06 7.38972e-06 6.14478e-06 7.5104e-06 7.5104e-06

soo-h commented 3 weeks ago

There is an issue with the scaling of your simulation parameters. Specifically, the values for Pt and Pr should be scaled by 4NL, meaning the values you provided need to be multiplied by 4 * 87 * 100000 (since N ≈ 87.88 based on your coalescence rate).
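
As a rough check, a sketch of this scaling, with N derived from the first nonzero coalescence rate in the .coal file above:

# N = 1/(2 * coalescence rate); 0.00568931 is the most recent nonzero rate
N = 0.5 / 0.00568931   # ≈ 87.88
L = 100000             # simulated segment length
fourNL = 4 * N * L
# unscaled Pt bounds from the configure file above, multiplied by 4NL
print(2.52511e-08 * fourNL, 2.90284e-06 * fourNL)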

Additionally, considering that your study subject is pigs, the recombination rate parameters you set before scaling, 7.532528434782607e-07 and 2.259758530434782e-06, look somewhat high; values on the order of 1e-08 or 1e-07 would be more typical.

Lastly, the population decline between the two most recent time points in your demographic history appears quite abrupt, and I'm not sure whether this scenario can be successfully simulated. If not, you might generate a new time point between (35.7143, 87) and (49.6248, 46109) based on a population history model such as exponential population decline, and use it to replace (35.7143, 87). If you do this, please note that N will change, and any parameters scaled by N should be adjusted accordingly.
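
One possible reading of this suggestion, a sketch assuming exponential decline between the two points (the interpolated point is illustrative, not a verified parameter):

import math

# endpoints (generations ago, diploid N) from the reply above
t_recent, N_recent = 35.7143, 87.0
t_old, N_old = 49.6248, 46109.0

# exponential rate per generation (N grows going back in time)
b = math.log(N_old / N_recent) / (t_old - t_recent)

# midpoint as one candidate replacement time point
t_mid = (t_recent + t_old) / 2
N_mid = N_recent * math.exp(b * (t_mid - t_recent))
print(t_mid, round(N_mid))  # ~42.7 generations ago, N ≈ 2000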

Yuzhongzian commented 3 weeks ago

Hello, I multiplied Pa and Pr by 4NL respectively according to your suggestion, but the following error occurred: discoal: discoalFunctions.c:2067: makeGametesMS: Assertion `size < 40000' failed. I suspected it was a problem with the hap setting, so I reduced it to 50 and reran, but the above problem still occurred.

soo-h commented 2 weeks ago

There could be multiple reasons for the observed results. Could you please provide the modified parameters you are now using? Additionally, ensure that Pt and Pr are scaled by 4NL, while Pa should be scaled by 2N.
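
For concreteness, the Pa scaling can be checked against the configure file below (Pa 0.134 13.4 with N = 67):

# Pa = 2*N*s for the selection coefficient range s in [0.001, 0.1]
N = 67
print(2 * N * 0.001, 2 * N * 0.1)  # -> 0.134 13.4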

Yuzhongzian commented 2 weeks ago

Thanks for your reply! This is the content of chLink_configure.txt:

rep 2500
hap 50
len 100000
Pt 2.345762124447031 98.34428100140919
Pr 19.44149521812313 58.3244856543694
ws
Pu 0.0005982817348574893 0
Pa 0.134 13.4
Pxx 0-0.475 0.525-1
en 0.15177695455200002 0 1.0
en 0.21089359021500004 0 186.81908694430533
en 0.293035761994 0 104.42524104777253
en 0.40717065671999997 0 104.79071767400508
en 0.56576253869 0 72.89773309594122
en 0.7861267431900001 0 68.71482524112614
en 1.09231829407 0 79.6269225011553
en 1.5177695455200002 0 64.32351775185575
en 2.1089359021500003 0 79.89806660936941
en 2.93035761994 0 85.29068924360563
en 4.071706567200001 0 114.0004659180916
en 5.6576253869 0 129.45660639306516
en 7.861267431900001 0 118.13722271363075
en 10.9231829407 0 137.18277640726623
en 15.1776954552 0 148.2131737397478
en 21.0893590215 0 168.20601660892044
en 29.303576199400002 0 174.31167039968997
en 40.717065672000004 0 152.7046053682697
en 56.576253869 0 198.16317010275884
en 78.612674319 0 143.7905639043838
en 109.23182940700002 0 128.65555868706542
en 151.77695455200003 0 120.50764281394244
en 210.89359021500002 0 138.8468208354897
en 293.03576199400004 0 120.24918811382922
en 407.17065672 0 118.82476349579441
en 565.76253869 0 124.50144305724602
en 786.1267431900001 0 167.3354269707183
en 1092.31829407 0 30.767040882021572
en 10923.1829407 0 30.767040882021572

And here is the code I used to build the config file:


# make simu_config/: build the discoal configure files used in STEP2
import os
import math
import pandas
import numpy

def write_config(file, file_w, rep, hap, Len, Pt, Pr, ws, Pu, Pa, Px, Pxx, Pf, en):
    file_w.write(rep + '\n')
    file_w.write(hap + '\n')
    file_w.write(Len + '\n')
    file_w.write(Pt + '\n')
    if file == 'neut_configure.txt':
        for line in en:
            file_w.write(line + '\n')
    else:
        file_w.write(Pr + '\n')
        file_w.write(ws + '\n')
        file_w.write(Pu + '\n')
        file_w.write(Pa + '\n')
        if file == 'chLink_configure.txt':
            file_w.write(Pxx + '\n')
        elif file == 'chSel_configure.txt':
            file_w.write(Px + '\n')
        elif file == 'csLink_configure.txt':
            file_w.write(Pxx + '\n')
            file_w.write(Pf + '\n')
        elif file == 'csSel_configure.txt':
            file_w.write(Px + '\n')
            file_w.write(Pf + '\n')
        for line in en:
            file_w.write(line + '\n')

# step1_dir holds the Relate results (*_mutation_avg.rate and *_pop_size.coal)
def make_simu_config(work_dir, N, step1_dir):
    file_list = ['chLink_configure.txt', 'chSel_configure.txt',
                 'csLink_configure.txt', 'csSel_configure.txt',
                 'neut_configure.txt']

    chr_simu_config_dir = f"{work_dir}/chr_simu_config"
    os.makedirs(chr_simu_config_dir, exist_ok=True)

    with open(f'{chr_simu_config_dir}/label_configure.txt', 'w') as file2_w:
        for name in file_list:
            file2_w.write(name + '\n')

    for chrom in range(1, 19):
        simu_config_dir = f"{work_dir}/chr_simu_config/chr{chrom}_simu_config"
        os.makedirs(simu_config_dir, exist_ok=True)

        mutation_file = pandas.read_csv(
            f'{step1_dir}/{chrom}.recode.vcf_phased_mutation_avg.rate',
            sep=' ', header=None)

        # .coal file: line 0 = group labels, line 1 = epoch times,
        # line 2 = coalescence rates
        with open(f'{step1_dir}/{chrom}.recode.vcf_phased_pop_size.coal') as pop_size_coal:
            pop_size_coal_list = pop_size_coal.readlines()
        T = pop_size_coal_list[1].strip().split(' ')
        coalescence_rate = pop_size_coal_list[2].strip().split(' ')

        # drop epochs whose time or coalescence rate is 0 or NaN
        T = [None if str(x).lower() == 'nan' else x for x in T]
        coalescence_rate = [None if str(x).lower() == 'nan' else x for x in coalescence_rate]
        filtered_data1, filtered_data2 = zip(
            *[(x1, x2) for x1, x2 in zip(T, coalescence_rate)
              if x1 not in ['0', None] and x2 not in ['0', None]])
        filtered_data1 = list(filtered_data1)
        filtered_data2 = list(filtered_data2)

        # N = 1/(2 * most recent coalescence rate); 4NL scales theta and rho
        siNL = 4 * (0.5 / float(filtered_data2[0])) * 100000

        Mutation_Rate = mutation_file.iloc[:, 1].to_list()
        Mutation_len = float(len(Mutation_Rate))
        Mutation_Rate_noNA = [float(i) for i in Mutation_Rate if not math.isnan(i)]
        print(numpy.sum(Mutation_Rate_noNA) / Mutation_len)

        rep = 'rep 2500'
        hap = 'hap 50'
        Len = 'len 100000'
        Pt = f'Pt {min(Mutation_Rate_noNA)*siNL} {max(Mutation_Rate_noNA)*siNL}'
        Pr = f'Pr {(numpy.sum(Mutation_Rate_noNA)/Mutation_len)*siNL} {(numpy.sum(Mutation_Rate_noNA)/Mutation_len)*3*siNL}'
        ws = 'ws'
        Pu = 'Pu 0.0005982817348574893 0'  # default
        Pa = f'Pa {0.001*2*N} {0.1*2*N}'   # selection coefficient scaled by 2N
        Px = 'Px 0.475 0.525'
        Pxx = 'Pxx 0-0.475 0.525-1'
        Pf = 'Pf 0.001 0.2'

        # demographic history: times in units of 4*Na generations, sizes relative to Na
        Nt = [1 / (2 * float(i)) for i in filtered_data2]
        Na = Nt[0]
        t = [float(i) / (4 * Na) for i in filtered_data1]
        Nt_Na = [i / Na for i in Nt]
        en = [f'en {i} 0 {h}' for i, h in zip(t, Nt_Na)]

        for file in file_list:
            with open(f'{simu_config_dir}/{file}', 'w') as file_w:
                write_config(file, file_w, rep, hap, Len, Pt, Pr, ws, Pu, Pa, Px, Pxx, Pf, en)

soo-h commented 2 weeks ago

Thank you for your detailed explanation. The issue you are experiencing may be due to an excessively high recombination rate, which results in the number of segregating sites exceeding the limit of the discoal simulation software (40,000). Based on the script you provided, your effective population size (Na) appears to be 67 (calculated from the selection intensity parameters: 13.4 / 0.1 / 2 = 67). This implies an upper limit of approximately 2e-06 for the per-bp recombination rate and about 3e-06 for the mutation rate, and in certain regions of that parameter space the number of segregating sites can exceed the limit. I recommend dividing the upper limits for both the mutation and recombination rates by 5 or 10.
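
For reference, a sketch back-computing those per-bp rates from the chLink_configure.txt values above, assuming N = 67 and L = 100000:

N, L = 67, 100000
fourNL = 4 * N * L
print(98.34428100140919 / fourNL)  # per-bp mutation rate upper bound, ≈ 3.7e-06
print(58.3244856543694 / fourNL)   # per-bp recombination rate upper bound, ≈ 2.2e-06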

Additionally, based on the coalescence rate you provided earlier, your Na seems to equal 87. I am unsure why it comes out as 67 here, so I suggest checking your input files to rule out any unexpected mistakes.

Yuzhongzian commented 2 weeks ago

Thank you for your answer. According to your suggestion, my problem may lie in the output file and parameters. I have a question I would like to ask: for the input parameter of Relate, should the Ne value be the average of the first 50 generations or the value of the most recent generation? (I used GONe to calculate Ne, and the result was that the Ne of the most recent generation decreased sharply.)

soo-h commented 6 days ago

I think you should use the most recent Ne value rather than the average over the first 50 generations, because the most recent Ne is likely to reflect the population's current status more accurately.

Yuzhongzian commented 3 days ago

Thank you for your reply, I have solved this problem; refer to this answer: https://github.com/kr-colab/discoal/issues/14. But I encountered a new problem with the data annotation function, as follows: 'Error: The sample number of ./chr_simu_config/chr2_simu_feature/csLink_simulation_featureMap.npy is too small, check simulated data!' The following is my simu_feature folder content:

[screenshot of the simu_feature folder]

I checked the csLink.npy file with numpy; the data shape is (1398, 40, 200), so the number of samples is 1398. I wonder if that is too small? Can I set the minimum sample size for training the model to 1100?

I checked the csLink_simulation file and found no errors. I have no idea how to solve this problem; could you please help me?
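
A quick way to check the sample count of every feature file, a sketch assuming the folder path from the error message above:

import glob
import numpy as np

# the first axis of each feature array is the number of samples
for path in sorted(glob.glob('chr_simu_config/chr2_simu_feature/*.npy')):
    arr = np.load(path)
    print(path, arr.shape[0])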

soo-h commented 3 days ago

In general, chSel simulations are more likely to trigger this error because they have relatively few segregating sites. Additionally, in the files you provided, the chSel file is smaller than the csLink file, so it is worth checking whether the sample sizes for chSel and chLink are small as well. In previous tests, we encountered situations where chSel sample sizes fell below 1000, which indeed affected model training results. If your simulated data falls into this category, I would recommend increasing the number of simulations or slightly raising the mutation rate. As for a sample size of 1398, we have not specifically evaluated it (we have assessed a sample size of 1500, which is the default threshold, and it generally does not impact model performance). If the minimum sample size in your simulated data is 1398, you might consider lowering the threshold to 1398 for model training.