parklab / MosaicForecast

A mosaic detecting software based on phasing and random forest
MIT License
60 stars 21 forks source link

bug report on ReadLevel_Features_extraction.py #30

Open JunhuiLi1017 opened 1 year ago

JunhuiLi1017 commented 1 year ago

Hi there,

It seems that there is a bug in the script ReadLevel_Features_extraction.py

I used the same snv list and bam file to extract features of snv, it will output different numbers of features in different times, and also will output the different number of mosaic variants.

Thanks, Junhui

douym commented 1 year ago

Hi @JunhuiLi1017 ,

Thanks for your info. That’s odd, since ReadLevel_Features_extraction.py should not have any stochastic procedures, and ReadLevel_Features_extraction.py would not give genotype predictions.

Could you compare the two output files of ReadLevel_Features_extraction.py and tell me what features are different? For example, for one specific variant, are the feature(s) different in two output files? Further, what do the different lines that present in only one file look like?

Thanks,

Yanmei

JunhuiLi1017 commented 1 year ago

Hi @douym,

Thanks for your response.

Here are the details of the difference between 2 outputs(output1 and output2) from the same input file and script.

For all samples, output1 and output2 are all different. For example, I have an SNV list with 660 variants.

in the output1: we got 275 variants with feature information.

in the output2: we got 266 variants with feature information.

The no. of the variant with a common position between output1 and output2 is 224, where 35 variants are different for GC content and context. Here is an example of a specific mutation with the same position but GC content and context are different.

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

id | GCcontent | context -- | -- | -- -chr1-96186790-A-G | 0.45454546 | GCA -chr1-96186790-A-G | 0.33333333 | CTT

douym commented 1 year ago

Hi @douym,

Thanks for your response.

Here are the details of the difference between 2 outputs(output1 and output2) from the same input file and script.

For all samples, output1 and output2 are all different. For example, I have an SNV list with 660 variants.

in the output1: we got 275 variants with feature information.

in the output2: we got 266 variants with feature information.

The no. of the variant with a common position between output1 and output2 is 224, where 35 variants are different for GC content and context. Here is an example of a specific mutation with the same position but GC content and context are different.

id GCcontent context -chr1-96186790-A-G 0.45454546 GCA -chr1-96186790-A-G 0.33333333 CTT

Hi @JunhuiLi1017 ,

Thanks for kindly share a mini bam with me, but I could not repeat the bug you report. I ran for five times of the following command and the results are all the same: python ReadLevel_Features_extraction.py debug.bed debug.features ./ ../reference/hg19.fa hg19/k24.umap.wg.bw 1 bam

cat debug.features*|sort|uniq -c 5 id conflict_num mappability type length GCcontent ref_softclip alt_softclip querypos_p leftpos_p seqpos_p mapq_p baseq_p baseq_t ref_baseq1b_p ref_baseq1b_t alt_baseq1b_p alt_baseq1b_t sb_p context major_mismatches_mean minor_mismatches_mean mismatches_p AF dp mosaic_likelihood het_likelihood refhom_likelihood althom_likelihood mapq_difference sb_read12_p dp_diff indel_proportion_SNPonly alt2_proportion_SNPonly 5 debug~chr1~96186790~A~G 0 1.0 SNP 0 0.5238095238095238 0.0 0.0 0.05368455874274253 0.05368455874274253 0.7627475790602667 1.0 0.0001268495297680491 3.832496257329701 0.0015666667700549417 2.837196152055328 0.8500332141963067 0.6843881185220807 1.0 ATG 0.006867794947265147 0.015011037527593824 0.0025382235725089314 0.35714285714285715 42 0.5066377597179058 0.4933622402820942 1.6222399541098435e-29 6.569311047411848e-75 0.0 0.530584882534201 0.0 0.0

JunhuiLi1017 commented 1 year ago

Thanks, @douym , when I run bsub -n 5 -R rusage[mem=1000] -W 0:20 -q short -e $sam.snv.err -o $sam.snv.out -R 'span[hosts=1]' 'source ~/anaconda3/etc/profile.d/conda.sh && conda activate py3.7.1 && python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 4 bam > $snv_log', it always gives me inconsistent results, while when I run python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 4 bam > $snv_log directly, it will output consistent results.

update1: if I use bsub -n 1 instead of bsub -n 5, the output is always consistent. update2: if I use bsub -n 5 and python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 1 bam instead of python ReadLevel_Features_extraction.py $snv_bed $snv_out $bam_dir $REF $umap 4 bam, the output is always consistent.