Closed jgunti-code closed 3 years ago
Hello Again, I am running through the pipeline you provided and was able to get all the way through to the Genotype Predictions step. After running I noticed the files were empty and tried numerous different attempts. For example, changing the id and the way the chromosome name was written i.e chr1 vs 1. I found that in all my files the mappability score for SNPs was zero which seems highly unlikely. Once I changed the mappability score, by heading a few lines and hard coding the mappability to 1, I was able to get an output file for SNP.predictions. If you have any advice as to why I only get a mappability score of zero that would be great! Thank you.
- Jonathan
Hi @jgunti-code ,
Sorry for the late reply!
There should be no difference whether you use "chr1" or "1". Have you tried running the test file and see if it works properly?
If you still generate the wrong score, could you try to run bigWigAverageOverBed to see if it works properly: bigWigAverageOverBed k24.umap.wg.bw test.bed test.output
As you can see from README and downloads.sh, the bigWigAverageOverBed program could be downloaded through: wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigAverageOverBed
Umap score (k=24, GRCh37/hg19) could be generated using this way: wget https://bismap.hoffmanlab.org/raw/hg19.umap.tar.gz tar -zxvf hg19.umap.tar.gz cd hg19 fetchChromSizes hg19> hg19.chrom.sizes wigToBigWig <(zcat k24.umap.wg.gz) hg19.chrom.sizes k24.umap.wg.bw
and the "test.bed" should be in the format of "chr1 pos-1 pos" separated by tabs.
Please tell me if you have further questions!
Thanks,
Yanmei
Hi Yanmei, I have followed that and it is all in order -- the Umap score is correct and there is no difference in chr1 or 1. When I generate an input file for phasing and read level features. I have around 71,000 potential mosaics. I am running phasing with 100GB of memory and 20 threads and the mappability is always zero. The depth of coverage is around 500x for exomes. The deeper sequencing is because we want to look for low-level mosaics. I have found when I split the input file by chromosome it will fail on the larger chromosomes and work on the smaller ones. It looks as though I can only have an input file of around 2500 potential mosaics otherwise the mappability fails. Is this the case always? Do I need more memory?
Thank you for your help!
Regards,
Jonathan Gunti
Hi Yanmei, I have followed that and it is all in order -- the Umap score is correct and there is no difference in chr1 or 1. When I generate an input file for phasing and read level features. I have around 71,000 potential mosaics. I am running phasing with 100GB of memory and 20 threads and the mappability is always zero. The depth of coverage is around 500x for exomes. The deeper sequencing is because we want to look for low-level mosaics. I have found when I split the input file by chromosome it will fail on the larger chromosomes and work on the smaller ones. It looks as though I can only have an input file of around 2500 potential mosaics otherwise the mappability fails. Is this the case always? Do I need more memory?
Thank you for your help! Regards, Jonathan Gunti
Hi @jgunti-code ,
The high read-depth would not require more memories when calculating mappability scores, because bigWigAverageOverBed extract mappability scores from "k24.umap.wg.bw" over a list of positions (without considering read depths). I haven't tried running over 71,000 potential mosaics, maybe the site list is a bit too big and requires more memory.
Could you try mask the two lines in "ReadLevel_Features_extraction.py" (add "#" in front of the line) and see if the mappability scores are generated properly:
subprocess.run("rm "+tmp_filename, shell=True) subprocess.run("rm "+tmp_filename+".2", shell=True)
Thanks a lot, and best wishes,
Yanmei
Yeah for some reason that also didn't work. This is the input file I am working with. I can attach the bam as well if need be! I also attached a partial output of some of the logfile. Thank you.
Hi @jgunti-code ,
I tried using your file to generate the mappability score and there's no problem. The memory usage is only ~3G. Please find the input file and the output file as attached: Archive.zip
BTW, can you check if you have generated the mappability score file (after masking the two lines I mentioned)? There should be two files and the file name should be something like this: f7361100-7aed-42b1-a56a-50f802814e86 f7361100-7aed-42b1-a56a-50f802814e86.2
Thanks,
Yanmei
Hi Yanmei, Those files do exist and I did check bigWigAverageOverBed and when I create the "test.bed" file using the input I shared with you I am also able to get the mappability scores. I guess it's only when running phase.py or ReadLevel_Features_extraction.py that if the file is over 3000 lines I get a mappability score of zero. I split by chromosome and any chromosome that was more than 3000 input lines will generate a score of zero. So I am at a loss at this point. Thanks, Jonathan
Hi Yanmei, Those files do exist and I did check bigWigAverageOverBed and when I create the "test.bed" file using the input I shared with you I am also able to get the mappability scores. I guess it's only when running phase.py or ReadLevel_Features_extraction.py that if the file is over 3000 lines I get a mappability score of zero. I split by chromosome and any chromosome that was more than 3000 input lines will generate a score of zero. So I am at a loss at this point. Thanks, Jonathan
Hi Jonathan,
I did not encounter this error before, 3000 lines is not a big number (I've run over 200,000 lines before). It's strange that everything works fine but the mappability score is zero... Could you share with me a bam file which could allow me to repeat the error? BTW, since I'm going to deliver a baby the day after tomorrow, not sure if I could help solve this quickly...
Thanks,
Yanmei
Hi Yanmei, Yes! I can share a bam! Thank you for trying. I hope the delivery went well. I was able to create a bed file with the mappability scores and then replace them with the ones generated with ReadLevel_Features_extraction.py. I still have the issue so when you are back I can share a bam with you. Please let me know how you want it shared. Thank you!
Hi Yanmei, Yes! I can share a bam! Thank you for trying. I hope the delivery went well. I was able to create a bed file with the mappability scores and then replace them with the ones generated with ReadLevel_Features_extraction.py. I still have the issue so when you are back I can share a bam with you. Please let me know how you want it shared. Thank you!
Hi @jgunti-code ,
Thanks for asking! You can share with me at anytime.
Best wishes,
Yanmei
Hi Yanmei, Here is a link to download a bam I am having trouble with: https://northwestern.box.com/s/yd75ogk5g29351w2du7nqcyvr42q1ls0
Thanks!
Kind Regards,
Jonathan Gunti
Hi Yanmei, Here is a link to download a bam I am having trouble with: https://northwestern.box.com/s/yd75ogk5g29351w2du7nqcyvr42q1ls0
Thanks!
Kind Regards,
Jonathan Gunti
Hi again,
It seems "This shared file or folder link has been removed or is unavailable to you".
Hi Yanmei, Sorry about that! This one should work. https://northwestern.box.com/s/yd75ogk5g29351w2du7nqcyvr42q1ls0 Thank you so much! Kind Regards, Jonathan Gunti
Hi Yanmei, Sorry about that! This one should work. https://northwestern.box.com/s/yd75ogk5g29351w2du7nqcyvr42q1ls0 Thank you so much! Kind Regards, Jonathan Gunti
Hi @jgunti-code ,
very sorry for the long wait...
I tried the docker image (MF-38) and could not repeat your error. It seems the code is running fine (show first ten lines of ther results here):
id conflict_num mappability type length GCcontent ref_softclip alt_softclip querypos_p leftpos_p seqpos_p mapq_p baseq_p baseq_t ref_baseq1b_p ref_baseq1b_t alt_baseq1b_p alt_baseq1b_t sb_p context major_mismatches_mean minor_mismatches_mean mismatches_p AF dp mosaic_likelihood het_likelihood refhom_likelihood althom_likelihoomapq_difference sb_read12_p dp_diff indel_proportion_SNPonly alt2_proportion_SNPonly EEG_004_B1_sort_markdup~chr1~13868~A~G 0 0.0 SNP 0 0.2380952380952381 0.0 0.0 0.5875938479556577 0.5875938479556577 0.4476990724652935 0.5508020125828832 0.744881619545956 -0.32539568672798425 0.6547208460185769 0.03779644730092272 1.0 0.0 1.0 TTA 0.0020689655172413794 0.006896551724137931 0.009236735486806607 0.375 16 0.3245686731661487 0.6754313268338513 2.7791093687910772e-18 7.007145886475594e-32 4.033333333333333 0.5879120879120876 -97.03571428571428 0.0 0.0 EEG_004_B1_sort_markdup~chr1~14775~C~T 0 0.0 SNP 0 0.6666666666666666 0.0 0.0 0.003582482206393849 0.003582482206393849 0.028216369234670847 0.029790254323248156 0.5393736440818824 -0.613760437488925 0.07404722487654518 0.9542374472273957 0.5574801348628696 0.30436547097354627 0.44674704957827305 GCG 0.0017630282603059387 0.009895052473763124 1.660172640317226e-11 0.07958477508650519 289 1.0 2.0379786261021864e-51 3.858856142004603e-84 0.0 -6.115233736515201 0.6650723931076643 258.92857142857144 0.003448275862068965 0.0 EEG_004_B1_sort_markdup~chr1~17147~G~A 0 0.0 SNP 0 0.5714285714285714 0.0 0.0 0.6585056368711135 0.6585056368711135 0.19334914733846387 0.17202778386823592 0.7338696813515619 -0.3399825337497914 4.2267963975397085e-05 1.1893181550249474 0.49624247444426295 0.346651238158576 0.48180578960313547 CCT 0.0034573264548827987 0.007662835249042149 2.530328130773592e-10 0.08633093525179857 417 1.0 1.796361705588699e-71 3.499673885751355e-90 0.0 -3.1404199475065617 0.4877010200790722 701.7678571428571 0.0 0.0 EEG_004_B1_sort_markdup~chr1~17172~G~A 0 0.0 SNP 0 0.6190476190476191 0.0 0.0 0.06001758154017459 0.06001758154017459 0.29854703000082217 0.4907687748912911 0.1373334640493644 1.4857945806548658 0.12478645381458307 0.6915130929993575 0.0832645166635504 -0.5703518254720301 0.06742999383237483 CCT 0.0028508542081542854 0.009587888982338104 6.631060252998256e-15 0.11081081081081082 370 1.0 1.0341153601286023e-54 1.8285554524466946e-91 0.0 1.8428349025131587 0.8669070878311782 710.8571428571429 0.0 0.10154525386313466 EEG_004_B1_sort_markdup~chr1~17767~G~A 0 0.0 SNP 0 0.4761904761904762 0.0 0.0 0.6706403461201665 0.6706403461201665 0.4783595689668002 0.39502567206274863 0.47895128907746676 0.7079902873238717 0.04187904905857302 0.7685812837314336 0.2422880965362707 0.4803844614152614 0.38548714370579146 TCC 0.002063535161553082 0.007151979565772674 3.862968821950472e-12 0.17532467532467533 154 0.9999999999999933 6.645182008656438e-15 5.864794983713357e-87 0.0 -4.690871974336542 0.8340029811977039 7.285714285714306 0.0 0.0 EEG_004_B1_sort_markdup~chr1~182785~A~G 0 1.0 SNP 0 0.6190476190476191 0.0 0.0 6.574141203424211e-10 7.287412340337113e-10 8.700370376733047e-15 8.421824894591439e-29 2.0386544209594985e-07 5.195777987867754 0.2486861710859124 0.48728107661034414 0.28504940740261275 0.15944820103582016 3.264987790935439e-19 ATG 0.004077523282154549 0.013218390804597713 1.8988636961475827e-19 0.30456852791878175 197 0.9999976140865535 2.3859134464886526e-06 1.4774026302940673e-171 0.0 52.857785888077856 0.5373437081321408 106.10714285714286 0.0 0.0 EEG_004_B1_sort_markdup~chr1~182875~G~A 0 0.0 SNP 0 0.7142857142857143 0.0 0.0 0.0007650595188365523 0.0007650595188365523 0.4439387465367979 0.0033861853734253678 0.7767331300207586 0.2835788724805662 0.4142161782425252 0.12638738648169778 1.0 0.0 0.24027352149597245 CCG 0.0014840680925360104 0.015172413793103448 0.0002190419658141495 0.05952380952380952 84 0.9999999991481192 1.4170584352699602e-16 8.518806462406944e-10 2.738256242797951e-308 33.07848101265823 0.36040458128811553 131.71428571428572 0.0 0.0 EEG_004_B1_sort_markdup~chr1~183401~C~G 0 0.0 SNP 0 0.4761904761904762 0.0 0.0 0.07366111348452424 0.07366111348452424 0.18454106850439933 0.26359032534434523 0.6715306292394592 0.4240482445287109 0.3951080685904922 0.2765415273789217 1.0 0.0 0.16019112029046823 GGC 0.00289210233592881 0.007523510971786834 0.0005682018579942549 0.1506849315068493 73 0.9999999715499097 2.8450090259612778e-08 5.2860472092455556e-27 1.6916871575389062e-235 -6.032258064516128 0.11102298008401905 21.142857142857167 0.0 0.0 EEG_004_B1_sort_markdup~chr1~185312~C~T 0 0.0 SNP 0 0.7142857142857143 0.0 0.0 0.37054316863815384 0.37054316863815384 0.00407980673036277 0.21647380367544722 0.1303768784134333 -1.5126173738866993 0.04181108318004076 1.1171115157459985 0.15729920705028502 0.2637521893583148 0.12320805867053745 CCG 0.001983623572828973 0.007471264367816092 8.223750702091035e-06 0.03858520900321544 311 1.0 1.2148949913070974e-70 4.8089939054990855e-51 0.0 -5.578316610925307 0.7724202838915619 278.08928571428567 0.003205128205128205 0.0 EEG_004_B1_sort_markdup~chr1~185469~G~A 0 0.0 SNP 0 0.9047619047619048 0.003205128205128205 0.0 0.8961247779728454 0.9014237905781322 0.14076775323002116 0.5263796072032322 0.4810095610845384 0.7046797418975845 0.01649524892792101 1.7365238940568892 0.3961439091520741 0.6082269936647244 0.6019358018021199 CGG 0.005813439434129098 0.02015915119363396 2.0885838047359549e-19 0.1111111111111111 351 1.0 8.453955315749353e-52 2.3366772374598553e-143 0.0 -1.6121794871794874 1.0 169.16071428571433 0.002840909090909091 0.001941747572815534
Hello Again, I am running through the pipeline you provided and was able to get all the way through to the Genotype Predictions step. After running I noticed the files were empty and tried numerous different attempts. For example, changing the id and the way the chromosome name was written i.e chr1 vs 1. I found that in all my files the mappability score for SNPs was zero which seems highly unlikely. Once I changed the mappability score, by heading a few lines and hard coding the mappability to 1, I was able to get an output file for SNP.predictions. If you have any advice as to why I only get a mappability score of zero that would be great! Thank you.