nh13 / DWGSIM

Whole Genome Simulator for Next-Generation Sequencing
GNU General Public License v2.0

How to generate quality scores close to NA12878? #69

Closed liyewen521 closed 2 years ago

liyewen521 commented 3 years ago

Hello, this tool is very useful, thanks for your contribution!

I can generate fastq.gz files from the [reference sequence hg19](ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz), and the resulting FASTQ file looks like this:

@1_174180915_174181318_0_1_0_0_1:0:0_0:0:0_0/1
GATGGAAAAAGATATTCCATGTCAATGGAAACCAAAAAGGAGCAGGAATAGCTATCCTTCTATCAGACAAAATTGATTTTAAGACAAAAACTATTAGAAGAGACAAAGAAGATCACTGTATAATGACTAAAAGGTCAATTCAGCAAGAGG
+
223622512244522232226443312732220050202134223211222227221742233.4022-32562252524015141121521226231420542562322262321822443212023224322610422/253332132
@1_85886768_85887186_0_1_0_0_3:0:0_2:0:0_1/1
CTCACCCACTATTGCGAGGACAGCACCAAGGAGATGGTGCTAAACCATTCATGGGAAAATTGACCCCATGATCCAATCACCTCCCATCAGGCCCAACCTCCAACACTGGGGATTATAATTCAACATGAGACTTGAATGGGGACACAGATC
+
22422232332222222312462.1042145171112233/2120222424014.354302453624202322200301324462533130144143245522125/25523.1022310562220422612223412122.03115241
@1_113447338_113447029_1_0_0_0_3:1:0_4:0:0_2/1
TAATATTAGAACTTAACTTACAGGGTCGCCCAGATAATTAAATCTATAAAACCCTTAGCATAGTGCCTGACATGGAATAAATGCTCACTGTGTATTAATTTTGCTTACTCTTTCACAGATACCACTATTAAAGAATAGTTTTCAAAATGA
+
11042/11/423225244452223633253816122331154125542/5103425450443313323323343022112/2122233022.233-33/6202226016252532342342032633304204372245422322/2203

But the NA12878 FASTQ file looks like this:

@ERR194147.1 HSQ1004:134:C0D8DACXX:1:1104:3874:86238/1
GGTTCCTACTTCAGGGTCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACG
+
CC@FFFFFHHHHHJJJFHIIJJJJJJIHJIIJJJJJJJJIIGIJJIJJJIJJJIJIJJJJJJJJJJIJHHHHFFFDEEEEEEEEDDDCDDEEDDDDDDDDD
@ERR194147.2 HSQ1004:134:C0D8DACXX:2:2104:2852:75174/1
ACTTCAGGGTCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTC
+
@BBFDFFFHFFHHIHIJJJJFIHHFHFHJCIHFHIJJJJJJJIJIJIJJIIHIJJJJJJJBEGIGHIHGHHHEFCDFFEDEEDEEDDD?CCCDDDDDDDDC
@ERR194147.3 HSQ1004:134:C0D8DACXX:3:1101:1318:114841/1
CCAGCGTCTCGCAATGCTATCGCGTGCATACCCCCCAGACGAAAATACCAAATGCATGGAGAGCTCCCGTGAGTGGTTAATAGGGTGATAGACCTGTGATC
+
@@@FFFFFHHFFHGJJCGGHHIIGFHHGIGIF>GHIJJIJJFHGFCDD@CCDECDDEDD::<?CCABDAB8?::<@<@ACCDD9?BDECCCDDDCDCCD

How can I generate read quality scores close to those of NA12878? I know this is controlled by -q and -e, but I don't know how to choose accurate values.

Thanks!

nh13 commented 2 years ago

@liyewen521 this tool doesn't do the best job of modeling base qualities, but if you compute the mean qualities from the NA12878 FASTQ, you can then set them with -q/-Q.
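
For anyone landing here later, a minimal sketch of the "compute the mean qualities" step might look like the following. It is not part of DWGSIM. It assumes the FASTQ uses strict 4-line records and Phred+33 encoding (which matches the NA12878 reads quoted above), and the helper names `open_fastq` and `quality_stats` are made up for illustration. It reports the mean and standard deviation of per-base Phred scores; how those numbers map onto -q and -Q should be checked against the usage text printed by your `dwgsim` build.

```python
import gzip
import math


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def quality_stats(lines, offset=33):
    """Return (mean, standard deviation) of Phred scores over all bases.

    `lines` is any iterable of FASTQ lines in strict 4-line records.
    `offset` is the ASCII offset of the encoding (33 is an assumption
    that holds for Phred+33 / Illumina 1.8+ style files).
    """
    n = 0
    total = 0
    total_sq = 0
    for i, line in enumerate(lines):
        # Only every fourth line (index 3, 7, 11, ...) holds qualities.
        if i % 4 != 3:
            continue
        for ch in line.rstrip("\r\n"):
            q = ord(ch) - offset
            n += 1
            total += q
            total_sq += q * q
    if n == 0:
        raise ValueError("no quality values found")
    mean = total / n
    # Population variance; clamp tiny negative values from rounding.
    var = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(var)
```

With a real file this could be called as `with open_fastq("NA12878_1.fastq.gz") as fh: mean, sd = quality_stats(fh)`, where the file name is only a placeholder. If a single quality character is needed rather than a number, `chr(round(mean) + 33)` gives the Phred+33 character for the rounded mean. Note that a single mean and spread cannot reproduce the position-dependent quality drop visible toward the 3' end of the NA12878 reads.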