Those reads don't look high enough quality based on the 16-mer distribution, but I'd still say try the uncorrected option. The k-mer histogram may look different with the larger k-mer and the homopolymers compressed. If it still looks like the above, then your reads aren't high enough quality to skip correction. You can apply the suggestions in the FAQ to decrease the space used.
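If you want to preview that compressed-space histogram without launching a full Canu run, you can count homopolymer-compressed k-mers directly with meryl. A minimal sketch, assuming meryl is on your PATH and the reads are in reads.fastq.gz (a placeholder name); exact option order can vary between meryl versions:

# count homopolymer-compressed 22-mers from the reads
meryl count compress k=22 reads.fastq.gz output reads.hpc.k22.meryl

# print frequency/count pairs to inspect for a coverage peak
meryl histogram reads.hpc.k22.meryl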
@skoren thanks. I did what you suggested; what do you think (see below)? I can see there is a peak (at frequency 2), but I'm not sure whether it means the reads are of high enough quality to skip the correction step.
P.S.: I am copying only part of the entire file.
Number of 22-mers that are:
unique 0 (exactly one instance of the kmer is in the input)
distinct 4030482973 (non-redundant kmer sequences in the input)
present 32827508257 (...)
missing 17588155561443 (non-redundant kmer sequences not in the input)
             number of   cumulative   cumulative     presence
              distinct     fraction     fraction   in dataset
frequency        kmers     distinct        total       (1e-6)
--------- ------------ ------------ ------------ ------------
2 1254841656 0.3113 0.0765 0.000061
3 662503087 0.4757 0.1370 0.000091
4 433655786 0.5833 0.1898 0.000122
5 313489010 0.6611 0.2376 0.000152
6 237268518 0.7200 0.2809 0.000183
7 184059664 0.7656 0.3202 0.000213
8 145135980 0.8016 0.3556 0.000244
9 115845145 0.8304 0.3873 0.000274
10 93407136 0.8535 0.4158 0.000305
11 75998201 0.8724 0.4412 0.000335
12 62388033 0.8879 0.4641 0.000366
13 51678258 0.9007 0.4845 0.000396
14 43207931 0.9114 0.5029 0.000426
15 36441369 0.9205 0.5196 0.000457
16 31008624 0.9282 0.5347 0.000487
17 26607229 0.9348 0.5485 0.000518
18 22995514 0.9405 0.5611 0.000548
19 20007515 0.9454 0.5727 0.000579
20 17517691 0.9498 0.5833 0.000609
21 15425425 0.9536 0.5932 0.000640
22 13658537 0.9570 0.6024 0.000670
23 12145486 0.9600 0.6109 0.000701
24 10853354 0.9627 0.6188 0.000731
25 9738236 0.9651 0.6262 0.000762
26 8769288 0.9673 0.6332 0.000792
27 7927700 0.9693 0.6397 0.000822
28 7188911 0.9710 0.6458 0.000853
29 6544789 0.9727 0.6516 0.000883
30 5970442 0.9741 0.6571 0.000914
31 5464074 0.9755 0.6622 0.000944
32 5022570 0.9767 0.6671 0.000975
33 4618440 0.9779 0.6718 0.001005
34 4257592 0.9789 0.6762 0.001036
35 3942236 0.9799 0.6804 0.001066
Are these counted in homopolymer-compressed space?
@skoren It is the file ../0-mercounts/spp_sup.ms22.histogram, created after I ran Canu as: canu -d ID -p spp_sup minReadLength=5000 genomeSize=2.5g -untrimmed correctedErrorRate=0.12 maxInputCoverage=100 'batOptions=-eg 0.10 -sb 0.01 -dg 2 -db 1 -dr 3' -pacbio-hifi my.ont.fastq.gz
I don't see any peak there (other than the error peak), so I don't think skipping correction will work.
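For a mechanical version of that check, here is a rough awk sketch that scans the histogram table for a local maximum beyond the initial error slope (assuming the column layout shown above; a single noisy row can fool it):

# print the first frequency where the distinct-kmer count rises and then falls again;
# prints nothing if the counts only decrease (i.e. only the error slope is present)
awk '$1 ~ /^[0-9]+$/ { if (prev && $2 > prev) rising = 1;
                       if (rising && prev && $2 < prev) { print "peak near frequency", pf; exit }
                       prev = $2; pf = $1 }' ../0-mercounts/spp_sup.ms22.histogram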
@skoren, what about this new distribution (reads corrected with MaSuRCA)?
It looks better to me, but I wonder whether you think it's OK to skip the correction step.
Number of 22-mers that are:
unique 0 (exactly one instance of the kmer is in the input)
distinct 1827589066 (non-redundant kmer sequences in the input)
present 40744590960 (...)
missing 17590358455350 (non-redundant kmer sequences not in the input)
             number of   cumulative   cumulative     presence
              distinct     fraction     fraction   in dataset
frequency        kmers     distinct        total       (1e-6)
--------- ------------ ------------ ------------ ------------
2 39522448 0.0216 0.0019 0.000049
3 35123122 0.0408 0.0045 0.000074
4 51154993 0.0688 0.0095 0.000098
5 72721524 0.1086 0.0185 0.000123
6 94378455 0.1603 0.0324 0.000147
7 111619843 0.2213 0.0515 0.000172
8 121004121 0.2876 0.0753 0.000196
9 121485640 0.3540 0.1021 0.000221
10 114863871 0.4169 0.1303 0.000245
11 102955636 0.4732 0.1581 0.000270
12 90036682 0.5225 0.1846 0.000295
13 77830786 0.5651 0.2095 0.000319
14 67734055 0.6021 0.2328 0.000344
15 60347079 0.6351 0.2550 0.000368
16 55038408 0.6653 0.2766 0.000393
17 51238253 0.6933 0.2980 0.000417
18 48005332 0.7196 0.3192 0.000442
19 44818322 0.7441 0.3401 0.000466
20 41478396 0.7668 0.3604 0.000491
21 37879475 0.7875 0.3799 0.000515
22 34138346 0.8062 0.3984 0.000540
23 30243904 0.8227 0.4155 0.000564
24 26619910 0.8373 0.4311 0.000589
25 23277577 0.8500 0.4454 0.000614
26 20353915 0.8612 0.4584 0.000638
27 17818787 0.8709 0.4702 0.000663
28 15673425 0.8795 0.4810 0.000687
29 13896377 0.8871 0.4909 0.000712
30 12482389 0.8939 0.5001 0.000736
31 11268303 0.9001 0.5086 0.000761
32 10247712 0.9057 0.5167 0.000785
Yes, this looks better as you now have a peak in the 8-9x range, though that's still pretty low given you should have 30x+ coverage.
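As a rough reasoning step: a common approximation is that the k-mer coverage peak sits near (base coverage) × (1 − e)^k, where e is the per-base error rate. A back-of-the-envelope sketch, plugging in the ~9x peak and an assumed 30x base coverage (both numbers are ballpark):

# solve 30 * (1 - e)^22 = 9 for e
awk 'BEGIN { printf "implied per-base error rate: %.1f%%\n", 100 * (1 - exp(log(9/30)/22)) }'

That works out to roughly 5%, though heterozygosity and uneven coverage also flatten and shift the peak, so treat it as a ballpark figure only.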
I was wondering if I could skip the correction step of my assembly and jump directly to the trimming step and onward. My reads were basecalled with the SUP model in Guppy 5+, but I'm not sure whether the k-mer histogram of the reads indicates a low error rate. I mainly want to avoid the error correction because it used over 5 TB of disk in my previous runs and I don't have enough space.
The histogram does not show a peak as clear as the ones I have seen from other users who asked related questions. Also, I am thinking of using a minimum read length of 5 kbp so that the longest reads still give at least 25x coverage (see the quick check below): would that be OK?
I should mention the species is diploid, with a genome size of 2.5 Gb and 3.5% heterozygosity.
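As a quick check of how much coverage a 5 kbp cutoff would keep, here is a sketch assuming standard 4-line FASTQ records and the my.ont.fastq.gz file named elsewhere in this thread:

# total bases in reads >= 5 kbp, expressed as fold coverage of a 2.5 Gb genome
zcat my.ont.fastq.gz | awk 'NR % 4 == 2 && length($0) >= 5000 { sum += length($0) } END { printf "%.1fx retained\n", sum / 2.5e9 }'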
Thanks in advance. Here is the output of Meryl: