Closed lnyawen closed 1 year ago
Hello Yawen,
Seems like you provided the reads fa not the assembly. Each column in the .qv file is reporting:
There are 128774716107 k-mers found, which roughly could be translated as 128 Gb. This seems way larger than your expected 3G primate genome.
Best, Arang
Hi Arang,
I double-checked that the reads fa were used for genome assembly and re-ran the program with the same results like before. Then I must have made an error in the assessment somewhere. I use hifiasm to produce a primary assembly base HiFi reads. Here are the commands I used to run merqury.
meryl k=21 count */reads/*.fasta output genome.meryl
$MERQURY/merqury.sh genome.meryl genome.fa aloCar
Are there any errors in the commands I'm using? Or, what could be wrong to cause the error?
Thanks.
Can you double check what's in your genome.fa
? Just open it, read what's in there.
And to make sure genome.meryl
is from the reads, you could do a meryl statistics genome.meryl | head
and see how many distinct / present k-mers are in there.
Then try a fresh run of merqury inside a new directory.
Hello, I double checked my genome.fa, and didn't see anything unusual.
>ptg000001l 29205752
ccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacc
And I sure genome.meryl
is from the reads. Using the command meryl statistics genome.meryl | head
get the following result
Found 1 command tree.
Number of 21-mers that are:
unique 2903014934 (exactly one instance of the kmer is in the input)
distinct 5696767959 (non-redundant kmer sequences in the input)
present 128774716107 (...)
missing 4392349743145 (non-redundant kmer sequences not in the input)
number of cumulative cumulative presence
distinct fraction fraction in dataset
frequency kmers distinct total (1e-6)
--------- ------------ ------------ ------------ ------------
Besides, I did the calculation in another directory as you said and still got the same result as before. What do you think the problem to be?
Oh, are you actually using names like genome.meryl
and genome.fa
?
Could you rename or symlink the fa (or read meryl) to some other name, like asm_genome.fa
and see if this fixes the issue?
There is an intermediate step where Merqury generates kmers for genome.fa
naming it genome.meryl
.
It checks if this genome.meryl
already exists, and skips it. So what happened, is that it used genome.meryl
once for the reads and once for the assembly.
Hello, Sorry for getting back to you late, since our cluster recently crashed. I rename the genome.fa, and get the result like
ptg000001l 614 29205732 59.9951 1.00112e-06
ptg000002l 1733 28930369 55.4476 2.85258e-06
ptg000004l 2323 28902157 54.1708 3.82751e-06
ptg000003l 6431 130221249 56.2861 2.35173e-06
ptg000005l 9083 193548461 56.5077 2.23476e-06
ptg000009l 47628 80758128 45.5142 2.80918e-05
ptg000008l 3483 82045046 56.9431 2.02158e-06
ptg000007l 3633 95797282 57.433 1.80593e-06
ptg000010l 962 109189894 63.7722 4.19542e-07
ptg000006l 4967 111080099 56.7175 2.12935e-06
ptg000013l 1966 45419350 56.8587 2.06126e-06
ptg000011l 3458 133456698 59.0873 1.23387e-06
ptg000012l 11627 231337207 56.2099 2.39339e-06
ptg000014l 734 14506840 56.1809 2.40943e-06
ptg000015l 3128 50587482 55.3098 2.94454e-06
ptg000017l 5907 124641354 56.465 2.25681e-06
ptg000016l 11339 248422849 56.6283 2.17357e-06
ptg000019l 0 20003947 inf 0
ptg000021l 4408 129746190 57.9107 1.61784e-06
It looks all right.
Thanks for your help!
Yawen
ptg000019l 0 20003947 inf 0 Why there are still some sequences QV is inf.
I encountered a similar issue.
This is the QV of each chromosome chr01 16 44979429 77.2765 1.8722e-08 chr02 0 39075466 inf 0 chr03 0 40194687 inf 0 chr04 5 37830077 81.5762 6.95631e-09 chr05 8627 31667954 48.4346 1.43398e-05 chr06 0 32120109 inf 0 chr07 31 30523787 72.7203 5.34527e-08 chr08 12 31892237 77.0326 1.98035e-08 chr09 42485 24976068 40.4769 8.96e-05 chr10 50159 25631225 39.8677 0.000103093 chr11 0 32232251 inf 0 chr12 0 27951201 inf 0
This is the QV of complete genome T2T 101335 399074491 48.74 1.33661e-05
The significant difference in QV values calculated by these two methods might be due to the presence of 'inf' QV values in the chromosome-wise statistics.
Hello @arangrhie,
Using HiFi reads, I assembled the genomes of two different primates and assessed them. Given that the primate genome is around 3G in size, I chose a kmer of 21, and the results for both .qv were +inf. Furthermore, there were no errors noted in the log file. Is there a problem with the results I'm running, or is this because the results assembled using HiFi readings are too good? The .qv file is shown below
aloCar 0 128774716107 inf 0
Thank you for your help
Yawen