marbl / merqury

k-mer based assembly evaluation
Other
272 stars 19 forks source link

QV is inf #95

Closed lnyawen closed 1 year ago

lnyawen commented 1 year ago

Hello @arangrhie,

Using HiFi reads, I assembled the genomes of two different primates and assessed them. Given that the primate genome is around 3G in size, I chose a kmer of 21, and the results for both .qv were +inf. Furthermore, there were no errors noted in the log file. Is there a problem with the results I'm running, or is this because the results assembled using HiFi readings are too good? The .qv file is shown below aloCar 0 128774716107 inf 0

Thank you for your help

Yawen

arangrhie commented 1 year ago

Hello Yawen,

Seems like you provided the reads fa not the assembly. Each column in the .qv file is reporting:

  1. assembly of interest. Both is combined of the above two.
  2. k-mers uniquely found only in the assembly
  3. k-mers found in both assembly and the read set
  4. QV
  5. Error rate

There are 128774716107 k-mers found, which roughly could be translated as 128 Gb. This seems way larger than your expected 3G primate genome.

Best, Arang

lnyawen commented 1 year ago

Hi Arang,

I double-checked that the reads fa were used for genome assembly and re-ran the program with the same results like before. Then I must have made an error in the assessment somewhere. I use hifiasm to produce a primary assembly base HiFi reads. Here are the commands I used to run merqury.

meryl k=21 count */reads/*.fasta output genome.meryl
$MERQURY/merqury.sh genome.meryl genome.fa aloCar

Are there any errors in the commands I'm using? Or, what could be wrong to cause the error?

Thanks.

arangrhie commented 1 year ago

Can you double check what's in your genome.fa? Just open it, read what's in there. And to make sure genome.meryl is from the reads, you could do a meryl statistics genome.meryl | head and see how many distinct / present k-mers are in there. Then try a fresh run of merqury inside a new directory.

lnyawen commented 1 year ago

Hello, I double checked my genome.fa, and didn't see anything unusual.

>ptg000001l 29205752
ccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacc

And I sure genome.meryl is from the reads. Using the command meryl statistics genome.meryl | head get the following result

Found 1 command tree.
Number of 21-mers that are:
  unique             2903014934  (exactly one instance of the kmer is in the input)
  distinct           5696767959  (non-redundant kmer sequences in the input)
  present          128774716107  (...)
  missing         4392349743145  (non-redundant kmer sequences not in the input)

             number of   cumulative   cumulative     presence
              distinct     fraction     fraction   in dataset
frequency        kmers     distinct        total       (1e-6)
--------- ------------ ------------ ------------ ------------

Besides, I did the calculation in another directory as you said and still got the same result as before. What do you think the problem to be?

arangrhie commented 1 year ago

Oh, are you actually using names like genome.meryl and genome.fa? Could you rename or symlink the fa (or read meryl) to some other name, like asm_genome.fa and see if this fixes the issue? There is an intermediate step where Merqury generates kmers for genome.fa naming it genome.meryl. It checks if this genome.meryl already exists, and skips it. So what happened, is that it used genome.meryl once for the reads and once for the assembly.

lnyawen commented 1 year ago

Hello, Sorry for getting back to you late, since our cluster recently crashed. I rename the genome.fa, and get the result like

ptg000001l      614     29205732        59.9951 1.00112e-06
ptg000002l      1733    28930369        55.4476 2.85258e-06
ptg000004l      2323    28902157        54.1708 3.82751e-06
ptg000003l      6431    130221249       56.2861 2.35173e-06
ptg000005l      9083    193548461       56.5077 2.23476e-06
ptg000009l      47628   80758128        45.5142 2.80918e-05
ptg000008l      3483    82045046        56.9431 2.02158e-06
ptg000007l      3633    95797282        57.433  1.80593e-06
ptg000010l      962     109189894       63.7722 4.19542e-07
ptg000006l      4967    111080099       56.7175 2.12935e-06
ptg000013l      1966    45419350        56.8587 2.06126e-06
ptg000011l      3458    133456698       59.0873 1.23387e-06
ptg000012l      11627   231337207       56.2099 2.39339e-06
ptg000014l      734     14506840        56.1809 2.40943e-06
ptg000015l      3128    50587482        55.3098 2.94454e-06
ptg000017l      5907    124641354       56.465  2.25681e-06
ptg000016l      11339   248422849       56.6283 2.17357e-06
ptg000019l      0       20003947        inf     0
ptg000021l      4408    129746190       57.9107 1.61784e-06

It looks all right.

Thanks for your help!

Yawen

caaswangzhen commented 3 months ago

ptg000019l 0 20003947 inf 0 Why there are still some sequences QV is inf.

I encountered a similar issue.

This is the QV of each chromosome chr01 16 44979429 77.2765 1.8722e-08 chr02 0 39075466 inf 0 chr03 0 40194687 inf 0 chr04 5 37830077 81.5762 6.95631e-09 chr05 8627 31667954 48.4346 1.43398e-05 chr06 0 32120109 inf 0 chr07 31 30523787 72.7203 5.34527e-08 chr08 12 31892237 77.0326 1.98035e-08 chr09 42485 24976068 40.4769 8.96e-05 chr10 50159 25631225 39.8677 0.000103093 chr11 0 32232251 inf 0 chr12 0 27951201 inf 0

This is the QV of complete genome T2T 101335 399074491 48.74 1.33661e-05

The significant difference in QV values calculated by these two methods might be due to the presence of 'inf' QV values in the chromosome-wise statistics.