lbcb-sci / herro

HERRO is a highly-accurate, haplotype-aware, deep-learning tool for error correction of Nanopore R10.4.1 or R9.4.1 reads (read length of >= 10 kbps is recommended).

Poor assembly metrics with hifiasm v0.19.8 #35

Open chklopp opened 1 month ago

chklopp commented 1 month ago

Thank you for providing herro.

I've tested it on four ONT datasets, one of which is public. The number of reads and nucleotides after correction is very variable, ranging in my case from 2 to 20% of the reads and 2 to 60% of the nucleotides. What do these metrics look like in the cases you've tested? The set which did not work has the highest coverage. Is there a coverage limit to respect?

For the sets which did work, I tried a hifiasm (v0.19.8) assembly, but in all cases the metrics were poor. The hifiasm log shows that there are remaining errors which are not removed by the 3 correction cycles.

For example, for the public dataset (data available at https://www.ncbi.nlm.nih.gov/bioproject/781898):

Number of kmers found once in the read set = errors

grep 'ha_hist_line' slurm-7727010.out | grep ' 1:'
[M::ha_hist_line]     1: ****************************************************************************************************> 52175410
[M::ha_hist_line]     1: ****************************************************************************************************> 45429496
[M::ha_hist_line]     1: ****************************************************************************************************> 41842772
[M::ha_hist_line]     1: ****************************************************************************************************> 39899872
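
To put a number on how slowly these counts fall, the four values above can be pulled out of the log and compared round to round. The following is only a minimal sketch: the function names `singleton_counts` and `relative_drops` are hypothetical helpers, the regular expression assumes the log lines look exactly like the ones shown above, and the asterisk bars in the sample are shortened for readability.

```python
import re

# Matches hifiasm histogram lines for k-mers seen exactly once, e.g.
# "[M::ha_hist_line]     1: *****> 52175410"
# Assumption: the line layout is the same as in the log excerpt above.
SINGLETON_RE = re.compile(
    r"^\[M::ha_hist_line\]\s+1:\s+\**>?\s*(\d+)\s*$", re.MULTILINE
)


def singleton_counts(log_text):
    """Return the singleton k-mer count of each histogram, in log order."""
    return [int(m.group(1)) for m in SINGLETON_RE.finditer(log_text)]


def relative_drops(counts):
    """Return the fractional decrease between consecutive counts."""
    return [(a - b) / a for a, b in zip(counts, counts[1:])]


# Sample built from the four values reported above (bars shortened).
SAMPLE_LOG = """\
[M::ha_hist_line]     1: **********> 52175410
[M::ha_hist_line]     1: **********> 45429496
[M::ha_hist_line]     1: **********> 41842772
[M::ha_hist_line]     1: **********> 39899872
"""

counts = singleton_counts(SAMPLE_LOG)
for before, after, drop in zip(counts, counts[1:], relative_drops(counts)):
    print(f"{before} -> {after}: {drop:.1%} fewer singleton k-mers")
```

For a real run, the log file contents (here `slurm-7727010.out`) would be read and passed to `singleton_counts` instead of `SAMPLE_LOG`.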

Compared to other assemblies, this kmer error count stays very high, whereas it should drop quickly with the correction cycles. And when I extract contig coverages from the gfa file, they are very low, while they should be around 10.

awk '/^S/{print $2"\t"$4"\t"$5}' hifiasm_0.19.8_no_HiC.bp.hap1.p_ctg.gfa \
| sed 's/LN:i://;s/rd:i://' | more
h1tg000001l 114415 6
h1tg000002l 1935040 3
h1tg000003l 485308 3
h1tg000004l 113763 0
h1tg000005l 54120 0
h1tg000006l 82359 0
h1tg000007l 3376377 2
h1tg000008l 505683 2
h1tg000009l 1826044 2
h1tg000010l 4045620 2
h1tg000011l 151854 1
h1tg000012l 172642 0
h1tg000013l 75530 0
h1tg000014l 82829 0
h1tg000015l 71160 0
h1tg000016l 944815 1
h1tg000017l 357347 3
h1tg000018l 207160 8
h1tg000019l 510563 5
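
A single summary figure can be easier to compare across runs than the per-contig list. The sketch below computes a length-weighted mean of the `rd:i:` values over all `S` lines of a GFA file. It is an illustration only: `parse_segments` and `weighted_mean_depth` are hypothetical helper names, and the code assumes the `LN:i:` and `rd:i:` tags appear after the sequence field, as in the awk command above.

```python
def parse_segments(gfa_text):
    """Return (name, length, read_depth) for each S line with LN and rd tags."""
    segments = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "S" or len(fields) < 4:
            continue  # not a segment line, or no optional tags present
        tags = {}
        for tag in fields[3:]:
            parts = tag.split(":", 2)  # TAG:TYPE:VALUE
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        if "LN" in tags and "rd" in tags:
            segments.append((fields[1], int(tags["LN"]), int(tags["rd"])))
    return segments


def weighted_mean_depth(segments):
    """Return the mean read depth weighted by contig length."""
    total_length = sum(length for _, length, _ in segments)
    if total_length == 0:
        return 0.0
    return sum(length * depth for _, length, depth in segments) / total_length


# First three contigs from the listing above; "*" stands in for the sequence.
SAMPLE_GFA = (
    "S\th1tg000001l\t*\tLN:i:114415\trd:i:6\n"
    "S\th1tg000002l\t*\tLN:i:1935040\trd:i:3\n"
    "S\th1tg000003l\t*\tLN:i:485308\trd:i:3\n"
)

print(f"{weighted_mean_depth(parse_segments(SAMPLE_GFA)):.2f}")
```

Weighting by length keeps many short contigs with a depth of 0 from dominating the average.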

Have you seen this before? What could I change to improve correction or assembly?