jts / nanopolish

Signal-level algorithms for MinION data
MIT License
550 stars 160 forks source link

Reads number of nanopolish output #1132

Closed guangzhaocs closed 2 months ago

guangzhaocs commented 3 months ago

Hi,

My orginal fast5 reads number is 66736. But the logs of eventalign and ployA shows the number of total reads is 104358. I want to konw why it is.

Here are some log information:

un_modified_rep_1
[readdb] indexing /scratch/cs/infantbiome/chengg1/datasets/epinano/1_fast5/un_modified_rep_1/multi_fast5
[readdb] num reads: 66736, num reads with path to fast5: 66736
 *** nanopolish index is over.
[M::mm_idx_gen::0.007*0.66] collected minimizers
[M::mm_idx_gen::0.008*0.70] sorted minimizers
[M::main::0.008*0.70] loaded/built the index for 4 target sequence(s)
[M::mm_mapopt_update::0.008*0.71] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 4
[M::mm_idx_stat::0.008*0.72] distinct minimizers: 1883 (98.94% are singletons); average occurrences: 1.014; average spacing: 5.309; total length: 10135
[M::worker_pipeline::11.949*1.75] mapped 66736 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -ax map-ont --MD -t 3 --secondary=no /scratch/cs/infantbiome/chengg1/datasets/epinano/0_reference/cc.fasta un_modified_rep_1.fastq
[M::main] Real time: 11.950 sec; CPU: 20.939 sec; Peak RSS: 0.182 GB
 *** minimap2 is over.
 *** samtools is over.
[post-run summary] total reads: 104358, unparseable: 0, qc fail: 3343, could not calibrate: 973, no alignment: 41, bad fast5: 0
 *** nanopolish eventalign is over.
[post-run summary] total reads: 104358, unparseable: 0, qc fail: 3343, could not calibrate: 973, no alignment: 41, bad fast5: 0
 *** polyA is over.

From the log info, we can see that I have 66736 reads. The logs of eventalign and ployA show that the total reads are 104358.

But for the results, I count the lines of ployA output file (un_modified_rep_1-polya.tsv), and there are 52180 lines.

$ wc -l un_modified_rep_1-polya.tsv
52180 un_modified_rep_1-polya.tsv

The first line of un_modified_rep_1-polya.tsv is the header. So in this file, there are 52180-1=52179 reads. And 52179*2=104358, which is the number in the logs.

I guess one read was processed two times in the eventalign and ployA commands. I want to know whether my guess is correct and why.

Thanks a lot!

jts commented 3 months ago

Yes, there was a double-counting bug in the log message but it should be fixed in the latest version from github. Did you install from conda?

guangzhaocs commented 3 months ago

Yes, previously I installed it from conda. And I re-install the latest version from GitHub, now it is normal. Thanks a lot.