Failed to use R9.4 kmer table

Modernism-01 commented 8 months ago

Hi, remora developers,

When I attempted to use the Kmer model for the R9.4 DNA ONT sample, I had familiar trouble with the command parameter:

remora dataset prepare ----refine-kmer-level-table

I downloaded the file from the website: https://github.com/nanoporetech/kmer_models/blob/master/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model

But I got the error information. according to the below parameter, I don't know how to transform the legacy model into an applicable form.

--refine-kmer-level-table REFINE_KMER_LEVEL_TABLE
Tab-delimited file containing no header and two fields: 1. string k-mer sequence and 2. float expected normalized level. 
All k-mers must be the same length and all combinations of the bases 'ACGT' must be present in the file. (default: None)

Any suggestions would be appreciated! Thank you.

Best,

Lingsen

Modernism-01 commented 8 months ago

Or can you tell me how to calculate the nomorlized level according to the legacy table ?

kmer    level_mean  level_stdv  sd_mean sd_stdv weight
AAAAAA  86.486336   1.517846    0.941478    0.609357    4739.559092
AAAAAC  83.948838   1.517846    1.077051    0.745608    3403.762207
AAAAAG  85.475368   1.517846    0.953434    0.621001    3553.658043
AAAAAT  84.423907   1.517846    1.106077    0.775951    3909.663587
AAAACA  77.097281   1.705015    1.296857    0.936213    2630.415221
AAAACC  75.706298   1.705015    1.319392    0.960721    3098.573146
AAAACG  76.928240   1.705015    1.302069    0.941863    3267.614279

marcus1487 commented 8 months ago

The normalized levels are not required. If you remove the header line and only supply the first two fields this should work. The table will automatically be normalized to standard units. If this does not work can you post the beginning of this file and the error you are seeing?

Modernism-01 commented 8 months ago

Thank you for your positive reply. I have obtained the first two fields of kmer table as the input of the --refine-kmer-level-table parameter:

AAAAAA  86.486336
AAAAAC  83.948838
AAAAAG  85.475368
AAAAAT  84.423907
AAAACA  77.097281
AAAACC  75.706298
AAAACG  76.928240
AAAACT  76.811415

and this is the running command:

remora dataset prepare \
  $pod5 $bam \
  --output-path $output \
  --refine-kmer-level-table $kmer \
  --refine-rough-rescale \
  --motif CG 0 \
  --mod-base m 5mC

However, this command does not report an error and end when calculating the test sample (1.2 G pod5 files), which has been running for over 6 hours. Now I stopped running command. The log file is shown as follows:

Indexing BAM by parent read id: 11290 Reads [00:08, 1276.33 Reads/s]
[13:04:01.558] Extracting read IDs from POD5
[13:04:02.742] Found 8,007 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[13:04:02.743] Making reference-anchored training data
[13:04:02.743] Opening dataset for output
[13:04:03.239] Processing reads
Extracting chunks:   0%|          | 0/8000 [00:00<?, ? Reads/s]slurmstepd: error: *** JOB 12940895 ON comput7 CANCELLED AT 2023-12-24T19:18:33 ***

under the output dictionary, there are several below files:

-rw-r--r-- 1 zenglingsen yi 939K 12月 24 13:04 extra_read_focus_bases.npy
-rw-r--r-- 1 zenglingsen yi  17M 12月 24 13:04 extra_read_ids.npy
-rw-r--r-- 1 zenglingsen yi  17K 12月 24 13:04 kmer_table.npy
-rw-r--r-- 1 zenglingsen yi 939K 12月 24 13:04 labels.npy
-rw-r--r-- 1 zenglingsen yi 345K 12月 24 13:07 log.txt
-rw-r--r-- 1 zenglingsen yi  770 12月 24 13:04 metadata.jsn
-rw-r--r-- 1 zenglingsen yi 235K 12月 24 13:04 sequence_lengths.npy
-rw-r--r-- 1 zenglingsen yi  11M 12月 24 13:04 sequence.npy
-rw-r--r-- 1 zenglingsen yi 19M 12月 24 13:04 sequence_to_signal_mapping.npy
-rw-r--r-- 1 zenglingsen yi 184M 12月 24 13:04 signal.npy

this is the part of contents in log.txt :

DEBUG [13:03:51.709:MainProcess:MainThread:log.py:68] Command: """/public/home/zenglingsen/04.software/02.Anaconda/Or/envs/guppy_env/bin/remora dataset prepare /public/home/zenglingsen/03.test/04.dorado/01.pod5/output.pod5 /public/home/zenglingsen/03.test/05.deepmod2/04.dorado_basecalling/test.sorted.bam --output-path /public/home/zenglingsen/03.test/06.remora/01.datatset/mod_chunks --refine-kmer-level-table /public/home/zenglingsen/04.software/13.models.libraries/dna_r9.4.1_e8.2_400bps_normalized_level.txt --refine-rough-rescale --motif CG 0 --mod-base m 5mC"""
DEBUG [13:03:51.788:MainProcess:MainThread:refine_signal_map.py:282] K-mer index stats:
        0            11.54
        1            89.21
        2          3164.64
        3           342.65
        4           157.19
        5            16.10

DEBUG [13:03:51.789:MainProcess:MainThread:refine_signal_map.py:283] Choosen central position: 2
DEBUG [13:03:52.606:MainProcess:MainThread:io.py:284] 6c89bf82-e862-4d9e-80e6-628900e84c68 not primary
DEBUG [13:03:52.642:MainProcess:MainThread:io.py:284] cb104cc7-efbb-4bef-af78-764c21d5199a not primary
DEBUG [13:03:52.643:MainProcess:MainThread:io.py:284] cb104cc7-efbb-4bef-af78-764c21d5199a not primary

In addition, I use an alternative kmer table for R10.4 https://github.com/nanoporetech/kmer_models/blob/master/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt, despite it is not suitable for my samples. But it can be successfully finished in ten minutes.

Indexing BAM by parent read id: 11290 Reads [00:06, 1756.71 Reads/s]
[14:50:57.353] Extracting read IDs from POD5
[14:51:02.120] Found 8,007 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[14:51:02.486] Making reference-anchored training data
[14:51:02.486] Opening dataset for output
[14:51:04.141] Processing reads
Extracting chunks: 100%|██████████| 8000/8000 [06:35<00:00, 20.25 Reads/s]  
[14:57:39.942] Unsuccessful read/chunk reasons:
  2,295 : Sequence too long                                                               
    104 : No reference sequence (missing MD tag)                                          
     45 : Move table discordant with signal                                               
[14:57:41.454] Extracted 111,482 chunks from 8,007 reads.
[14:57:41.457] Label distribution: control:0; 5mC:111,482
[14:57:41.457] Shuffling dataset
[14:57:42.888] Done

Now I do not know how to use remora correctly, I hope you can give me more suggestions about this question.

Modernism-01 commented 8 months ago

I have finished the remora dataset prepare procedure. Thank you.

nanoporetech / remora

Failed to use R9.4 kmer table #149