2x slowdown with modified RNA basecalling

Hi, dorado v0.5.3 becomes slow with the official m6A_DRACH model (--modified-bases-models rna004_130bps_sup@v3.0.1_m6A_DRACH@v1) and uses the CPU heavily (load over 10 on a 16-core CPU) instead of the GPU!

[▉                             ] 2% [04h:18m:18s<05d:19h:11m:55s]

Running the same model rna004_130bps_sup@v3.0.1 without m6A_DRACH detection is ~2x faster and uses less CPU (load 3).

[█▌                            ] 5% [04h:09m:03s<03d:06h:52m:05s]

So the difference seems to be that the new modification-calling algorithm runs on the CPU instead of the GPU.

I/O isn't the problem: the pod5 files are read from local disk, the disk is hardly accessed, and GPU use is low (2x RTX 3080 Ti). Could this be improved somehow?

My command lines:

~/src/dorado-0.5.3-linux-x64/bin/dorado basecaller rna004_130bps_sup@v3.0.1 --modified-bases-models rna004_130bps_sup@v3.0.1_m6A_DRACH@v1 --estimate-poly-a --emit-sam --emit-moves --reference $ref $f
~/src/dorado-0.5.3-linux-x64/bin/dorado basecaller rna004_130bps_sup@v3.0.1 --estimate-poly-a --emit-sam --emit-moves --reference $ref $f

It's similar to #343, but for RNA, using pod5, and with the most recent dorado version.
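
As a rough check of the ~2x figure, the two progress lines above can be turned into projected total run times. This is only a sketch: it assumes the bracketed field reads [elapsed<remaining], which is my reading of the progress bar and not something I have confirmed, and `to_seconds` is a helper written for this post, not part of dorado.

```shell
#!/bin/sh
# Convert a progress-bar time such as "05d:19h:11m:55s" to seconds.
# Hypothetical helper for this comparison; not a dorado command.
to_seconds() {
    total=0
    for part in $(printf '%s' "$1" | tr ':' ' '); do
        unit=${part#"${part%?}"}   # last character: d, h, m or s
        value=${part%?}            # digits before the unit
        value=${value#0}           # drop one leading zero so "08" is not parsed as octal
        value=${value:-0}          # "0" became empty above; restore it
        case $unit in
            d) total=$((total + value * 86400)) ;;
            h) total=$((total + value * 3600)) ;;
            m) total=$((total + value * 60)) ;;
            s) total=$((total + value)) ;;
        esac
    done
    echo "$total"
}

# Assumption: the field is [elapsed<remaining], so total = elapsed + remaining.
with_mods=$(( $(to_seconds 04h:18m:18s) + $(to_seconds 05d:19h:11m:55s) ))
without_mods=$(( $(to_seconds 04h:09m:03s) + $(to_seconds 03d:06h:52m:05s) ))

echo "projected total with m6A_DRACH:    ${with_mods}s"
echo "projected total without m6A_DRACH: ${without_mods}s"
awk -v a="$with_mods" -v b="$without_mods" 'BEGIN { printf "slowdown factor: %.2f\n", a / b }'
```

The ETA early in a run is noisy, so the printed factor is only an estimate of the slowdown.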

nanoporetech/dorado, issue #644