skovaka / uncalled4

MIT License

RNA004 support #11

Open mzdravkov opened 6 months ago

mzdravkov commented 6 months ago

Hi! I saw that you've been working on adding RNA004 support on the dev branch. Do you think it's already functional enough for some preliminary testing by users?

skovaka commented 6 months ago

Yes! The dev RNA004 implementation is working pretty well. It's possible I'll refine it slightly after testing on modification data, but it already looks better than RNA002 on unmodified RNA. I added a note about it to the README.

mzdravkov commented 6 months ago

This is great! Thanks! I'll try it out over the weekend and may even be able to do some comparisons against Remora.

vopalenskyp commented 6 months ago

Hello, I am trying to use the Uncalled4 dev version for RNA004 data. I am using this command:

uncalled4 align ref.fasta.fai paths foo/pod5_pass/ --bam-in doradobasecalled.bam --bam-out foo.aligned4.bam

The program is able to identify the flowcell etc, however then I get an error:

ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

I checked the bam/sam file (large size, sequences included) and the header is as follows:

@HD VN:1.6 SO:unknown
@PG ID:basecaller PN:dorado VN:0.5.2+7969fab CL:dorado basecaller -x cpu --emit-moves --emit-sam fast /mnt/data/PV256/pv256/20240110_1119_MC-111221_FAX71856_9d4ea012/pod5_pass
@PG ID:samtools PN:samtools PP:basecaller VN:1.13 CL:samtools view -H PV256_FAST.sam
@RG ID:49b5d364678660a4f3c96fc45ef24d1dc53321aa_rna004_130bps_fast@v3.0.1 PU:FAX71856 PM:MC-111221 DT:2024-01-10T11:20:10.328+00:00 PL:ONT DS:basecall_model=rna004_130bps_fast@v3.0.1 runid=49b5d364678660a4f3c96fc45ef24d1dc53321aa LB:pv256 SM:pv256

Do you have any idea how to fix this?

Thank you very much, Pavel

skovaka commented 6 months ago

Does your BAM file contain aligned reads, or are all reads unaligned? It looks like you didn't include a --reference flag for Dorado, which is required to align during basecalling. This exact error is caused by missing @SQ headers, which are only included in aligned BAMs. I realize I wasn't explicit about aligning during basecalling in the README, so I just expanded the "Overview" README section, where you can also find instructions to re-align your reads while preserving "move" tags.
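
For anyone hitting the same error, one way to check for it up front is to look for @SQ lines in the header text (for example, the output of samtools view -H). Below is a minimal sketch in plain Python, assuming the header is tab-separated as in a real SAM file. The function name reference_names and the example headers are made up for illustration and are not part of Uncalled4 or Dorado.

```python
def reference_names(header_text):
    """Return the reference names declared by @SQ lines in SAM header text."""
    names = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue  # skip @HD, @PG, @RG and any other header records
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


# Header resembling unaligned basecaller output: no @SQ lines at all.
unaligned = "@HD\tVN:1.6\tSO:unknown\n@PG\tID:basecaller\tPN:dorado\n"
# Hypothetical aligned header; "chr1" and its length are invented for this example.
aligned = unaligned + "@SQ\tSN:chr1\tLN:1000\n"

print(reference_names(unaligned))  # empty list: reads were never aligned
print(reference_names(aligned))    # one reference name
```

An empty result means the file carries no reference sequences, which matches the "file has no sequences defined" error reported above.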

vopalenskyp commented 6 months ago

You are right, indeed, I did not use a .sam file aligned to a reference. Now I am using another .sam that was aligned to the reference during Dorado basecalling, and uncalled4 seems to be processing and generating its output .bam file as it should. Great! Thank you so much also for including it in the documentation!