skovaka / uncalled4

Uncalled4 to m6Anet dataprep #28

Open ssomalra opened 3 months ago

ssomalra commented 3 months ago

Hello,

I successfully ran uncalled4 align using the --eventalign-out parameter to obtain signal alignments. I want to use this output file as input for m6Anet dataprep, but the output files from m6Anet dataprep are empty.

I am uncertain if I am using the correct approach to transition from uncalled4 to m6Anet dataprep and would appreciate some clarification.

Here are the commands and outputs I am using:

Uncalled4:

 uncalled4 align GCF_000002765.6_GCA_000002765_genomic_plasmo.fna pod5/ --bam-in dorado_out/dorado.sorted.bam --eventalign-out uncalled_eventalign.txt

I should note that I ran Dorado with the --emit-moves, --emit-sam, and --reference parameters.

Here are some lines from the uncalled4 output:

[Screenshot: sample lines from the Uncalled4 eventalign output]

I have attached a portion of the Uncalled4 output file: uncalled_eventalign.txt

m6Anet dataprep:

m6anet dataprep --eventalign uncalled_eventalign.txt --out_dir m6Anet_uncalled/ --n_processes 4

The screenshot below shows the number of lines in the output files from m6Anet dataprep:

[Screenshot: line counts of the m6Anet dataprep output files]

Any help would be greatly appreciated. Thanks in advance!

skovaka commented 3 months ago

I think you need to add --eventalign-flags signal-index,samples. However, I actually implemented a much quicker alternative to m6Anet dataprep! Try this:

uncalled4 align ... --bam-out out.bam
samtools sort out.bam -o sorted.bam
uncalled4 convert --bam-in sorted.bam --m6anet-out dataprep
m6anet inference --input_dir dataprep --model_state_dict m6anet_model/pr_auc.pr --norm_path m6anet_model/norm_dict_nanopolish.joblib  ...

Using the Uncalled4 trained m6anet_model provided here: https://figshare.com/articles/dataset/Uncalled4_Supplemental_Data/25336195/1
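
For anyone who stays on the eventalign route, a quick way to see whether a file carries the signal-index columns is to inspect its header line. The sketch below is only illustrative: it assumes the file is tab-separated with a header row, and that Uncalled4 names the columns the way `nanopolish eventalign --signal-index` does (`start_idx`, `end_idx`). The function names are hypothetical helpers, not part of Uncalled4 or m6Anet.

```python
# Column names written by `nanopolish eventalign --signal-index`.
# Assumption: Uncalled4's eventalign output uses the same names.
SIGNAL_INDEX_COLUMNS = ("start_idx", "end_idx")


def missing_signal_index_columns(header_line):
    """Return the signal-index columns absent from a tab-separated header line."""
    columns = header_line.rstrip("\n").split("\t")
    return [name for name in SIGNAL_INDEX_COLUMNS if name not in columns]


def check_eventalign(path):
    """Read only the first line of an eventalign file and report missing columns."""
    with open(path) as handle:
        return missing_signal_index_columns(handle.readline())
```

Calling `check_eventalign("uncalled_eventalign.txt")` returns an empty list when both columns are present; a non-empty list suggests the file was written without the signal-index flag.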

ssomalra commented 3 months ago

Thank you for the clarification!

When I run the uncalled4 convert command, I encounter the following error:

[Screenshot: error message from uncalled4 convert]

Do you know how I could resolve this?

skovaka commented 3 months ago

What sequencing chemistry are you using? We only support RNA001 or RNA002 for m6Anet, since I have not trained an m6Anet model for RNA004. This problem could be caused by the RNA004 model using 9-mers, while m6Anet expects 5-mers. If it's not RNA004, then it could be an issue with ambiguous nucleotides (e.g. "N"s), which I could fix if needed.
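
Both possible causes named here can be checked from the data itself. The following is a minimal sketch, assuming a tab-separated eventalign file with a header row and a nanopolish-style `reference_kmer` column (whether Uncalled4 uses exactly that column name is an assumption). `kmer_length_counts` and `ambiguous_bases` are hypothetical helper names introduced for illustration.

```python
from collections import Counter


def kmer_length_counts(lines, column="reference_kmer"):
    """Count k-mer lengths in one column of tab-separated eventalign lines.

    `lines` is any iterable of strings whose first item is the header.
    The default column name follows nanopolish and is an assumption here.
    """
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    idx = header.index(column)
    counts = Counter()
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > idx:  # skip short or empty lines
            counts[len(fields[idx])] += 1
    return counts


def ambiguous_bases(sequence):
    """Return the sorted set of characters that are not A, C, G, T, or U."""
    return sorted(set(sequence.upper()) - set("ACGTU"))
```

Passing an open file handle to `kmer_length_counts` shows whether the k-mers are length 5 or 9, and running `ambiguous_bases` over each reference sequence shows whether characters such as "N" are present.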

ssomalra commented 3 months ago

Oh okay, that makes sense. Yes, I am using RNA004.