skovaka / uncalled4

MIT License
43 stars 3 forks source link

Seg fault in align #4

Closed cnk113 closed 7 months ago

cnk113 commented 8 months ago

Hi,

I'm running Uncalled4 with direct RNA data from dorado basecalled and aligned reads. Command:

uncalled4 align --bam-in filtered.bam --rna -o uc4.bam ../../ref.fasta fast5_pass

However I get a seg fault (near the end?) of the alignment:

Using model `rna_r9.4.1_70bps_5mer`
Warning: failed to open read d6f....
Segmentation fault (core dumped)

I removed the read name for brevity, I assume it's due to the missing read; would there be a possibility to just skip missing reads?

Thanks, Chang

skovaka commented 8 months ago

You're right, it looks like because of the missing read, which should not result in a segmentation fault. I just added a check for that and pushed it, it should now just output the warning and continue

cnk113 commented 8 months ago

Hmm I downloaded the latest push but I still get segmentation fault (core dumped)... I get two reads skipped and then the seg fault.

skovaka commented 8 months ago

Sorry for the issues! Are any alignments successfully output? And did you basecall with the --emit-moves option? That option should add "mv:" and "ts:" tags to the BAM file. If those tags are missing from any reads it could cause a problem, although it should just print a warning and not segfault.

Also, was your data sequenced with RNA002 or RNA004? We currently only support RNA002, but hoping to add 004 support soon. You could try removing the --rna flag since Uncalled4 can auto-detect the chemistry in most cases. The --rna flag is mainly for use with a custom pore model, and it shouldn't cause any issues but it's worth trying to remove it.

cnk113 commented 8 months ago

The run is RNA002 and when setting --rna which also shows that RNA002 is being auto-detected. Unsetting --rna still results in the same issue. I'm going from dorado with --emit-moves and for alignment directly by setting --reference I see the tags mv and ts in the bam file.

I'm trying on a different run and now I getting an error on a missing read_index file, I was wondering if it possible to run it w/o it? Since the dataset I'm trying is from a publication I don't think I can get the read_index file.

skovaka commented 8 months ago

The read_index should be built automatically, what is the error you're getting? If you don't have write permission to your FAST5 directory then that could cause an error, which I could add a workaround for. You could also input a "sequencing_summary.txt" to the "--read-index" flag which can serve as a read index.

In your initial attempt, does it fail after aligning some reads, or do all reads fail? If all reads fail, is it possible for you to share a small example FAST5 and BAM file for me to test? Otherwise, could you just send me the BAM entry for the read which is failing? Sorry again for the trouble

cnk113 commented 8 months ago

So I've resolved the read-index issue.

I've attached a test fast5 as well as the bam file here. test.tar.gz Thanks for the quick response!

cnk113 commented 8 months ago

I've done it using the example dataset from the repo and it works, however for the browser plotting can we forward the host rather than have it hardcoded 127.0.0.1?

skovaka commented 8 months ago

You can change the port via the --port option. To access remotely I generally use SSH port forwarding, for example to connect local port 8000 to remote port 8000: ssh -NL 8000:localhost:8000 <username>@<server.edu>

skovaka commented 8 months ago

And I should be able take a look at your test dataset later this week! Thank you for sharing

cnk113 commented 8 months ago

Sorry I had one more request, could you perhaps have an example of trackplots with your test dataset?

Thanks!

skovaka commented 8 months ago

Could you also share the reference FASTA corresponding to your test dataset? I couldn't find anything suspicious from what you gave me, but can't run uncalled4 without the reference. I did notice that in your original post you said the warning was failed to open read d6f...., but there is no read starting with "d6f" in your test dataset. If you could share your exact error output for that test dataset that would be helpful

I also just pushed an update which will output a slightly more detailed error message for the "failed to open read" issue, which could help narrow it down. I also included a trackplot and dotplot example, along with bug fixes to those commands

cnk113 commented 8 months ago

Here is the ref file. ref.fa.tar.gz

The file that had the original I can't share due to it being corporate IP, sorry. The one shared is from a recent paper we are trying to reproduce that had the same issue essentially.

skovaka commented 7 months ago

Thanks again for sharing. It turns out the skipped reads and the segmentation fault are two separate errors:

The skipped reads were because I had not considered "split reads" recently introduced in Dorado, now helpfully fixed by @Han-ju https://github.com/skovaka/uncalled4/pull/8

The segfault appears to be caused by a bug in the latest version of Dorado (https://github.com/nanoporetech/dorado/issues/635). I just added an error check and a workaround in the latest commit. Now it will print warnings instead of crashing, and you can add the --zero-ts flag to correct the offending tag. It looks like an easy to fix issue for Dorado, so hopefully that flag will be unnecessary in future releases

cnk113 commented 7 months ago

Works!