hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License

Can f5c eventalign using nanopolish index? #145

Closed · kir1to455 closed this 12 months ago

kir1to455 commented 1 year ago

Hi, thank you for developing f5c! I have encountered some problems when using f5c eventalign. I have already run nanopolish index, and I found that nanopolish eventalign was too slow. Can I use f5c eventalign with the nanopolish index? That way, I won't have to waste time running f5c index again.

Could you give me some advice? Best wishes, Kiritio

hasindu2008 commented 1 year ago

Yes, the index generated by nanopolish index is compatible with f5c. You should be able to directly launch f5c eventalign on such an index. Let me know if it causes an error.
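
For reference, a minimal sketch of reusing a nanopolish-built index with f5c eventalign (file names here are placeholders, not from this thread):

```sh
# Build the index once with nanopolish (creates reads.fastq.index,
# .index.fai, .index.gzi and .index.readdb alongside the FASTQ)
nanopolish index -d fast5_dir/ reads.fastq

# f5c picks up the same index files, so no separate f5c index step is needed
f5c eventalign --reads reads.fastq \
    --bam reads.sorted.bam \
    --genome reference.fa > events.tsv
```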

kir1to455 commented 1 year ago

Hi, it seems to have run successfully. [screenshot]

However, I noticed some warnings during eventalign. Is this normal? [screenshot]

I also noticed the message "GPU got too much work. Try increasing --ultra-thresh, decreasing --cuda-max-lf, decreasing --cuda-max-epk. Else, CPU is too powerful than GPU and just ignore". Should I increase --iop? (My fast5 files have been split into separate parts.) [screenshot]

kir1to455 commented 1 year ago

Here is my code:

```sh
${f5c_dir}/f5c eventalign --reads ${fastq_dir}/vector.fq \
    --bam ${Output_dir}/vector.sorted.bam \
    --genome ${index_dir}/gencode.v43.transcripts.fa \
    -t 30 --min-mapq 0 --secondary=no --rna --signal-index \
    --scale-events -B 3M -K 1024 --iop 4 --cuda-dev-id 0 \
    --summary ${Output_dir}/f5c_nanopolish.summary.txt \
    | pigz > ${Output_dir}/f5c_vector.eventalign.tsv.gz
```

hasindu2008 commented 1 year ago

Hi,

I suggest using --iop 32 and -B 14M. If a warning/suggestion appears once or twice, it is fine. If one keeps appearing continuously, it means some parameters can be tuned for better performance.
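
Applied to the command above, that suggestion would look something like this (a sketch; the other options are kept unchanged):

```sh
# as suggested: raise --iop to 32 and the batch size -B to 14M
${f5c_dir}/f5c eventalign --reads ${fastq_dir}/vector.fq \
    --bam ${Output_dir}/vector.sorted.bam \
    --genome ${index_dir}/gencode.v43.transcripts.fa \
    -t 30 --min-mapq 0 --secondary=no --rna --signal-index \
    --scale-events -B 14M -K 1024 --iop 32 --cuda-dev-id 0 \
    --summary ${Output_dir}/f5c_nanopolish.summary.txt \
    | pigz > ${Output_dir}/f5c_vector.eventalign.tsv.gz
```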

kir1to455 commented 1 year ago

Hi, I have finished f5c eventalign. It is much faster than nanopolish. However, I encountered some errors in the last step. [screenshot] I don't know whether eventalign ran normally, or whether I can ignore the errors here.

Best wishes, Kirito

hasindu2008 commented 1 year ago

Hi

This means that a corrupted FAST5 file was detected and the program terminated early. So it is not normal, and it may not have run to completion. FAST5 files are very troublesome; they are not just slow, they also cause numerous headaches like this. Is this a publicly available dataset? If so, I can inspect it a bit.

One way to clean up this dataset and get much better performance is to convert it to BLOW5 format (which f5c supports) using slow5tools. The steps are under "Methylation calling or eventalignment using f5c" at https://hasindu2008.github.io/slow5tools/workflows.html. But given that this seems to be an old dataset, some hurdles can be expected on that path too - all solvable, though.
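
For context, the conversion in that workflow boils down to something like the following (a sketch with placeholder paths; see the linked page for the authoritative steps):

```sh
# convert each FAST5 file to BLOW5 (8 parallel processes)
slow5tools f2s fast5_dir/ -d blow5_dir/ -p 8

# merge the per-file BLOW5s into a single file
slow5tools merge blow5_dir/ -o reads.blow5 -t 8

# index and run f5c directly on the BLOW5 file
f5c index --slow5 reads.blow5 reads.fastq
f5c eventalign --slow5 reads.blow5 --reads reads.fastq \
    --bam reads.sorted.bam --genome reference.fa > events.tsv
```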

kir1to455 commented 1 year ago

Hi, @hasindu2008

Thanks for your reply! I have converted the FAST5 file to a single BLOW5 file. And I have encountered some errors when I using f5c eventalign. Here is my code: ${f5c_dir}/f5c eventalign --reads ${home_dir}/vector.fq --bam ${home_dir}/vector_V32.sorted.bam --genome ${index_dir}/gencode.v32.transcripts.fa --slow5 ${blow5_dir}/vector.blow5 -t 30 --min-mapq 0 --secondary=no --rna --signal-index --scale-events -B 14M -K 1024 --iop 32 --cuda-dev-id 0 --summary ${home_dir}/f5c_V32_nanopolish.summary.txt | pigz > ${home_dir}/f5c_V32_vector.eventalign.tsv.gz image But this doesn't seem to affect the results, can I ignore these error messages? Additionally, the Slow5 format is much better than the Fast5 format I expected.

Best wishes, Kirito

hasindu2008 commented 1 year ago

Hey,

These warnings mean that those readIDs could not be located and were thus skipped. It will not affect the rest of the reads. As long as the number of such missing reads is not excessive, you can safely ignore the warning. At the end of the log, there should be some stats such as the number of aligned reads, failed reads, etc. Could you please copy-paste that part? It is very likely the read splitting, now on by default in ONT basecallers, that is causing this, which would affect around 10% of the data at most. Can you also tell me how you basecalled the data?

I have been wanting to write a way to handle these split reads, but ONT basecallers assign a completely random readID to reads that are split, and there is no easy way to locate the original read based on the readID itself. There is a tag called parent_read_id in the FASTQ that is sometimes present and is helpful in locating those reads; however, that tag is pretty inconsistent and changes from basecaller version to basecaller version (as usual for ONT software). Given that split-read counts are not that significant currently, I haven't put effort into implementing a solution for this in f5c.
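
As a rough way to gauge how many reads in a run came from read splitting, one could inspect that tag in the FASTQ headers - a sketch, assuming the header carries a "parent_read_id=<uuid>" key-value pair (as noted above, the exact layout varies between basecaller versions, and on some versions the tag is present on every read, equal to the read's own ID when no splitting happened):

```sh
# count reads whose parent_read_id differs from their own readID,
# i.e. child reads produced by read splitting
# (header layout assumed: "@<readID> ... parent_read_id=<uuid> ...")
zcat reads.fastq.gz | awk 'NR % 4 == 1 {
    id = substr($1, 2)                    # strip the leading "@"
    for (i = 2; i <= NF; i++)
        if ($i ~ /^parent_read_id=/) {
            split($i, kv, "=")
            if (kv[2] != id) n++          # split-off child read
        }
}
END { print n + 0, "split reads" }'
```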

About FAST5: most of the effort in developing and maintaining f5c went into handling a zillion kinds of cases in FAST5, which is indeed unsustainable in the long term, especially when developing and maintaining multiple tools. So one major benefit of S/BLOW5 (apart from performance, simplicity, reliability, size, etc.) is that whenever nanopore changes something in FAST5/POD5, we only have to change the converter, not a dozen different tools, scripts and pipelines.

kir1to455 commented 1 year ago

Hi, I have finished f5c eventalign. I paste the summary below. [screenshot]

The ionic current data in each FAST5 file were basecalled using Guppy v4.2.2 with default parameters. Only reads in the pass folder were selected for subsequent analyses.

hasindu2008 commented 1 year ago

If the Guppy version is 4.2.2, it is unlikely to be read splitting. But looking at the results, it is only 24720 bad reads out of 3956411 entries, which is only 0.625%. Given that the affected read count is extremely low, I would not worry about it. But if you would like me to, I can investigate, provided the original FAST5 files are available for download.
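
(That percentage is just the bad-read count over the total entries reported in the summary:)

```sh
# bad reads as a fraction of total entries, from the summary above
awk 'BEGIN { printf "%.3f%%\n", 24720 / 3956411 * 100 }'   # prints 0.625%
```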

If you keep seeing this error consistently on other datasets, especially if that "bad reads" count gets considerable, please let me know, so I can investigate.

hasindu2008 commented 12 months ago

I will close this issue. If you have any more questions, feel free to reopen.