Ensuring File Integrity and Accuracy in Processing Custom Genome blow5 Files

KDalcin commented 1 month ago

Hello,

I apologise if this is not the right place for this type of enquiry. I am completing my master's in diagnostic genomics and have some concerns and questions that I wanted to raise.

I am working on a project involving a custom human reference genome into which a whole virus genome has been inserted at known arbitrary locations. This custom genome was then used with squigulator to produce blow5 files, simulating nanopore reads.

Planned Workflow:

1. Split the blow5 files into smaller segments containing 4000 reads each using slow5tools split.
2. Convert each segment from blow5 to fast5 using slow5tools.
3. Basecall using the EPI2ME workflow with Dorado to generate a fastq file.
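For reference, the split and conversion steps above could be driven from Python roughly as follows. This is only a sketch: the file and directory names are placeholders, and the flags used (`-d` for the output directory, `-r` for reads per file) should be checked against `slow5tools split --help` and `slow5tools s2f --help` for the installed version.

```python
import shlex
import subprocess


def split_cmd(blow5_path, out_dir, reads_per_file=4000):
    """Build the command that splits one BLOW5 file into fixed-size chunks."""
    return ["slow5tools", "split", blow5_path, "-d", out_dir, "-r", str(reads_per_file)]


def s2f_cmd(blow5_dir, fast5_dir):
    """Build the command that converts a directory of BLOW5 chunks to FAST5."""
    return ["slow5tools", "s2f", blow5_dir, "-d", fast5_dir]


def run(cmd):
    """Run one step and stop on a non-zero exit status (not called here)."""
    print(shlex.join(cmd))
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Placeholder paths, for illustration only.
steps = [
    split_cmd("simulated.blow5", "split_blow5"),
    s2f_cmd("split_blow5", "split_fast5"),
]
for step in steps:
    print(shlex.join(step))
```

Using `check=True` means a failed step stops the pipeline instead of silently passing a partial output on to the next transformation.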

Concerns:

1. File Integrity: There are multiple file transformations in this workflow. I am concerned about the potential for data loss or corruption during file splitting and conversion.
2. Accuracy of fastq Files: Given the complexity of the custom genome (human + virus), ensuring that the basecalling accurately reflects the original sequence data is crucial.

Questions:

Is this approach problematic for file integrity? Are there recommended practices or additional precautions I should take to ensure data integrity throughout these transformations?

Thank you kindly for any thoughts or assistance you can provide,

Kara

hasindu2008 commented 1 month ago

Hello,

Splitting and conversion have been tested on many datasets and are stable. As long as you do not ignore any warnings or errors during the process, the files will not be corrupted.
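For extra reassurance, one simple check is that every read ID in the original file appears exactly once across the split files. The sketch below assumes the read IDs have already been extracted into Python lists (for example with `slow5tools skim --rid` or the pyslow5 library; check either against its own documentation). It compares IDs only and does not inspect the signal data itself.

```python
from collections import Counter


def check_split(original_ids, chunks):
    """Compare the original file's read IDs with those found in its chunks.

    original_ids: iterable of read IDs from the unsplit file
    chunks: iterable of iterables, one list of read IDs per split file
    """
    original = set(original_ids)
    combined = Counter(rid for chunk in chunks for rid in chunk)
    return {
        # In the original file but in none of the chunks.
        "missing": sorted(original - set(combined)),
        # In a chunk but not in the original file.
        "unexpected": sorted(set(combined) - original),
        # Present more than once across all chunks.
        "duplicated": sorted(rid for rid, n in combined.items() if n > 1),
    }


def is_intact(report):
    """True when no read is missing, unexpected, or duplicated."""
    return not any(report.values())


# Toy example with made-up read IDs.
report = check_split(["r1", "r2", "r3"], [["r1", "r2"], ["r3"]])
print(report, is_intact(report))
```

The same comparison can be repeated after basecalling by treating the read IDs in the fastq headers as a single "chunk", keeping in mind that a basecaller may legitimately drop or split some reads.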

We have not used EPI2ME; instead, we simply use buttery-eel (https://github.com/Psy-Fer/buttery-eel), a wrapper around ONT's dorado-server that can read SLOW5 files directly. This way you can avoid any conversions if you wish.

May I ask what you plan to do with the basecalled output of the BLOW5 files from squigulator? Rather than file integrity, I would be more concerned about the differences between simulated and real data. Squigulator simulates nanopore data, but simulated data is never going to be identical to real data. Its intended use is to test and debug your pipelines initially; once those are sorted, you should test with real data, especially for diagnostic purposes. Simulated data can assist in developing and testing your workflows, but it is not meant to replace real data. What do you plan to do with the basecalled output in the downstream analysis?

Psy-Fer commented 1 month ago

Just to add: if you are still going to convert the data to a format that works with EPI2ME, might I suggest converting to pod5 rather than fast5? Performance and data handling will be a lot smoother (and faster). We still recommend slow5 over both of those formats, of course.

To convert to pod5 from slow5 you can use a tool I wrote called blue-crab https://github.com/Psy-Fer/blue-crab

That should handle the conversion for you.
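As a rough sketch of what a per-chunk conversion could look like, the helper below builds one blue-crab command per BLOW5 file. The `s2p` subcommand and `-o` option are given as understood from the blue-crab README and should be confirmed with `blue-crab s2p --help`; the file and directory names are placeholders.

```python
import shlex
from pathlib import Path


def s2p_cmd(blow5_path, pod5_dir):
    """Build a blue-crab command converting one BLOW5 file to POD5.

    The output keeps the input's base name and swaps the extension.
    """
    blow5 = Path(blow5_path)
    pod5 = Path(pod5_dir) / (blow5.stem + ".pod5")
    return ["blue-crab", "s2p", str(blow5), "-o", str(pod5)]


# Placeholder chunk names, for illustration only.
chunks = ["split_blow5/chunk_0.blow5", "split_blow5/chunk_1.blow5"]
for chunk in chunks:
    print(shlex.join(s2p_cmd(chunk, "split_pod5")))
```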

Some more information on what you are trying to achieve will allow Hasindu and me to help you get what you need.

James

KDalcin commented 1 month ago

Hello,

Thank you for the detailed responses and the insights into file handling and data integrity. It's reassuring to know that this idea should work. I will ensure that all warnings and errors are carefully monitored during these steps.

I appreciate the recommendation to use the buttery-eel wrapper to work directly with SLOW5 files. Also, thank you for suggesting the conversion to pod5 using blue-crab for enhanced performance. I will look into this as an option to optimise the data handling within my pipeline.

Regarding the use of simulated data from Squigulator, the intent behind this approach is strictly for the development and initial testing of my pipeline. Once the pipeline is robust and functioning as intended, I plan to validate it with real sequencing data to ensure its applicability and accuracy in a diagnostic context.

Your feedback has been immensely helpful, and I'm grateful for the guidance provided.

All the best,

Kara

Psy-Fer commented 1 month ago

Hey Kara,

You can get most of the info about buttery-eel in the readme: https://github.com/Psy-Fer/buttery-eel?tab=readme-ov-file#usage

There are some "hidden" arguments, which I try to cover in the docs, that are really arguments for guppy/dorado-server. Basically, any argument that buttery-eel does not recognise is passed along to dorado-server, which checks whether it is one of its own. --use_tcp is one example, and you will most likely want to use it.
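The pass-through idea can be illustrated with Python's `argparse.parse_known_args`, which separates the options a wrapper knows from everything else. This is a generic sketch of the pattern, not buttery-eel's actual code: the wrapper name, its `-i`/`-o` options, and `--server_opt` are hypothetical, and only `--use_tcp` comes from the discussion above.

```python
import argparse


def split_args(argv):
    """Separate the wrapper's own options from those meant for the server."""
    parser = argparse.ArgumentParser(prog="wrapper-sketch", allow_abbrev=False)
    # Hypothetical wrapper options, for illustration only.
    parser.add_argument("-i", "--input", required=True)
    parser.add_argument("-o", "--output", required=True)
    # Unrecognised arguments are returned in a list instead of raising an error.
    known, passthrough = parser.parse_known_args(argv)
    return known, passthrough


known, passthrough = split_args(
    ["-i", "reads.blow5", "-o", "reads.fastq", "--use_tcp", "--server_opt", "value"]
)
print(known.input, known.output)
print(passthrough)  # the list a wrapper could hand on to the basecall server
```

A consequence of this design is that a mistyped wrapper option is not rejected by the wrapper itself; it travels on to the server, so the server's error messages are worth reading when a flag appears to have no effect.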

If you have any troubles, please create an issue over there and I'll help you out. Same goes for blue-crab for conversions.

Cheers, James

hasindu2008 commented 1 week ago

Closing this issue for now. If you have any more questions, feel free to open a new issue.

hasindu2008/slow5tools, issue #111