nanoporetech / ont_fast5_api

Oxford Nanopore Technologies fast5 API software
Other
149 stars 29 forks source link

Write reads to a multi fast5? #19

Closed mrvollger closed 5 years ago

mrvollger commented 5 years ago

I am looking to use your API to take an FOFN (file of file names) of fast5 files and then turn them into a single multifast5.

From you examples I can figure out how to iterate over the reads, but I am unsure how to write them to a multifast5. I tried reading the source code but I have not been able to figure out how reads are written to a fast5.

Any help would be appreciated. Below is some of the code I have been playing with:

#!/usr/bin/env python
from ont_fast5_api import __version__
from ont_fast5_api.conversion_tools.conversion_utils import get_fast5_file_list, batcher, get_progress_bar
from ont_fast5_api.fast5_file import Fast5File
from ont_fast5_api.multi_fast5 import MultiFast5File
from ont_fast5_api.fast5_interface import get_fast5_file
import sys
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-i","--fofn", help="input fofn of fast5s")
parser.add_argument("-o", "--out", help="out fast5")
args = parser.parse_args()

seen = set()
with MultiFast5File(args.out, 'a') as multi_f5:
        for idx, fast5 in enumerate(open(args.fofn)):
                fast5=fast5.strip()
                with get_fast5_file(fast5, mode="r") as f5:
                        for read_id in f5.get_read_ids():
                                if(read_id in seen):
                                        continue

                                #SOME HOW WRITE READ TO "multi_f5"

                                #read_name = "read_" + read_id
                                #group = f5.handle[read_name]
                                #multi_f5.handle.copy(group, read_name)

                                seen.add(read_id)

                sys.stderr.write("\rDone with {} fast5 files.format(idx))
sys.stderr.write("\n")
wdecoster commented 5 years ago

Isn't that what fast5_subset does? https://github.com/nanoporetech/ont_fast5_api/blob/master/README.rst#fast5_subset

mrvollger commented 5 years ago

Not exactly, I have a list of fast5s (not a list of read ids) that I want to combine into a single fast5, and some of those fast5s may be single_fast5s and others may be multi_fast5s. Unless I am misunderstanding what subset can do.

fbrennen commented 5 years ago

fast5_subset almost does what you're after -- it takes a list of read_ids and a folder of fast5 files (either single or multi, or a mix), and then extracts the appropriate reads from them into a new set of multi-read fast5 files (with any specific number of reads per file, so you could just put a giant number here if you want to concatenate everything together). So if you're happy to start by compiling a list of all the read_ids in the files you've got then you could jury-rig fast5_subset to do this:

fast5_subset -i <input_folder> -s <output_folder> --batch_size 9999999 [etc]

If you do want to write your own tool, then I'd suggest looking at the bits of mulit_fast5_subset that copy reads from files -- you've already done most of the work getting the read_ids:

https://github.com/nanoporetech/ont_fast5_api/blob/master/ont_fast5_api/conversion_tools/multi_fast5_subset.py#L217

We've got an update to the project that should come out today (v1.4.7) that splits out a few of these steps and is a bit easier to read.

mrvollger commented 5 years ago

Great, I will check it out once it has been updated. Thanks!

Also when I try to click that link I get a 404 error. Is there another file you meant to reference, or will this file be part of the new release?

fbrennen commented 5 years ago

Ok, 1.4.7 is up (that ruined the above link, sorry! I have swapped to correctly using permalinks) -- here is the bit of fast5_subset that copies reads into multi-read files:

https://github.com/nanoporetech/ont_fast5_api/blob/0b40b10e33b91b5e24202023cc9b4826093e4b38/ont_fast5_api/conversion_tools/fast5_subset.py#L226

And the bit you're after is probably going to be in read_generator() (which takes those read_ids and turns them into Fast5Read objects):

https://github.com/nanoporetech/ont_fast5_api/blob/0b40b10e33b91b5e24202023cc9b4826093e4b38/ont_fast5_api/conversion_tools/fast5_subset.py#L257

Give it a try and let me know how it goes.

behrimg commented 5 years ago

I'm trying to use either the fast5_subset or single_to_multi_fast5 and I keep getting this error anytime I try and run one of the console scripts:

ValueError: mpi4py.MPI.Status size changed, may indicate binary incompatibility. Expected 48 from C header, got 40 from PyObject

these are the versions of my dependancies...do i need to be running a specific version of openmpi?

ont-fast5-api (1.4.7) progressbar33 (2.4) six (1.12.0) numpy (1.17.0) h5py (2.9.0)

mrvollger commented 5 years ago

@fbrennen thanks this looks like it should be enough to get me started. If I get a working code snippet I will post it. For now I think I can close this.