nanoporetech / ont_fast5_api

Oxford Nanopore Technologies fast5 API software

Issues attempting to duplicate reads #61

Closed CodingKaiser closed 2 years ago

CodingKaiser commented 2 years ago

Hi there. I am currently trying to use the API to take reads within a fast5 file and duplicate each of them 10-fold. For this purpose, I have written a small script which loads reads from a MultiFast5File and, for the desired number of duplicates, creates a new random uuid and puts a copy of the read under that new uuid into a new file. Here is the code:


from uuid import uuid4
# import paths assumed for a recent ont_fast5_api; adjust to your version
from ont_fast5_api.multi_fast5 import MultiFast5File
from ont_fast5_api.conversion_tools.conversion_utils import yield_fast5_reads

# dup_num, target_compression, input_file and output_file are set earlier in the script
id_tree = dict()
with MultiFast5File(str(output_file), 'w') as output_f5:
    reads_present = set(output_f5.get_read_ids())

    for read_id, read in yield_fast5_reads(input_file, recursive=False):
        id_tree[read_id] = []

        dup_read_id = read_id
        for _ in range(dup_num):
            # draw fresh uuids until one is not already present in the output file
            while dup_read_id == read_id or dup_read_id in reads_present:
                dup_read_id = str(uuid4())
            read.read_id = dup_read_id
            output_f5.add_existing_read(read, target_compression=target_compression)
            reads_present.add(dup_read_id)
            id_tree[read_id].append(dup_read_id)
Because I want files with batches of 4000 reads, I then run multi_to_single_fast5.py on the resulting files, followed by single_to_multi_fast5.py. However, this last step throws many error messages, all of the form:

ERROR:ont_fast5_api.conversion_tools.single_to_multi_fast5:Unable to create group (name already exists)
    Failed to add single read file: './hp_fast5_split/6/a26bf700-6d3d-4df0-a7ad-39a34f424e58.fast5' to './hp_fast5_batch/batch_3.fast5'

As far as I understand it, each time add_existing_read is called, a new group name should be created using the new read_id. Judging by this error message, however, that doesn't seem to be happening. While I could just adjust my script to do the batching myself, the error message makes me unsure whether I am correctly making copies of the individual reads.
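
For reference, a quick way to inspect what actually got written would be something like the following (the path is a placeholder for my upsampled output, and it assumes the standard multi-read layout in which each read_<uuid> group has a Raw subgroup carrying a read_id attribute):

import h5py

upsampled = "hp_reads0.upsample.fast5"  # placeholder path for the upsampled file

with h5py.File(upsampled, "r") as f5:
    for group_name in sorted(f5.keys())[:5]:  # the first few groups are enough
        # assumed layout: read_<uuid>/Raw carries a 'read_id' attribute
        stored_id = f5[group_name]["Raw"].attrs.get("read_id", b"")
        if isinstance(stored_id, bytes):
            stored_id = stored_id.decode()
        print(group_name, "->", stored_id)

If the group names are the new uuids but the stored read_id attributes are still the originals, that would presumably explain the collisions in the later conversion steps.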

Thanks for your help.

fbrennen commented 2 years ago

Hi @CodingKaiser -- thanks for getting in touch. Could we have a look at one of the multi-read files your script produced (before you called multi_to_single or single_to_multi on it)?

CodingKaiser commented 2 years ago

Hi @fbrennen, sure thing and thanks for the speedy response. Here is a sample file.

I started with a file of 4000 sequences, selected the first 100 sequences, and duplicated them 10-fold to yield 1000 total sequences.

hp_reads0.upsample.fast5.zip

In the meantime I ran multi_to_single on this smaller file and discovered that I get more sequences out (5000) than I started with, which I do not understand. It appears to be the sum of the number of sequences in this file (1000) and the number of sequences in the original file (4000)...
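
For what it's worth, counting the read IDs the API reports for the upsampled file before any conversion should confirm whether it really contains 1000 reads (the path below is a placeholder):

from ont_fast5_api.multi_fast5 import MultiFast5File

with MultiFast5File("hp_reads0.upsample.fast5", 'r') as f5:  # placeholder path
    read_ids = f5.get_read_ids()
    print(len(read_ids), "reads,", len(set(read_ids)), "unique ids")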

fbrennen commented 2 years ago

Hi @CodingKaiser -- thank you very much for the file. The issue here is using add_existing_read() -- that copies the data from the original file, including the part of the data that stores the (original, unmodified) read_id. Try something like this instead:

import uuid
# input_file, dup_num and outfile_f5 (a MultiFast5File opened for writing)
# come from the surrounding script
for read_id, read in yield_fast5_reads(input_file, recursive=False):
    raw_attrs = read._get_attributes(read.raw_dataset_group_name)
    raw_data = read.get_raw_data()
    channel_info = read.get_channel_info()
    tracking_id = read.get_tracking_id()
    context_tags = read.get_context_tags()

    for _ in range(dup_num):
        new_id = str(uuid.uuid4())
        raw_attrs['read_id'] = new_id

        new_read = outfile_f5.create_empty_read(new_id, read.run_id)
        new_read.add_raw_data(raw_data, attrs=raw_attrs)
        new_read.add_channel_info(channel_info)
        new_read.add_tracking_id(tracking_id)
        new_read.add_context_tags(context_tags)
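
As a rough sanity check afterwards (the file names below are placeholders, and it relies only on get_read_ids()), the duplicated file should contain only brand-new uuids and none of the original read ids:

from ont_fast5_api.multi_fast5 import MultiFast5File

# placeholder names for the original and the duplicated multi-read files
with MultiFast5File("input.fast5", 'r') as in_f5, \
        MultiFast5File("output.fast5", 'r') as out_f5:
    in_ids = set(in_f5.get_read_ids())
    out_ids = out_f5.get_read_ids()
    # every duplicate should carry a fresh uuid, never an original read_id
    assert len(out_ids) == len(set(out_ids))
    assert not in_ids.intersection(out_ids)
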
CodingKaiser commented 2 years ago

Hi @fbrennen, thanks for the detailed example. Sadly, using the code only partially fixes the problem on my test data. While the correct number of single sequences is now output when using multi_to_single, I am getting the same type of error as before when trying to run single_to_multi:

ERROR:ont_fast5_api.conversion_tools.single_to_multi_fast5:Unable to create group (name already exists)
    Failed to add single read file: './hp_fast5_split/0/9b37a50e-4df5-4a2f-bc9f-7a044bb9661d.fast5' to './hp_fast5_batch/batch_0.fast5'

(repeated for every sequence, even those that end up in the final file)

The resulting multi-read file consists of the same number of reads I started with; I would assume this is because every duplicated sequence shares a group name and thus cannot be added?

fbrennen commented 2 years ago

Hi @CodingKaiser -- can you try starting from scratch, with new folders and a new set of input reads? I'm able to run the code I provided you above, create a set of duplicated reads, and then put them through multi_to_single and single_to_multi.

CodingKaiser commented 2 years ago

You're right, it's working. Not sure why it wasn't before, but it seems to be an issue with the specifics of my script, then. Thanks for all the help.

fbrennen commented 2 years ago

Good to hear! Glad you've got it working.

CodingKaiser commented 2 years ago

Just figured it out. It was just a copy-paste error which left out the important raw_attrs['read_id'] = new_id line 👍
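
For anyone who lands here later, this is roughly what the working duplication loop looks like once everything is put together (the paths and import locations are placeholders/assumptions for a recent ont_fast5_api and may need adjusting):

import uuid

from ont_fast5_api.multi_fast5 import MultiFast5File
from ont_fast5_api.conversion_tools.conversion_utils import yield_fast5_reads

dup_num = 10                                  # copies per read
input_file = "hp_reads0.fast5"                # placeholder paths
output_file = "hp_reads0.upsample.fast5"

id_tree = dict()
with MultiFast5File(output_file, 'w') as out_f5:
    for read_id, read in yield_fast5_reads(input_file, recursive=False):
        id_tree[read_id] = []
        raw_attrs = read._get_attributes(read.raw_dataset_group_name)
        raw_data = read.get_raw_data()
        channel_info = read.get_channel_info()
        tracking_id = read.get_tracking_id()
        context_tags = read.get_context_tags()

        for _ in range(dup_num):
            new_id = str(uuid.uuid4())
            raw_attrs['read_id'] = new_id     # the line my copy-paste had dropped

            new_read = out_f5.create_empty_read(new_id, read.run_id)
            new_read.add_raw_data(raw_data, attrs=raw_attrs)
            new_read.add_channel_info(channel_info)
            new_read.add_tracking_id(tracking_id)
            new_read.add_context_tags(context_tags)
            id_tree[read_id].append(new_id)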