Closed: CodingKaiser closed this issue 2 years ago.
Hi @CodingKaiser -- thanks for getting in touch. Could we have a look at one of the multi-read files your script produced (before you called multi_to_single or single_to_multi on it)?
Hi @fbrennen, sure thing, and thanks for the speedy response. Here's a sample file.
I started with a file of 4000 sequences, selected the first 100 sequences, and duplicated them 10-fold to yield 1000 total sequences.
In the meantime I ran multi_to_single on this smaller file and discovered that I get more sequences out than I started with (5000), which I do not understand. It appears to be the sum of the number of sequences in the file (1000) and the number of sequences in the original file...
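For reference, the counts described above line up as follows (a quick sanity check, using only the numbers given in this thread):

```python
selected = 100    # sequences kept from the original file
dup_num = 10      # duplication factor
original = 4000   # sequences in the original file

duplicated = selected * dup_num
assert duplicated == 1000
assert duplicated + original == 5000  # the unexpected total observed
```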
Hi @CodingKaiser -- thank you very much for the file. The issue here is using add_existing_read() -- that copies the data from the original file, including the part of the data that stores the (original, unmodified) read_id. Try something like this instead:
import uuid

for read_id, read in yield_fast5_reads(input_file, recursive=False):
    raw_attrs = read._get_attributes(read.raw_dataset_group_name)
    raw_data = read.get_raw_data()
    channel_info = read.get_channel_info()
    tracking_id = read.get_tracking_id()
    context_tags = read.get_context_tags()
    for _ in range(dup_num):
        new_id = str(uuid.uuid4())
        # Overwrite the stored read_id so the copy doesn't keep the original one
        raw_attrs['read_id'] = new_id
        new_read = outfile_f5.create_empty_read(new_id, read.run_id)
        new_read.add_raw_data(raw_data, attrs=raw_attrs)
        new_read.add_channel_info(channel_info)
        new_read.add_tracking_id(tracking_id)
        new_read.add_context_tags(context_tags)
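The key step is giving every duplicate its own read_id before it is written. A self-contained illustration of that part of the loop (plain Python, no fast5 machinery) shows why the raw_attrs update matters:

```python
import uuid

# Each duplicate must get a fresh read_id before being written out;
# uuid4 values are effectively guaranteed to be distinct.
raw_attrs = {'read_id': 'original-id'}
written_ids = []
for _ in range(10):
    new_id = str(uuid.uuid4())
    raw_attrs['read_id'] = new_id   # the line that is easy to forget
    written_ids.append(raw_attrs['read_id'])

assert len(set(written_ids)) == 10       # all ten ids are distinct
assert 'original-id' not in written_ids  # the stale id never leaks through
```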
Hi @fbrennen, thanks for the detailed example. Sadly, using the code only partially fixes the problem on my test data. While the correct number of single sequences is now output when using multi_to_single, I am getting the same type of error as before when trying to run single_to_multi:
ERROR:ont_fast5_api.conversion_tools.single_to_multi_fast5:Unable to create group (name already exists)
Failed to add single read file: './hp_fast5_split/0/9b37a50e-4df5-4a2f-bc9f-7a044bb9661d.fast5' to './hp_fast5_batch/batch_0.fast5'
(This is repeated for every sequence, even those that end up in the final file.) The resulting multi-read file consists of the same number of reads I started with -- I would assume because every duplicated sequence shares a group name and thus cannot be added?
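The collision the error describes can be reproduced in miniature. This is a toy stand-in, not the real ont_fast5_api or HDF5 calls: a dict plays the role of the multi-read file, and group names must be unique within it, so a second read carrying the same read_id is rejected:

```python
# Toy model of a multi-read file: group names must be unique, so adding
# two reads that carry the same read_id fails on the second attempt.
def add_read(multi_file, read_id):
    group_name = f"read_{read_id}"
    if group_name in multi_file:
        raise ValueError("Unable to create group (name already exists)")
    multi_file[group_name] = {"read_id": read_id}

batch = {}
add_read(batch, "9b37a50e")          # first copy is accepted
try:
    add_read(batch, "9b37a50e")      # duplicate read_id is rejected
    collided = False
except ValueError:
    collided = True
```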
Hi @CodingKaiser -- can you try starting from scratch, with new folders and a new set of input reads? I'm able to run the code I provided you above, create a set of duplicated reads, and then put them through multi_to_single and single_to_multi.
You're right, it's working. Not sure why it wasn't before, but it seems to be an issue with the specifics of my script, then. Thanks for all the help.
Good to hear! Glad you've got it working.
Just figured it out. It was just a copy-paste error which left out the important raw_attrs['read_id'] = new_id line 👍
Hi there. I am currently trying to use the API to take the reads within a fast5 file and duplicate each of them 10-fold. For this purpose, I have written a small script which loads reads from a MultiFast5File and, for each desired duplicate, creates a new random uuid and puts a copy of the read with the new uuid into a new file. Here is the code:
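The original snippet is not included above. As a stand-in, here is a self-contained sketch of the logic described, using plain dicts rather than the real ont_fast5_api objects (all names here are illustrative):

```python
import uuid

def duplicate_reads(reads, dup_num=10):
    """Return dup_num copies of each read, each keyed by a fresh uuid."""
    out = {}
    for read in reads.values():
        for _ in range(dup_num):
            new_id = str(uuid.uuid4())
            copy = dict(read)
            copy['read_id'] = new_id  # the stored read_id must be updated too
            out[new_id] = copy
    return out

# 100 toy reads duplicated 10-fold yields 1000 reads, all with distinct ids
reads = {f"id{i}": {"read_id": f"id{i}", "signal": [i]} for i in range(100)}
duplicated = duplicate_reads(reads)
```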
Because I want files with batches of 4000, I then run multi_to_single_fast5.py on the resulting files, followed by single_to_multi_fast5.py. However, this last step throws many error messages, all of the form:
As far as I understand it, each time add_existing_read is called, a new group name should be created using the new read_id. Based on this error message, however, that doesn't seem to be happening. While I could just adjust my script to do the batching myself, the error message makes me unsure of whether I am correctly making copies of the individual reads. Thanks for your help.