Closed Theo-Nelson closed 3 months ago
P.S.: I ran slow5tools quickcheck /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5
and slow5tools stats HCV_IVT_004_ALIGNED_SIGNALS.blow5
without issue
file version 0.2.0 file format BLOW5 record compression method zlib sigal compression method svb-zd number of read groups 1 number of auxiliary fields 6 auxiliary fields end_reason,channel_number,median_before,read_number,start_mux,start_time [stats_main] counting number of slow5 records... number of records 1803
[main] cmd: slow5tools stats HCV_IVT_004_ALIGNED_SIGNALS.blow5 [main] real time = 0.016 sec | CPU time = 0.022 sec | peak RAM = 0.106 GB
Hello @Theo-Nelson,
This is happening because there are duplicate read ids in the blow5 file. Do you have a log how this file was created?
hiruna@hiruna-XPS-15-9500:~/Downloads$ slow5tools skim HCV_IVT_004_ALIGNED_SIGNALS.blow5 | cut -f 1 | sort -k1,1 | uniq -c | grep -v "1 "
[main] cmd: slow5tools skim HCV_IVT_004_ALIGNED_SIGNALS.blow5
[main] real time = 0.144 sec | CPU time = 0.875 sec | peak RAM = 0.090 GB
2 1782d7b0-9740-4811-b307-3609cabf11d6
2 230008b5-c717-451b-8212-ac0d71126adf
2 318965b6-7038-41a4-ad08-067dd989b0a5
2 4b24343e-7480-4795-be40-04f878db4792
2 5bf9eccf-5f66-49e9-bb5d-3355ead10087
2 649f4206-a100-4b20-b249-ce680b261f21
2 824cc828-62d8-4dd2-bf04-5e1f0ab44344
2 838793ab-7c84-42d4-b08f-73ed69e94ac2
2 b4646af7-db92-438b-ad8e-c4fcbc43b544
2 bc68de51-72df-4794-921d-426f52d2446e
2 c8e3f51b-fe0c-452d-963d-06e1a890a53a
2 f774dd72-54ae-45f5-b1e6-56f6b8351e4e
Ah I see! The file was generated by first extracting the readIDs:
slow5tools skim --rid HCV_IVT_004.blow5 > HCV_IVT_004_full_blow5_readIDs.txt
taking minimap2 generated alignments for the same file to generate common read IDs that aligned to the HCV reference transcriptome: samtools view HCV_IVT_004.sorted.bam HCV_genome | awk '{print $1}' | grep -F -f HCV_IVT_004_full_blow5_readIDs.txt > HCV_IVT_common_readIDs.txt
and then using the following command
slow5tools get HCV_IVT_004.blow5 -l HCV_IVT_common_readIDs.txt -o HCV_IVT_reads_test_aligned.blow5
HCV_IVT_reads_test_aligned.blow5 was renamed to HCV_IVT_004_ALIGNED_SIGNALS.blow5
I've attached the common readIDs text file which seems to have duplicates for 1782d7b0-9740-4811-b307-3609cabf11d6
There is only one copy it seems in the HCV_IVT_004_full_blow5_readIDs.txt file, indicating that perhaps the appearance of a readID multiple times in the SAM file could be causing this issue. I will try to modify that command to only include unique readIDs
HCV_IVT_004_full_blow5_readIDs.txt
Thank you very much for your help!
Cool. I suppose this command will work
samtools view $BAM | cut -f 1 | sort -k1,1 | uniq | slow5tools get -o $BLOW5
to check if the $BLOW5 is formed well it is a good practice to create an index
slow5tools index $BLOW5
Thanks @hiruna72 for the help. sam files may have supplementary and secondary alignments and each read ID could be duplicated. As Hiruna suggested we will have to sort and remove the duplicates. You can use the command suggested by Hiruna or the one under "Extract and re-basecall reads mapping to a particular genomic region" at https://hasindu2008.github.io/slow5tools/workflows.html
@Theo-Nelson I hope your issue was resolved. Closing the issue for now. Please reopen if you are still having trouble.
Dear slow5tools developers,
When working with the following blow5 file: https://github.com/Theo-Nelson/SMS-data/raw/main/HCV/HCV_IVT_004_ALIGNED_SIGNALS.blow5
I received the following error message:
[list_all_items] Looking for '*low5' files in /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 [s2f_main] 1 files found - took 0.000s [s2f_iop] 1 proceses will be used [s2f_child_worker] Converting /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 to fast5 [write_fast5::ERROR] Could not create read group in fast5 file 'HCV_IVT_004_reads_aligned.fast5'.
when running the command:
slow5tools s2f /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 -o HCV_IVT_004_reads_aligned.fast5
My slow5tools version was in conda with the following version: slow5tools-0.3.0. I then downloaded the latest binary release:
slow5tools-v1.1.0-x86_64-linux-binaries.tar.gz
and rantar xf /content/slow5tools-v1.1.0-x86_64-linux-binaries.tar.gz
I then ran:
/content/slow5tools-v1.1.0/slow5tools s2f /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 -o HCV_IVT_004_reads_aligned.fast5
with the same error message.I also attempted to convert the blow5 to a slow5 file:
/content/slow5tools-v1.1.0/slow5tools view /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 -o /content/HCV_IVT_004_ALIGNED_SIGNALS.slow5
and ran/content/slow5tools-v1.1.0/slow5tools s2f /content/HCV_IVT_004_ALIGNED_SIGNALS.slow5 -o HCV_IVT_004_reads_aligned.fast5
and replicated the same error.For all of these runs, I noticed that the dorado basecaller would subsequently detect / process 769 signals.
Lastly, I split the file into individual signals:
/content/slow5tools-v1.1.0/slow5tools split /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 -d /content/HCV_IVT_004_ALIGNED_SIGNALS -r 1
and converted each individual signal to a fast5 file:I did not notice any error messages in my converted log. The dorado basecaller processed 1791 signals from this folder.
conversion_log.txt
Not sure what might be happening with the bulk processing, but hopefully the above report is helpful. Please let me know if there is any additional information I could provide regarding this.