hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License
94 stars 6 forks source link

Could not create read group in fast5 file #109

Closed Theo-Nelson closed 3 months ago

Theo-Nelson commented 5 months ago

Dear slow5tools developers,

When working with the following blow5 file: https://github.com/Theo-Nelson/SMS-data/raw/main/HCV/HCV_IVT_004_ALIGNED_SIGNALS.blow5

I received the following error message:

[list_all_items] Looking for '*low5' files in /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 [s2f_main] 1 files found - took 0.000s [s2f_iop] 1 proceses will be used [s2f_child_worker] Converting /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 to fast5 [write_fast5::ERROR] Could not create read group in fast5 file 'HCV_IVT_004_reads_aligned.fast5'.

when running the command:

slow5tools s2f /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 -o HCV_IVT_004_reads_aligned.fast5

My slow5tools version was in conda with the following version: slow5tools-0.3.0. I then downloaded the latest binary release: slow5tools-v1.1.0-x86_64-linux-binaries.tar.gz and ran tar xf /content/slow5tools-v1.1.0-x86_64-linux-binaries.tar.gz

I then ran: /content/slow5tools-v1.1.0/slow5tools s2f /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 -o HCV_IVT_004_reads_aligned.fast5 with the same error message.

I also attempted to convert the blow5 to a slow5 file: /content/slow5tools-v1.1.0/slow5tools view /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 -o /content/HCV_IVT_004_ALIGNED_SIGNALS.slow5 and ran /content/slow5tools-v1.1.0/slow5tools s2f /content/HCV_IVT_004_ALIGNED_SIGNALS.slow5 -o HCV_IVT_004_reads_aligned.fast5 and replicated the same error.

For all of these runs, I noticed that the dorado basecaller would subsequently detect / process 769 signals.

Lastly, I split the file into individual signals: /content/slow5tools-v1.1.0/slow5tools split /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 -d /content/HCV_IVT_004_ALIGNED_SIGNALS -r 1 and converted each individual signal to a fast5 file:

%%bash

# Define the directory containing BLOW5 files
BLOW5_DIR="/content/HCV_IVT_004_ALIGNED_SIGNALS"
# Path to slow5tools
SLOW5TOOLS_PATH="/content/slow5tools-v1.1.0/slow5tools"

# Log file path
LOG_FILE="/content/conversion_log.txt"

# Navigate to the directory
cd "$BLOW5_DIR"

# Loop through each BLOW5 file in the directory
for blow5_file in *.blow5; do
    # Generate a unique slow5 filename by replacing the .blow5 extension with .slow5
    slow5_file="${blow5_file%.blow5}.slow5"

    # Print the slow5 file that is being processed
    echo "Processing $slow5_file"

    # Convert BLOW5 to SLOW5
    $SLOW5TOOLS_PATH view "$blow5_file" -o "$slow5_file" 

    # Check if the conversion was successful
    if [ -f "$slow5_file" ]; then
        # Convert SLOW5 to FAST5
        $SLOW5TOOLS_PATH s2f "$slow5_file" -o "${slow5_file%.slow5}.fast5" >>"$LOG_FILE" 2>&1
    else
        echo "Failed to convert $blow5_file to $slow5_file"
    fi
done

I did not notice any error messages in my converted log. The dorado basecaller processed 1791 signals from this folder.

conversion_log.txt

Not sure what might be happening with the bulk processing, but hopefully the above report is helpful. Please let me know if there is any additional information I could provide regarding this.

Theo-Nelson commented 5 months ago

P.S.: I ran slow5tools quickcheck /content/HCV_IVT_004_ALIGNED_SIGNALS.blow5 and slow5tools stats HCV_IVT_004_ALIGNED_SIGNALS.blow5 without issue

file version 0.2.0 file format BLOW5 record compression method zlib sigal compression method svb-zd number of read groups 1 number of auxiliary fields 6 auxiliary fields end_reason,channel_number,median_before,read_number,start_mux,start_time [stats_main] counting number of slow5 records... number of records 1803

[main] cmd: slow5tools stats HCV_IVT_004_ALIGNED_SIGNALS.blow5 [main] real time = 0.016 sec | CPU time = 0.022 sec | peak RAM = 0.106 GB

hiruna72 commented 5 months ago

Hello @Theo-Nelson,

This is happening because there are duplicate read ids in the blow5 file. Do you have a log how this file was created?

hiruna@hiruna-XPS-15-9500:~/Downloads$ slow5tools skim HCV_IVT_004_ALIGNED_SIGNALS.blow5 | cut -f 1 | sort -k1,1 | uniq -c | grep -v "1 "

[main] cmd: slow5tools skim HCV_IVT_004_ALIGNED_SIGNALS.blow5
[main] real time = 0.144 sec | CPU time = 0.875 sec | peak RAM = 0.090 GB
      2 1782d7b0-9740-4811-b307-3609cabf11d6
      2 230008b5-c717-451b-8212-ac0d71126adf
      2 318965b6-7038-41a4-ad08-067dd989b0a5
      2 4b24343e-7480-4795-be40-04f878db4792
      2 5bf9eccf-5f66-49e9-bb5d-3355ead10087
      2 649f4206-a100-4b20-b249-ce680b261f21
      2 824cc828-62d8-4dd2-bf04-5e1f0ab44344
      2 838793ab-7c84-42d4-b08f-73ed69e94ac2
      2 b4646af7-db92-438b-ad8e-c4fcbc43b544
      2 bc68de51-72df-4794-921d-426f52d2446e
      2 c8e3f51b-fe0c-452d-963d-06e1a890a53a
      2 f774dd72-54ae-45f5-b1e6-56f6b8351e4e
Theo-Nelson commented 5 months ago

Ah I see! The file was generated by first extracting the readIDs:

slow5tools skim --rid HCV_IVT_004.blow5 > HCV_IVT_004_full_blow5_readIDs.txt

taking minimap2 generated alignments for the same file to generate common read IDs that aligned to the HCV reference transcriptome: samtools view HCV_IVT_004.sorted.bam HCV_genome | awk '{print $1}' | grep -F -f HCV_IVT_004_full_blow5_readIDs.txt > HCV_IVT_common_readIDs.txt

and then using the following command

slow5tools get HCV_IVT_004.blow5 -l HCV_IVT_common_readIDs.txt -o HCV_IVT_reads_test_aligned.blow5

HCV_IVT_reads_test_aligned.blow5 was renamed to HCV_IVT_004_ALIGNED_SIGNALS.blow5

I've attached the common readIDs text file which seems to have duplicates for 1782d7b0-9740-4811-b307-3609cabf11d6

HCV_IVT_common_readIDs.txt

There is only one copy it seems in the HCV_IVT_004_full_blow5_readIDs.txt file, indicating that perhaps the appearance of a readID multiple times in the SAM file could be causing this issue. I will try to modify that command to only include unique readIDs

HCV_IVT_004_full_blow5_readIDs.txt

Thank you very much for your help!

hiruna72 commented 5 months ago

Cool. I suppose this command will work

samtools view $BAM | cut -f 1 | sort -k1,1 | uniq | slow5tools get  -o $BLOW5

to check if the $BLOW5 is formed well it is a good practice to create an index

slow5tools index $BLOW5
hasindu2008 commented 5 months ago

Thanks @hiruna72 for the help. sam files may have supplementary and secondary alignments and each read ID could be duplicated. As Hiruna suggested we will have to sort and remove the duplicates. You can use the command suggested by Hiruna or the one under "Extract and re-basecall reads mapping to a particular genomic region" at https://hasindu2008.github.io/slow5tools/workflows.html

hasindu2008 commented 3 months ago

@Theo-Nelson I hope your issue was resolved. Closing the issue for now. Please reopen if you are still having trouble.