hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
140 stars 27 forks source link

Some issues when using 'f5c eventalign -- pore rna004' #181

Open llaomer opened 4 days ago

llaomer commented 4 days ago

Hi,

I am trying to run eventalign on reads from RNA004 and I get the following error: [meth_main::13.6310.41] 0 Entries (0.0M bases) loaded [pthread_processor::13.6310.41] 0 Entries (0.0M bases) processed [meth_main] skipped unmapped: 0, skipped secondary: 0, skipped low_mapq: 380450 [meth_main] total entries: 6763, qc fail: 0, could not calibrate: 0, no alignment: 0, bad reads: 6763 [meth_main] total bases: 0.0 Mbases [meth_main] Data loading time: 12.492 sec [meth_main] - bam load time: 1.736 sec [meth_main] - fasta load time: 4.313 sec [meth_main] - fast5 load time: 6.340 sec [meth_main] - fast5 open time: 6.340 sec [meth_main] - fast5 read time: 0.000 sec [meth_main] Data processing time: 0.000 sec [meth_main] Data output time: 0.000 sec [meth_main::INFO] Performance bounded by file I/O. File I/O took 12.491 sec than processing [meth_main::WARNING] Skipped 380450 reads with MAPQ < 20. Use --min-mapq to change the threshold. [meth_main::WARNING] 6763 out of 6763 reads missing in FAST5/SLOW5. [meth_main::ERROR] all 6763 out of 6763 reads failed. Check the input files.

This is the command : f5c eventalign -b align_sorted.bam -g mpox.fasta -r raw.fastq.gz --pore rna004 --rna > eventalign.tsv

My file is fast5 converted from pod5. The version of f5c is 1.5. Could you give me some advice? I am a novice in this field.

Best wishes, laomer

hasindu2008 commented 4 days ago

Hey, could you convert the pod5 files to blow5 using blue-crab [https://github.com/Psy-Fer/blue-crab] and see if you still getting the same error? Also Skipped 380450 reads with MAPQ is concerning - how did you basecall the reads (raw.fastq.gz) and map the reads (align_sorted.bam)?

llaomer commented 4 days ago

Sorry,my data is RNA data for R10.4, which is not supported by Blow5.

raw.fastq.gz(I merged the fastq-pass data generated by MinION) cat fastq_pass/*.fastq.gz > raw.fastq.gz

This is my command (align_sorted.bam): f5c index -d fast5 raw.fastq.gz minimap2 -ax map-ont -t 8 mpox.fasta raw.fastq.gz | samtools sort -o align_sorted.bam -T reads.tmp samtools index align_sorted.bam

hasindu2008 commented 3 days ago

BLOW5 supports any chemistry/flowcell combination.

Commands look allright, but could be something that happened during basecalling. If you can get the header metadata in the signals that could give us some clues.

If you have the data converted to blow5 format, you can use slow5tools skim --hdr reads.blow5 command to grab the header metadata.

llaomer commented 3 days ago

Hi,This is the result of the command:slow5tools skim --hdr reads.blow5 (f5c) zzr@zzr-VirtualBox:~/shares$ cat blow5/*.blow5 > reads.blow5 (f5c) zzr@zzr-VirtualBox:~/shares$ slow5tools skim --hdr reads.blow5

slow5_version 0.2.0

num_read_groups 1

@acquisition_id 5ead6516746fd7b2d57b13c49d6d13312cc018bc @acquisition_start_time 2024-07-01 22:06:32.425000+00:00 @adc_max 4095 @adc_min -4096 @asic_id 5097 @asic_id_eeprom 9473893 @asic_temp 546.896545 @asic_version IA02D @barcoding_enabled 0 @basecall_config_filename rna_rp4_130bps_hac_mk1c.cfg @configuration_version 5.9.18 @data_source real_device @device_id MC-110967 @device_type minion_mk1c @distribution_status stable @distribution_version 24.02.16 @exp_script_name sequencing/sequencing_MIN004RA_RNA:FLO-MIN004RA:SQK-RNA004:130 @exp_script_purpose sequencing_run @exp_start_time 2024-07-02T06:06:32.425856+08:00 @experiment_name 24-7-1 @experiment_type rna @flow_cell_id FAZ23462 @flow_cell_product_code FLO-MIN004RA @guppy_version 7.3.11+0112dde09 @heatsink_temp 33.988281 @host_product_code MIN-101C @host_product_serial_number MC-110967 @hostname MC-110967 @installation_type nc @is_simulated 0 @local_basecalling 1 @operating_system ubuntu 18.04 @package bream4 @package_version 7.9.8 @protocol_group_id 24-7-1 @protocol_name sequencing/sequencing_MIN004RA_RNA:FLO-MIN004RA:SQK-RNA004:130 @protocol_run_id aadb8975-9f13-48da-ac6d-3fe03446b6a4 @protocol_start_time 2024-07-02T06:05:52.213066+08:00 @protocols_version 7.9.8 @run_id 5ead6516746fd7b2d57b13c49d6d13312cc018bc @sample_frequency 4000 @sample_id no_sample @sample_rate 4000 @selected_speed_bases_per_second 130 @sequencer_position MC-110967 @sequencer_position_type MinION Mk1C @sequencing_kit sqk-rna004 @software MinKNOW 24.02.16 (Bream 7.9.8, Core 5.9.12, Dorado 7.3.11+0112dde09) @system_name MC-110967 @system_type MinION Mk1C @usb_config fx3_0.0.0#fpga_2.4.3#ctrl#USB0 @version 5.9.12

char uint32_t double double double double uint64_t int16_tchar* double int32_t uint8_t uint64_t enum{unknown,mux_change,unblock_mux_change,data_service_unblock_mux_change,signal_positive,signal_negative,api_request,device_data_error} float float float float uint32_t float uint64_t

read_id read_group digitisation offset range sampling_rate len_raw_signal raw_signal channel_number median_before read_number start_mux start_time end_reason tracked_scaling_shift tracked_scaling_scale predicted_scaling_shift predicted_scaling_scale num_reads_since_mux_change time_since_mux_change num_minknow_events

[main] cmd: slow5tools skim --hdr reads.blow5 [main] real time = 0.008 sec | CPU time = 0.022 sec | peak RAM = 0.007 GB

In addition, I will go through the process again using blow5. However, the following issues occurred during the indexing step (f5c) zzr@zzr-VirtualBox:~/shares$ f5c index --slow5 reads.blow5 raw.fastq.gz [slow5_idx_build::ERROR] Failed to allocate memory: Cannot allocate memory At src/slow5_idx.c:280 [sig_handler::ERROR] I regret to inform that a segmentation fault occurred. But at least it is better than a wrong answer.

hasindu2008 commented 3 days ago

Looking at the header, the basecalling metadata looks all right. Is this mpox.fasta the reference genome or the transcriptome?

About the blow5 error - Can I know the command you used to do the conversion and can you send the full log messages? Also how much RAM does your virtual-box VM has got? I am trying to see if the conversion ended abruptly due to out of memory.

Can you check if the slow5 file is intact by running and pasting the output of the following:

  1. slow5tools quickcheck reads.blow5
  2. slow5tools stats reads.blow5
  3. slow5tools index reads.blow5
llaomer commented 1 day ago

Yes, mpox.fasta is the reference genome.

The command of conversion: blue-crab p2s pod5_dir -d blow5_dir (f5c) zzr@zzr-VirtualBox:~/shares$ blue-crab p2s pod5_dir -d blow5_dir 17-Oct-24 15:56:09 - blue-crab - [INFO]: Creating directory: blow5_dir 17-Oct-24 15:56:09 - blue-crab - [INFO]: many2many: 73 pod5 files detected as input. Writing 1:1 pod5->s/blow5 to dir: blow5_dir 17-Oct-24 15:56:09 - blue-crab - [INFO]: writing s/blow5 to dir: blow5_dir 17-Oct-24 16:13:24 - blue-crab - [INFO]: pod5 -> s/blow5 complete

Virtual memory:64GB Actual memory allocation:62.18GB

(f5c) zzr@zzr-VirtualBox:~/shares$ slow5tools quickcheck reads.blow5 [main] cmd: slow5tools quickcheck reads.blow5 [main] real time = 0.004 sec | CPU time = 0.014 sec | peak RAM = 0.007 GB

(f5c) zzr@zzr-VirtualBox:~/shares$ slow5tools stats reads.blow5 file path reads.blow5 file version 0.2.0 file format BLOW5 record compression method zlib signal compression method svb-zd number of read groups 1 number of auxiliary fields 13 auxiliary fields channel_number,median_before,read_number,start_mux,start_time,end_reason,tracked_scaling_shift,tracked_scaling_scale,predicted_scaling_shift,predicted_scaling_scale,num_reads_since_mux_change,time_since_mux_change,num_minknow_events [stats_main] counting number of slow5 records... [slow5_get_next_mem::ERROR] Failed to allocate memory: Cannot allocate memory At src/slow5.c:3265 [stats_main::ERROR] Error reading the file.

(f5c) zzr@zzr-VirtualBox:~/shares$ slow5tools index reads.blow5 [slow5_idx_build::ERROR] Failed to allocate memory: Cannot allocate memory At src/slow5_idx.c:280 [segv_handler::ERROR] I regret to inform that a segmentation fault occurred. But at least it is better than a wrong answer. [src/main.c(segv_handler:105)::DEBUG] Here is the backtrace: /lib/x86_64-linux-gnu/libc.so.6(+0x1a0777)[0x7fbab07a0777] /lib/x86_64-linux-gnu/libc.so.6(+0x8b3e6)[0x7fbab068b3e6] /lib/x86_64-linux-gnu/libc.so.6(fread+0x79)[0x7fbab067fba9] slow5tools(+0x4be02)[0x559259966e02] slow5tools(slow5_idx_to+0x1e)[0x5592599684ee] slow5tools(slow5_idx_create+0x31)[0x55925995cdb1] slow5tools(_Z10index_mainiPPcP12program_meta+0x331)[0x5592599305d1] slow5tools(main+0x89d)[0x55925992725d] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fbab0629d90] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fbab0629e40] slow5tools(+0xc4c5)[0x5592599274c5]

hasindu2008 commented 1 day ago

Thanks, how did you generate this reads.blow5 from the blow5_dir - did you use slow5tools merge? Could you give me the command that you ran to merge and what log it printed?

llaomer commented 1 day ago

Thank you very much!Based on your suggestion, I merged the files using slow5tools. Everything went smoothly. But I have a question. This command_(f5c eventalign -b [reads.sorted.bam] -g [ref.fa] -r [reads.fastq|fasta] --slow5 [slow5_file] > [events.tsv])_is for S/BLOW5 (R9.4 DNA/RNA or R10.4 DNA data), is it valid for R10.4 RNA data?

hasindu2008 commented 17 hours ago

Glad to hear that it works with the merge. How did you create the reads.blow5 before, I am curious to know what caused that corrupted blow5 before.

Yeh, the command you mentioned is valid for all those cases. When using BLOW5 input, f5c will autodetect the model based on the metadata in the BLOW5 (as long as the POD5 has correct metadata). You can double-check by looking at the log messages from f5c:

Do you still get this all 6763 out of 6763 reads failed. Check the input files. when you use the S/BLOW5 as input? If so, highly likely that the BAM file / FASTQ file does not match the signal data, which I can tell you how to investigate.

llaomer commented 16 hours ago

I previously used this command (*cat blow5_dir/.blow5 > reads.blow5**) to generate reads.blow5. This time I used this command (slow5tools merge blow5_dir -o merged.blow5).

Yes, they are RP4/RNA004. Thank you very much for your detailed explanation.

When I used S/BLOW5 as input, the previous error did not occur. (f5c) zzr@zzr-VirtualBox:~/shares/m6anet/m6a$ f5c eventalign -b align_sorted.bam -g mpox.fasta -r raw.fastq.gz --slow5 merged.blow5 > eventalign.tsv [init_core::INFO] builtin RNA004 nucleotide model loaded [meth_main] skipped unmapped: 0, skipped secondary: 0, skipped low_mapq: 380450 [meth_main] total entries: 6763, qc fail: 1, could not calibrate: 750, no alignment: 63, bad reads: 0 [meth_main] total bases: 3.2 Mbases [meth_main] Data loading time: 47.400 sec [meth_main] - bam load time: 7.262 sec [meth_main] - fasta load time: 36.137 sec [meth_main] - slow5 load time: 3.959 sec [meth_main] Data processing time: 64.680 sec [meth_main] Data output time: 4.090 sec [meth_main::WARNING] Skipped 380450 reads with MAPQ < 20. Use --min-mapq to change the threshold. [main] Version: 1.5 [main] CMD: f5c eventalign -b align_sorted.bam -g mpox.fasta -r raw.fastq.gz --slow5 merged.blow5 [main] Real time: 68.790 sec; CPU time: 65.246 sec; Peak RAM: 0.334 GB

Thank you again. Thank you for your patience and assistance.

hasindu2008 commented 16 hours ago

Oh right, you cannot simply cat any type of files like that. You can cat plain text files or .gz files, but SAM,BAM,SLOW5,BLOW5,VCF,BCF, FAST5,POD5.... you should use the appropriate tools to combine them.

So it seems like POD5->FAST5 converter may have caused some issue, in anycase, using BLOW5 input is much efficient than FAST5, so simply convert POD5->BLOW5 rather than POD5>FAST5.

Skipped 380450 reads with MAPQ < 20. Use --min-mapq to change the threshold. this is a message that you can pay attention to. This means that the alignment (output from minimap2) had a lot of reads that are of very low mapping quality. If this is anticipated due to the quality of the reference or data, you can simply provide --min-mapq 0 to include everything including 0 quality alignments in the f5c output (or a threshold that is relevant to your problem). If such low-quality alignments are unexpected, suggest looking at the minimap2 flags, your reference and may be visualise through IGV to see if the mappings make sense.

From minimap2 getting started at https://github.com/lh3/minimap2?tab=readme-ov-file#getting-started, the suggested flags for direct-RNA mapping for a genome: ./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam # noisy Nanopore Direct RNA-seq

llaomer commented 11 hours ago

Thank you for your help and guidance!

Yes, the warning message is anticipated,because my sequencing data is poor.I think I can only use these data to practice.

llaomer commented 10 hours ago

Excuse me again, have you encountered the same issue while using m6anet? I couldn't find the answer from the m6anet developer.If you haven't encountered the same problem, you don't need to answer me. When I further used m6anet for modification analysis, I encountered the following problem: 1.(m6anet) zzr@zzr-VirtualBox:~/shares/m6anet/m6a$ m6anet dataprep --eventalign eventalign.tsv --out_dir output --n_processes 4 I used the eventalign.tsv to run the dataprep.The run finished without error, but the data.log and data.json are empty.

2.(m6anet) zzr@zzr-VirtualBox:~/shares/m6anet/m6a$ m6anet inference --input_dir output --out_dir output2 --n_processes 4 --num_iterations 1000 IndexError: single positional indexer is out-of-bounds

Best wishes, laomer

hasindu2008 commented 10 hours ago

I ran m6anet on an RNA004 samples sometime ago using f5c v1.4 and the commands I used were (f5c v1.5 would behave similarly):

f5c index reads.fastq --slow5 reads.blow5
samtools sort minimap_out.sam -o minimap_out.bam 
samtools index minimap_out.bam
f5c eventalign --reads reads.fastq --bam minimap_out.bam -g gencode.v40.transcripts.fa --slow5 reads.blow5 --scale-events --signal-index --summary f5c_summary.tsv  --threads 10 -o f5c.tsv --rna --kmer-model f5c/test/rna004-models/rna004.nucleotide.5mer.model # download from https://raw.githubusercontent.com/hasindu2008/f5c/refs/heads/master/test/rna004-models/rna004.nucleotide.5mer.model
m6anet dataprep --eventalign f5c.tsv --out_dir m6anet/ --n_processes 8
m6anet inference --input_dir m6anet/ --out_dir m6anet/ --pretrained_model HEK293T_RNA004 --n_processes 8 --num_iterations 1000

That executed without a problem.