hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
144 stars 26 forks source link

f5c about kernel:NMI watchdog: BUG: soft lockup #185

Open kir1to455 opened 2 weeks ago

kir1to455 commented 2 weeks ago

Hi, @hasindu2008

When I use f5c to eventalign RNA004 data, the system popped up this bug and crashed. Here are a few screenshots.

image

Here is my code: for i in input; do echo $i mkdir -p $blow5_dir/${i} mkdir -p $Event_dir/${i} blue-crab p2s ${Pod5Dir}/${i}/pod5_pass -d $blow5_dir/${i} -t 30 -p 10 ### pod5 to slow5 slow5tools merge $blow5_dir/${i} -o ${blow5_dir}/${i}_sup.pass.blow5 -t 30 Bamfiles=(find $BamDir/${i} -name "${i}_merge_sup_chr*.sorted.bam") for bfile in ${Bamfiles[@]}; do echo $bfile bbase=basename $bfile .sorted.bam ${f5c_dir}/f5c index -t 40 ${FastqDir}/${i}/${bbase}.fastq --slow5 ${blow5_dir}/${i}_sup.merge.blow5 ${f5c_dir}/f5c eventalign --reads ${FastqDir}/${i}/${bbase}.fastq --bam $bfile --genome ${index_dir}/gencode.vM33.normal.transcripts.fa --slow5 ${blow5_dir}/${i}_sup.merge.blow5 -t 30 --kmer-model ${f5c_dir}/test/rna004-models/rna004.nucleotide.5mer.model --min-mapq 0 --secondary=no --rna --signal-index --scale-events --collapse-events --samples -B 14M -K 1024 --cuda-dev-id 0 --summary ${Event_dir}/${i}/${bbase}_nanopolish.summary.txt | pigz > ${Event_dir}/${i}/${bbase}.eventalign.tsv.gz ## --samples raw events --collapse-events done done ` I index the fastq file of the chromosome with merge blow5 before running f5c eventalign each time.

And my log file, I found the some reads were not found in file.

image

Is it not possible to use the split fastq files(chr1...chr2...chr3...) to index the integrated blow5 files?

Best wishes, Kirito

hasindu2008 commented 2 weeks ago

Hello

What is the specification of your system? This kernel error message is unlikely to have anything to do with f5c. I have seen this error a few times on single-board computers running unstable operating systems when the system load is high.

Is there a reason you are using loops in your script? Are these multiple samples that you are iterating using i? If it is a single sample, what about

  1. merging all the BLOW5 files into one BLOW5
  2. combining all FASTQ into one file
  3. merging and sorting all BAM files into one file
  4. Then running a single f5c index followed by an f5c eventalign?

Again, is there a reason why you are going over chromosomes individually, rather than doing it at one?

If you want to keep the loop approach, I suggest at least the following.

  1. create one single BLOW5
  2. create one single FASTQ file
  3. Do the f5c index once
  4. Then iterate through the BAM files while calling f5c eventalign
kir1to455 commented 2 weeks ago

Hi, @hasindu2008

My system is CentOS Linux 7. I only have two samples : input and ip.

merging all the BLOW5 files into one BLOW5 combining all FASTQ into one file merging and sorting all BAM files into one file Then running a single f5c index followed by an f5c eventalign

In fact, I have always used the first method and no bug. Excellent!

Again, is there a reason why you are going over chromosomes individually, rather than doing it at one?

The reason why I want to split the chromosome is because the eventalign file of RNA004 is too large (nearly 1T) , and I want to split it according to the chromosome. To split the chromosome will help me in machine learning.

The approach 1 is use split chromosome (like chr1.fq) mapping to chr1.bam, then run index with merge BLOW5 and run f5c eventalign.

The approach 2 is use whole.fq mapping to whole.bam and run f5c eventalign, then split the whole eventalign file to chrom.eventalign.tsv.

It looked like the first method might take less time, so I tried the first one. I'll try the second approach.

Best wishes, Kirito

hasindu2008 commented 2 weeks ago

Hello,

Sorry, it wasn't very clear to me and I could have misunderstood what you told me.

  1. Are you saying that the "NMI" system error comes only when you run the code with loops you shared and not when you run without loops on a merged file?
  2. To get chromosomes separately, you do not have to complicate it by splitting FASTQ files. You might have unintentionally introduced a bug when splitting and this could be why you are getting that error "some reads were not found in the file". You can simply rely on one single FASTQ, BLOW and BAM file and can use the -w option in f5c to provide a region. You could use this -w to loop through (let us call this approach 3)
    • create a single merged BLOW5 file
    • create a single merged FASTQ file
    • create a single merged sorted BAM file
    • Run f5c index on the single merged FASTQ and BLOW5 file
    • then in a loop call f5c eventalign with -w: e.g.:
      for chr in chr1 chr2 chr3 chr 4
      do
      f5c eventalign --reads merged.fastq --bam merged.bam -genome gencode.vM33.normal.transcripts.fa --slow5 merged.blow5 -t 30 --kmer-model rna004.nucleotide.5mer.model --min-mapq 0 --secondary=no --rna --signal-index --scale-events --collapse-events --samples -B 14M -K 1024 --cuda-dev-id 0 --summary ${chr}_nanopolish.summary.txt -w ${chr} | pigz > ${chr}.eventalign.tsv.gz
      done
kir1to455 commented 2 weeks ago

Hi, @hasindu2008

Are you saying that the "NMI" system error comes only when you run the code with loops you shared and not when you run without loops on a merged file?

Yes, when I split chromosome to chr1...chrM, not all chromosomes will have eventalign errors. Sometimes may be Chr5 or Chr 7 and so on will popped up this bug and crashed. I think this may be because the f5c thread is not finding the read id when looking for it, causing the driver to soft lock.

Approach 3 is a nice approach. However, my fastq files were mapping to transcriptome alignments. I split the bam file of the transcriptome according to the correspondence between the gencode transcripts and chromosomes. -w option should only be used for genome alignments.

hasindu2008 commented 2 weeks ago

Ah, I see. But how it would be possible that the read ID is in the BAM file, but not in the corresponding FASTQ file? There are cases when a read is in the FASTQ, but not in BAM. However, if you aligned a FASTQ file that created a BAM, there should be NO new readIDs in the BAM. How did you generate these BAM files?

kir1to455 commented 2 weeks ago

Hi, @hasindu2008

I used minimap2 to align fastq mapping to the transcriptome bam file.

Then I split the transcriptome bam file into chr.sorted.bam.

Final, I used samtools fastq to convert the chr.sorted.bam files into chr.fastq files and run index with merge.blow5.

Here is my code : ${DoradoDir}/dorado aligner --mm2-opts "-k 15 -w 10 --secondary=no" -t 50 ${indexDir}/gencode.vM33.normal.transcripts.fa ${FastqDir}/input_merge_sup.fastq > ${BamDir}/input_merge.vM33.transcriptome.dorado.bam samtools sort -@ 40 ${BamDir}/input_merge.vM33.transcriptome.dorado.bam > ${BamDir}/input_merge.vM33.transcriptome.dorado.sorted.bam

python /public2/hjliang/Nanopore/script/transbam_to_chr_bam.py ${bamdir}/input_merge.vM33.transcriptome.dorado.sorted.bam Bamfiles=(find ${split_bamdir}/input -name "input_merge_sup_chr*.bam") for bfile in ${Bamfiles[@]}; do echo $bfile chr=basename $bfile .sorted.bam samtools sort -@ 30 $bfile -o ${chr}.sorted.bam samtools index -@ 30 ${chr}.sorted.bam samtools fastq -@ 40 ${chr}.sorted.bam > ${chr}.fastq ${f5c_dir}/f5c index -t 40 ${chr}.fastq --slow5 ${blow5_dir}/input_sup.merge.blow5 done

Does samtools fastq here generate a different read id from the bam file?

hasindu2008 commented 2 weeks ago

samtools fastq does not mess up the readIDs, but the use of samtools fastq here is not suitable as there could be supplementary alignments that can make duplicate entries in FASTQ, and also hard clipping-related issues. You can instead give the merged FASTQ file to f5c. Also, in the above commands, I do not see a minimap2 command. Could you try using actual minimap2 rather than dorado-aligner and f5c on the whole FASTQ file in the following fashion?

minimap2 -ax map-ont --secondary=no -t 50 -uf gencode.vM33.normal.transcripts.fa  input_merge_sup.fastq  | samtools sort -@ 40 - -o input_merge.vM33.transcriptome.bam
f5c index -t 40 input_merge_sup.fastq--slow5 input_sup.merge.blow5

python transbam_to_chr_bam.py input_merge.vM33.transcriptome.dorado.sorted.bam
Bamfiles=(find ${split_bamdir}/input -name  "input_merge_sup_chr*.bam")
for bfile in ${Bamfiles[@]}; do
  echo $bfile
  chr=basename $bfile .sorted.bam
  samtools sort -@ 30 $bfile -o ${chr}.sorted.bam 
  samtools index -@ 30 ${chr}.sorted.bam
  f5c eventalign --reads input_merge_sup.fastq --bam ${chr}.sorted.bam  --genome gencode.vM33.normal.transcripts.fa --slow5 input_sup.merge.blow5 -t 30 --kmer-model rna004.nucleotide.5mer.model --min-mapq 0 --secondary=no --rna --signal-index --scale-events --collapse-events --samples -B 14M -K 1024 --cuda-dev-id 0 --summary ${chr}_nanopolish.summary.txt | pigz > ${chr}.eventalign.tsv.gz 
done