hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
144 stars 26 forks source link

f5c error for m6anet #154

Closed josiegleeson closed 8 months ago

josiegleeson commented 9 months ago

Hi Hasindu,

I'm trying to use the latest version of m6anet for the new RNA004 kits. I've been following their instructions in the manual here: https://m6anet.readthedocs.io/en/latest/release_notes.html

However, when I am trying to run m6anet's inference step, some lines in my f5c eventalign output are causing an error as follows: pandas.errors.ParserError: Error tokenizing data. C error: Expected 15 fields in line 7517322566, saw 17

I am running f5c (with CPUs) as follows: f5c eventalign --rna -b transcriptomic_aln.bam \ -r pass.fastq \ -g gencode.v31.transcripts.fa \ -o eventalign.tsv \ --kmer-model rna004.nucleotide.5mer.model \ -x hpc \ -t 24 \ --slow5 merged.blow5 \ --signal-index \ --scale-events

There were 4 lines in my eventalign file that look strange:

Line 1: ENST00000269142.9|ENSG00000141384.12|OTTHUMG00000179429.3|OTTHUMT00000446260.3|TAF4B-201|TAF4B|5260|protein_coding| 589 TCENST00000269142.9|ENSG00000141384.12|OTTHUMG00000179429.3|OTTHUMT00000446260.3|TAF4B-201|TAF4B|5260|protein_coding| 561 CGCGT 18931981 t 11 83.32 6.641 0.00225 CGCGT 76.30 3.00 1.80 125102 125111 Line 2: ENST00000269142.9|ENSG00000141384.12|OTTHUMG00000179429.3|OTTHUMT00000446260.3|TAF4B-201|TAF4B|5260|protein_coding| 590 CGENST00000269142.9|ENSG00000141384.12|OTTHUMG00000179429.3|OTTHUMT00000446260.3|TAF4B-201|TAF4B|5260|protein_coding| 615 GAGCA 18931982 t 13 103.21 2.427 0.00525 GAGCA 99.77 3.00 0.87 113992 114013 Line 3: ENST00000579973.5|ENSG00000134504.13|OTTHUMG00000131947.6|OTTHUMT00000254906.2|KCTD1-206|KCTD1|1747|protein_coding| 159 CTAGA 18932128 t 47 103.3ENST00000579973.5|ENSG00000134504.13|OTTHUMG00000131947.6|OTTHUMT00000254906.2|KCTD1-206|KCTD1|1747|protein_coding| 125 GACAG 18932132 t 2 79.81 1.487 0.00375 GACAG 75.65 3.00 1.04 87063 87078 Line 4: ENST00000579973.5|ENSG00000134504.13|OTTHUMG00000131947.6|OTTHUMT00000254906.2|KCTD1-206|KCTD1|1747|protein_coding| 126 ACAGT 18932137 t 49 99.37 5.052 0.00ENST00000579973.5|ENSG00000134504.13|OTTHUMG00000131947.6|OTTHUMT00000254906.2|KCTD1-206|KCTD1|1747|protein_coding| 126 ACAGT 18932139 t 12 99.60 8.197 0.00275 ACAGT 104.38 3.00 -1.17 87680 87691

Its like the lines should have continued but the next lines started writing in weird places. For example, in the first two lines above, the kmer gets cutoff after two nucleotides and the next transcript id begins.

In the third line, the event_level_mean column contains the next lines transcript id, and in the fourth line the event_length column contains the next lines transcript id. For now I can try and manually remove these lines, but considering my eventalign file is ~2TB, this is quite time consuming and I have to run it only many samples.

Thanks for any help, much appreciated!

hasindu2008 commented 9 months ago

Hi @josiegleeson This is a very strange potential bug that I have never seen before in f5c. Can I know which version of f5c you are using?

josiegleeson commented 9 months ago

Sure, its 1.3. Built in a conda env. My eventalign contains around 10 million reads so maybe its a memory issue or something to do with the HPC I am using as well...

Out of curiosity, because these files get so big for m6A calling, is it possible to only output the eventalign for a list of kmers? Considering m6A is only found in DRACH motifs and m6anet only calls m6A within these kmers anyway. Just a thought to reduce file size...

Thanks for your help

hasindu2008 commented 9 months ago

10M reads is a typical promethION run, so this should be a bug. I will try to reproduce this issue by executing on large datasets like that.

Yes, there are ways to filter out kmers by using bash commands. Create a list of kmers you need into a file for instance called kmers.txt:

ACGTA
AAAAA
TATCA

Then say you have the large f5c event align file called events.tsv. First, extract the header, and then use grep to filter rout what you need:

head -1 eventalign.tsv> new.tsv 
grep -F -f kmers.txt eventalign.tsv>> new.tsv

Also, in that large file, any garbled rows you can remove by using awk to match those with only 15 columns:

awk '{if(NF==15) print}' eventalign.tsv > fixed.tsv

Meanwhile, the following would be helpful so that I could find the bug.

Could you do the following to see how many rows are garbled?

awk '{if(NF!=15) print}' eventalign.tsv | wc -l

Also, does this corrupted output appear if you rerun the f5c? Do you have the log file generated by f5c which you can share so I can inspect to see if there are any warnings?

josiegleeson commented 9 months ago

Hi,

Yep that's the exact command I ran above to find those lines, so garbled lines = 4. I think I can just filter them out but this takes a lot of time on my HPC as its such a big file.

That's a good idea, I'll rerun f5c and see if the exact same lines are garbled. It just takes a little while so I'll get it started now. I'll rerun with the same index first, and then also re-index and rerun.

I can't seem to see a log file unfortunately, I'll try and find this setting and rerun now.

Thanks!

hasindu2008 commented 9 months ago

What i meant by the log file is the standard error output.

josiegleeson commented 9 months ago

Ok I've found it:

[pthread_processor::83662.4111.48] 512 Entries (0.2M bases) processed [meth_main::83668.1761.48] 231 Entries (0.2M bases) loaded [pthread_processor::83668.335*1.48] 231 Entries (0.2M bases) processed

[meth_main] skipped unmapped: 0, skipped secondary: 0, skipped low_mapq: 19903034 [meth_main] total entries: 6426855, qc fail: 2418, could not calibrate: 32348, no alignment: 69201, bad reads: 0 [meth_main] total bases: 9555.6 Mbases [meth_main] Data loading time: 4705.203 sec [meth_main] - bam load time: 170.210 sec [meth_main] - fasta load time: 3777.178 sec [meth_main] - slow5 load time: 726.942 sec [meth_main] Data processing time: 15087.642 sec [meth_main] Data output time: 73842.480 sec [meth_main::WARNING]ESC[1;33m Skipped 19903034 reads with MAPQ < 20. Use --min-mapq to change the threshold.ESC[0m [main] Version: 1.3 [main] CMD: f5c eventalign --rna -b transcriptomic_aln.bam -r pass.fastq -g gencode.v31.transcripts.fa -o eventalign.tsv --kmer-model bin/rna004.nucleotide.5mer.model --slow5 merged.blow5 --signal-index --scale-events [main] Real time: 83668.914 sec; CPU time: 123951.777 sec; Peak RAM: 6.223 GB

There are many ...loaded and ...processed lines above this.

hasindu2008 commented 9 months ago

Is there a warning somewhere in the middle? You could attach the file and I can have a look.

josiegleeson commented 9 months ago

slurm-55049209.out.txt

hasindu2008 commented 9 months ago

Looking at the log, there is nothing either. I am now running on some large datasets to see. You also could try running agian if you wish. By the way, while not related to the issue I note the following:

  1. You are not specifying -t, so f5c is running with default 8 threads. You have allocated 24 CPUs in the job, so please specify -t24 so that f5c is utilising all cores and is potentially 3 times faster.
  2. 19,903,034 alignment records have been skipped because <20 mapq as reported by the read aligner and only 6,426,855 > 20 MAPQ. If you want more sensitivity, you may specify --min-mapq to a lower threshold
  3. As the file size is quite large, just in case this rare garbling happens again consider piping the output like (until I reproduce and fix the bug):
    f5c eventalign --rna -b aln.bam -r pass.fastq -g transcripts.fa  --kmer-model model --slow5 file.blow5 --signal-index --scale-events | awk '{if(NF==15){ print > "goodfile.tsv" } else {print > "badfile.tsv" } }' 

    This way, any bad entries (if any) will be written to badfile.tsv. Let meknow if you keep seeing such records for a scond time.

Furthermore, you can even pipe the output further to any other tools that support reading from standard in:

f5c eventalign --rna -b aln.bam -r pass.fastq -g transcripts.fa  --kmer-model model --slow5 file.blow5 --signal-index --scale-events | awk '{if(NF==15){ print  } else {print > "badfile.tsv" } }' | toool1 ... | tool2 ...

This way, writing whole big files can be completely eliminated and that is actually how f5c eventalign output is meant to be consumed.

f5c also can output the alignments in a compact sam format (--sam) which substantially reduces the file size. However ever, downstream tools like m6Anet yet doesn't support it I think.

josiegleeson commented 9 months ago

Ok great, I am re-running but it is still queued unfortunately :(

Ok I think I will lower the mapq threshold. Also by default does it only look at primary alignments? Is nanopolish the same? I am trying to run m6anet in a similar way but just for the updated RNA kits.

With the piping, would this be compatible with m6anet's dataprep step: m6anet dataprep --eventalign /path/to/m6anet/m6anet/tests/data/eventalign.txt \ --out_dir /path/to/output --n_processes 4

Thanks

hasindu2008 commented 9 months ago

By default f5c only processes primary and supplementary alignments. If you also want to include secondary alignments, you may give --secondary=yes. But my feeling is that secondary mappings must be anyway not included in the downstream analysis, generally.

Nanopolish latest I am not sure what the default options are as they might have changed from when I last looked at when implementing f5c. Given the new kit RNA004 comes with a new flowcell, signal properties have changed, so executing with same parameters as used for RNA002 may not be the ideal.

@GoekeLab and @yuukiiwa should be able to advice the best mapq cutoff for f5c for m6Anet when processing RNA004 data, as well as if m6anet's dataprep step can directly read from stdin.

yuukiiwa commented 9 months ago

Hi @hasindu2008 and @josiegleeson,

We find --min-mapq 0 works well for RNA004.

Thanks!

Best wishes, Yuk Kei

josiegleeson commented 9 months ago

Hi Hasindu,

Apologies for the delay! I re-ran f5c and I can't see these bad lines anymore.. I don't know if it just randomly occurred or maybe its to do with our HPC but it seems fine. I'll use the work around checking for bad lines in case for now but it should be all ok.

Thanks for your help!