Closed josiegleeson closed 8 months ago
Hi @josiegleeson This is a very strange potential bug that I have never seen before in f5c. Can I know which version of f5c you are using?
Sure, its 1.3. Built in a conda env. My eventalign contains around 10 million reads so maybe its a memory issue or something to do with the HPC I am using as well...
Out of curiosity, because these files get so big for m6A calling, is it possible to only output the eventalign for a list of kmers? Considering m6A is only found in DRACH motifs and m6anet only calls m6A within these kmers anyway. Just a thought to reduce file size...
Thanks for your help
10M reads is a typical promethION run, so this should be a bug. I will try to reproduce this issue by executing on large datasets like that.
Yes, there are ways to filter out kmers by using bash commands. Create a list of kmers you need into a file for instance called kmers.txt:
ACGTA
AAAAA
TATCA
Then say you have the large f5c event align file called events.tsv. First, extract the header, and then use grep to filter rout what you need:
head -1 eventalign.tsv> new.tsv
grep -F -f kmers.txt eventalign.tsv>> new.tsv
Also, in that large file, any garbled rows you can remove by using awk to match those with only 15 columns:
awk '{if(NF==15) print}' eventalign.tsv > fixed.tsv
Meanwhile, the following would be helpful so that I could find the bug.
Could you do the following to see how many rows are garbled?
awk '{if(NF!=15) print}' eventalign.tsv | wc -l
Also, does this corrupted output appear if you rerun the f5c? Do you have the log file generated by f5c which you can share so I can inspect to see if there are any warnings?
Hi,
Yep that's the exact command I ran above to find those lines, so garbled lines = 4. I think I can just filter them out but this takes a lot of time on my HPC as its such a big file.
That's a good idea, I'll rerun f5c and see if the exact same lines are garbled. It just takes a little while so I'll get it started now. I'll rerun with the same index first, and then also re-index and rerun.
I can't seem to see a log file unfortunately, I'll try and find this setting and rerun now.
Thanks!
What i meant by the log file is the standard error output.
Ok I've found it:
[pthread_processor::83662.4111.48] 512 Entries (0.2M bases) processed [meth_main::83668.1761.48] 231 Entries (0.2M bases) loaded [pthread_processor::83668.335*1.48] 231 Entries (0.2M bases) processed
[meth_main] skipped unmapped: 0, skipped secondary: 0, skipped low_mapq: 19903034 [meth_main] total entries: 6426855, qc fail: 2418, could not calibrate: 32348, no alignment: 69201, bad reads: 0 [meth_main] total bases: 9555.6 Mbases [meth_main] Data loading time: 4705.203 sec [meth_main] - bam load time: 170.210 sec [meth_main] - fasta load time: 3777.178 sec [meth_main] - slow5 load time: 726.942 sec [meth_main] Data processing time: 15087.642 sec [meth_main] Data output time: 73842.480 sec [meth_main::WARNING]ESC[1;33m Skipped 19903034 reads with MAPQ < 20. Use --min-mapq to change the threshold.ESC[0m [main] Version: 1.3 [main] CMD: f5c eventalign --rna -b transcriptomic_aln.bam -r pass.fastq -g gencode.v31.transcripts.fa -o eventalign.tsv --kmer-model bin/rna004.nucleotide.5mer.model --slow5 merged.blow5 --signal-index --scale-events [main] Real time: 83668.914 sec; CPU time: 123951.777 sec; Peak RAM: 6.223 GB
There are many ...loaded and ...processed lines above this.
Is there a warning somewhere in the middle? You could attach the file and I can have a look.
Looking at the log, there is nothing either. I am now running on some large datasets to see. You also could try running agian if you wish. By the way, while not related to the issue I note the following:
f5c eventalign --rna -b aln.bam -r pass.fastq -g transcripts.fa --kmer-model model --slow5 file.blow5 --signal-index --scale-events | awk '{if(NF==15){ print > "goodfile.tsv" } else {print > "badfile.tsv" } }'
This way, any bad entries (if any) will be written to badfile.tsv. Let meknow if you keep seeing such records for a scond time.
Furthermore, you can even pipe the output further to any other tools that support reading from standard in:
f5c eventalign --rna -b aln.bam -r pass.fastq -g transcripts.fa --kmer-model model --slow5 file.blow5 --signal-index --scale-events | awk '{if(NF==15){ print } else {print > "badfile.tsv" } }' | toool1 ... | tool2 ...
This way, writing whole big files can be completely eliminated and that is actually how f5c eventalign output is meant to be consumed.
f5c also can output the alignments in a compact sam format (--sam) which substantially reduces the file size. However ever, downstream tools like m6Anet yet doesn't support it I think.
Ok great, I am re-running but it is still queued unfortunately :(
Ok I think I will lower the mapq threshold. Also by default does it only look at primary alignments? Is nanopolish the same? I am trying to run m6anet in a similar way but just for the updated RNA kits.
With the piping, would this be compatible with m6anet's dataprep step: m6anet dataprep --eventalign /path/to/m6anet/m6anet/tests/data/eventalign.txt \ --out_dir /path/to/output --n_processes 4
Thanks
By default f5c only processes primary and supplementary alignments. If you also want to include secondary alignments, you may give --secondary=yes. But my feeling is that secondary mappings must be anyway not included in the downstream analysis, generally.
Nanopolish latest I am not sure what the default options are as they might have changed from when I last looked at when implementing f5c. Given the new kit RNA004 comes with a new flowcell, signal properties have changed, so executing with same parameters as used for RNA002 may not be the ideal.
@GoekeLab and @yuukiiwa should be able to advice the best mapq cutoff for f5c for m6Anet when processing RNA004 data, as well as if m6anet's dataprep step can directly read from stdin.
Hi @hasindu2008 and @josiegleeson,
We find --min-mapq 0
works well for RNA004.
Thanks!
Best wishes, Yuk Kei
Hi Hasindu,
Apologies for the delay! I re-ran f5c and I can't see these bad lines anymore.. I don't know if it just randomly occurred or maybe its to do with our HPC but it seems fine. I'll use the work around checking for bad lines in case for now but it should be all ok.
Thanks for your help!
Hi Hasindu,
I'm trying to use the latest version of m6anet for the new RNA004 kits. I've been following their instructions in the manual here: https://m6anet.readthedocs.io/en/latest/release_notes.html
However, when I am trying to run m6anet's inference step, some lines in my f5c eventalign output are causing an error as follows: pandas.errors.ParserError: Error tokenizing data. C error: Expected 15 fields in line 7517322566, saw 17
I am running f5c (with CPUs) as follows: f5c eventalign --rna -b transcriptomic_aln.bam \ -r pass.fastq \ -g gencode.v31.transcripts.fa \ -o eventalign.tsv \ --kmer-model rna004.nucleotide.5mer.model \ -x hpc \ -t 24 \ --slow5 merged.blow5 \ --signal-index \ --scale-events
There were 4 lines in my eventalign file that look strange:
Line 1: ENST00000269142.9|ENSG00000141384.12|OTTHUMG00000179429.3|OTTHUMT00000446260.3|TAF4B-201|TAF4B|5260|protein_coding| 589 TCENST00000269142.9|ENSG00000141384.12|OTTHUMG00000179429.3|OTTHUMT00000446260.3|TAF4B-201|TAF4B|5260|protein_coding| 561 CGCGT 18931981 t 11 83.32 6.641 0.00225 CGCGT 76.30 3.00 1.80 125102 125111 Line 2: ENST00000269142.9|ENSG00000141384.12|OTTHUMG00000179429.3|OTTHUMT00000446260.3|TAF4B-201|TAF4B|5260|protein_coding| 590 CGENST00000269142.9|ENSG00000141384.12|OTTHUMG00000179429.3|OTTHUMT00000446260.3|TAF4B-201|TAF4B|5260|protein_coding| 615 GAGCA 18931982 t 13 103.21 2.427 0.00525 GAGCA 99.77 3.00 0.87 113992 114013 Line 3: ENST00000579973.5|ENSG00000134504.13|OTTHUMG00000131947.6|OTTHUMT00000254906.2|KCTD1-206|KCTD1|1747|protein_coding| 159 CTAGA 18932128 t 47 103.3ENST00000579973.5|ENSG00000134504.13|OTTHUMG00000131947.6|OTTHUMT00000254906.2|KCTD1-206|KCTD1|1747|protein_coding| 125 GACAG 18932132 t 2 79.81 1.487 0.00375 GACAG 75.65 3.00 1.04 87063 87078 Line 4: ENST00000579973.5|ENSG00000134504.13|OTTHUMG00000131947.6|OTTHUMT00000254906.2|KCTD1-206|KCTD1|1747|protein_coding| 126 ACAGT 18932137 t 49 99.37 5.052 0.00ENST00000579973.5|ENSG00000134504.13|OTTHUMG00000131947.6|OTTHUMT00000254906.2|KCTD1-206|KCTD1|1747|protein_coding| 126 ACAGT 18932139 t 12 99.60 8.197 0.00275 ACAGT 104.38 3.00 -1.17 87680 87691
Its like the lines should have continued but the next lines started writing in weird places. For example, in the first two lines above, the kmer gets cutoff after two nucleotides and the next transcript id begins.
In the third line, the event_level_mean column contains the next lines transcript id, and in the fourth line the event_length column contains the next lines transcript id. For now I can try and manually remove these lines, but considering my eventalign file is ~2TB, this is quite time consuming and I have to run it only many samples.
Thanks for any help, much appreciated!