tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0
78 stars 12 forks source link

Nanocompore SampComp stuck at 0% #222

Open sidizhao opened 1 year ago

sidizhao commented 1 year ago

Describe the bug Hi, I've been trying to run SampComp on 6 samples of ONT Direct RNA-seq on a METTL3 KD cell line for some time, and have yet to get past the "parse transcript" step. I have 512G of RAM requested to run this and it just gets stuck for multiple days. Here's the log:

2023-07-08T01:55:22.910444+0000 WARNING - MainProcess | Running SampComp
2023-07-08T01:55:22.910969+0000 INFO - MainProcess | Checking and initialising SampComp
2023-07-08T01:55:22.911463+0000 DEBUG - MainProcess |   package_name: nanocompore
2023-07-08T01:55:22.911790+0000 DEBUG - MainProcess |   package_version: 1.0.4
2023-07-08T01:55:22.912016+0000 DEBUG - MainProcess |   timestamp: 2023-07-08 01:55:22.912005
2023-07-08T01:55:22.912189+0000 DEBUG - MainProcess |   fasta_fn: /storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/all-chrs.fa
2023-07-08T01:55:22.912390+0000 DEBUG - MainProcess |   outpath: /storage1/fs1/christophermaher/Active/maherlab/getian/pr1/results/test
2023-07-08T01:55:22.912605+0000 DEBUG - MainProcess |   outprefix: out
2023-07-08T01:55:22.913050+0000 DEBUG - MainProcess |   overwrite: True
2023-07-08T01:55:22.913381+0000 DEBUG - MainProcess |   comparison_methods: GMM,KS
2023-07-08T01:55:22.913537+0000 DEBUG - MainProcess |   logit: True
2023-07-08T01:55:22.913793+0000 DEBUG - MainProcess |   anova: False
2023-07-08T01:55:22.913928+0000 DEBUG - MainProcess |   allow_warnings: True
2023-07-08T01:55:22.914056+0000 DEBUG - MainProcess |   sequence_context: 0
2023-07-08T01:55:22.914176+0000 DEBUG - MainProcess |   sequence_context_weights: uniform
2023-07-08T01:55:22.914547+0000 DEBUG - MainProcess |   min_coverage: 30
2023-07-08T01:55:22.914672+0000 DEBUG - MainProcess |   min_ref_length: 100
2023-07-08T01:55:22.914795+0000 DEBUG - MainProcess |   downsample_high_coverage: 5000
2023-07-08T01:55:22.914919+0000 DEBUG - MainProcess |   max_invalid_kmers_freq: 0.1
2023-07-08T01:55:22.915044+0000 DEBUG - MainProcess |   select_ref_id: []
2023-07-08T01:55:22.915164+0000 DEBUG - MainProcess |   exclude_ref_id: []
2023-07-08T01:55:22.915294+0000 DEBUG - MainProcess |   nthreads: 3
2023-07-08T01:55:22.915415+0000 DEBUG - MainProcess |   progress: True
2023-07-08T01:55:22.962418+0000 DEBUG - MainProcess | OrderedDict([('ctrl', {'ctrl_1': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAR62703/FAR62703_eventalign_collapse.tsv', 'ctrl_2': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAU62362/FAU62362_eventalign_collapse.tsv'}), ('kd', {'kd_1': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAU65190/FAU65190_eventalign_collapse.tsv', 'kd_2': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/PAM12362/PAM12362_eventalign_collapse.tsv', 'kd_3': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAS65798/FAS65798_eventalign_collapse.tsv', 'kd_4': '/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAS65713/FAS65713_eventalign_collapse.tsv'})])
2023-07-08T01:55:23.258677+0000 INFO - MainProcess | Reading eventalign index files
2023-07-08T01:55:59.601494+0000 DEBUG - MainProcess |   Condition:ctrl Sample:ctrl_1    High fraction of invalid kmers: 1,437,090   valid reads: 418,498
2023-07-08T01:56:29.285645+0000 DEBUG - MainProcess |   Condition:ctrl Sample:ctrl_2    High fraction of invalid kmers: 1,170,379   valid reads: 335,198
2023-07-08T01:57:09.931337+0000 DEBUG - MainProcess |   Condition:kd Sample:kd_1    High fraction of invalid kmers: 1,572,437   valid reads: 485,752
2023-07-08T02:00:49.691289+0000 DEBUG - MainProcess |   Condition:kd Sample:kd_2    High fraction of invalid kmers: 8,258,366   valid reads: 2,978,007
2023-07-08T02:01:12.363506+0000 DEBUG - MainProcess |   Condition:kd Sample:kd_3    High fraction of invalid kmers: 880,577 valid reads: 302,132
2023-07-08T02:01:40.454853+0000 DEBUG - MainProcess |   Condition:kd Sample:kd_4    High fraction of invalid kmers: 948,443 valid reads: 309,549
2023-07-08T02:01:40.455894+0000 INFO - MainProcess |    References found in index: 203
2023-07-08T02:01:40.456229+0000 INFO - MainProcess | Filtering out references with low coverage
2023-07-08T02:01:41.073182+0000 DEBUG - MainProcess |   kd kd_2 Reads: 218,264  ctrl ctrl_1 Reads: 146,950  kd kd_1 Reads: 144,722  ctrl ctrl_2 Reads: 136,951  kd kd_3 Reads: 132,693  kd kd_4 Reads: 130,008  invalid_ref_id: 137 valid_ref_id: 66
2023-07-08T02:01:41.073543+0000 INFO - MainProcess |    References remaining after reference coverage filtering: 66
2023-07-08T02:01:45.479003+0000 INFO - MainProcess | Starting data processing
2023-07-08T02:01:45.584960+0000 DEBUG - Process-1 | Adding chr1 to in_q
2023-07-08T02:01:45.675927+0000 DEBUG - Process-2 | Worker thread started
2023-07-08T02:01:45.621621+0000 DEBUG - Process-1 | Adding chr10 to in_q
2023-07-08T02:01:46.053949+0000 DEBUG - Process-1 | Adding chr11 to in_q
2023-07-08T02:01:46.297462+0000 DEBUG - Process-1 | Adding chr12 to in_q
2023-07-08T02:01:46.342287+0000 DEBUG - Process-1 | Adding chr13 to in_q
2023-07-08T02:01:46.370333+0000 DEBUG - Process-1 | Adding chr14 to in_q
2023-07-08T02:01:46.414894+0000 DEBUG - Process-1 | Adding chr15 to in_q
2023-07-08T02:01:46.446543+0000 DEBUG - Process-1 | Adding chr16 to in_q
2023-07-08T02:01:46.498344+0000 DEBUG - Process-1 | Adding chr16_KI270728v1_random to in_q
2023-07-08T02:01:46.549909+0000 DEBUG - Process-1 | Adding chr17 to in_q
2023-07-08T02:01:46.589211+0000 DEBUG - Process-1 | Adding chr18 to in_q
2023-07-08T02:01:46.624302+0000 DEBUG - Process-1 | Adding chr19 to in_q
2023-07-08T02:01:46.653000+0000 DEBUG - Process-1 | Adding chr2 to in_q
2023-07-08T02:01:46.703102+0000 DEBUG - Process-1 | Adding chr20 to in_q
2023-07-08T02:01:46.747155+0000 DEBUG - Process-1 | Adding chr21 to in_q
2023-07-08T02:01:46.791885+0000 DEBUG - Process-1 | Adding chr22 to in_q
2023-07-08T02:01:46.839534+0000 DEBUG - Process-1 | Adding chr22_KI270733v1_random to in_q
2023-07-08T02:01:46.877947+0000 DEBUG - Process-1 | Adding chr3 to in_q
2023-07-08T02:01:46.922850+0000 DEBUG - Process-1 | Adding chr4 to in_q
2023-07-08T02:01:46.961869+0000 DEBUG - Process-1 | Adding chr5 to in_q
2023-07-08T02:01:46.991997+0000 DEBUG - Process-1 | Adding chr6 to in_q
2023-07-08T02:01:47.070883+0000 DEBUG - Process-2 | Worker thread processing new item from in_q: chr1
2023-07-08T02:01:47.065802+0000 DEBUG - Process-1 | Adding chr7 to in_q
2023-07-08T02:01:47.098081+0000 DEBUG - Process-1 | Adding chr8 to in_q
2023-07-08T02:01:47.098439+0000 DEBUG - Process-1 | Adding chr9 to in_q
2023-07-08T02:01:47.098590+0000 DEBUG - Process-1 | Adding chr4_GL000257v2_alt to in_q
2023-07-08T02:01:47.098738+0000 DEBUG - Process-1 | Adding chr5_KI270791v1_alt to in_q
2023-07-08T02:01:47.098868+0000 DEBUG - Process-1 | Adding chr5_GL339449v2_alt to in_q
2023-07-08T02:01:47.099001+0000 DEBUG - Process-1 | Adding chr6_GL000250v2_alt to in_q
2023-07-08T02:01:47.099134+0000 DEBUG - Process-1 | Adding chr7_KI270806v1_alt to in_q
2023-07-08T02:01:47.099274+0000 DEBUG - Process-1 | Adding chr7_GL383534v2_alt to in_q
2023-07-08T02:01:47.099406+0000 DEBUG - Process-1 | Adding chr8_KI270817v1_alt to in_q
2023-07-08T02:01:47.099532+0000 DEBUG - Process-1 | Adding chr15_GL383554v1_alt to in_q
2023-07-08T02:01:47.099654+0000 DEBUG - Process-1 | Adding chr15_KI270849v1_alt to in_q
2023-07-08T02:01:47.099776+0000 DEBUG - Process-1 | Adding chr16_KI270855v1_alt to in_q
2023-07-08T02:01:47.099896+0000 DEBUG - Process-1 | Adding chr16_KI270853v1_alt to in_q
2023-07-08T02:01:47.100018+0000 DEBUG - Process-1 | Adding chr17_KI270857v1_alt to in_q
2023-07-08T02:01:47.100149+0000 DEBUG - Process-1 | Adding chr17_GL000258v2_alt to in_q
2023-07-08T02:01:47.100283+0000 DEBUG - Process-1 | Adding chr19_KI270868v1_alt to in_q
2023-07-08T02:01:47.100407+0000 DEBUG - Process-1 | Adding chr19_KI270866v1_alt to in_q
2023-07-08T02:01:47.100528+0000 DEBUG - Process-1 | Adding chr19_GL949746v1_alt to in_q
2023-07-08T02:01:47.100648+0000 DEBUG - Process-1 | Adding chr21_GL383581v2_alt to in_q
2023-07-08T02:01:47.100767+0000 DEBUG - Process-1 | Adding chr22_KI270879v1_alt to in_q
2023-07-08T02:01:47.100885+0000 DEBUG - Process-1 | Adding chr22_GL383582v2_alt to in_q
2023-07-08T02:01:47.101000+0000 DEBUG - Process-1 | Adding chr5_KI270897v1_alt to in_q
2023-07-08T02:01:47.101115+0000 DEBUG - Process-1 | Adding chr6_GL000251v2_alt to in_q
2023-07-08T02:01:47.101233+0000 DEBUG - Process-1 | Adding chr15_KI270905v1_alt to in_q
2023-07-08T02:01:47.101363+0000 DEBUG - Process-1 | Adding chr17_KI270908v1_alt to in_q
2023-07-08T02:01:47.101481+0000 DEBUG - Process-1 | Adding chr19_GL949747v2_alt to in_q
2023-07-08T02:01:47.101598+0000 DEBUG - Process-1 | Adding chr22_KB663609v1_alt to in_q
2023-07-08T02:01:47.101713+0000 DEBUG - Process-1 | Adding chr6_GL000252v2_alt to in_q
2023-07-08T02:01:47.101829+0000 DEBUG - Process-1 | Adding chr19_GL949748v2_alt to in_q
2023-07-08T02:01:47.101956+0000 DEBUG - Process-1 | Adding chr22_KI270928v1_alt to in_q
2023-07-08T02:01:47.102074+0000 DEBUG - Process-1 | Adding chr6_GL000253v2_alt to in_q
2023-07-08T02:01:47.102191+0000 DEBUG - Process-1 | Adding chr19_GL949749v2_alt to in_q
2023-07-08T02:01:47.102318+0000 DEBUG - Process-1 | Adding chr6_GL000254v2_alt to in_q
2023-07-08T02:01:47.102434+0000 DEBUG - Process-1 | Adding chr19_GL949750v2_alt to in_q
2023-07-08T02:01:47.102550+0000 DEBUG - Process-1 | Adding chr6_GL000255v2_alt to in_q
2023-07-08T02:01:47.102667+0000 DEBUG - Process-1 | Adding chr19_GL949751v2_alt to in_q
2023-07-08T02:01:47.102788+0000 DEBUG - Process-1 | Adding chr6_GL000256v2_alt to in_q
2023-07-08T02:01:47.102910+0000 DEBUG - Process-1 | Adding chr19_GL949752v1_alt to in_q
2023-07-08T02:01:47.103025+0000 DEBUG - Process-1 | Adding chr19_GL949753v2_alt to in_q
2023-07-08T02:01:47.103144+0000 DEBUG - Process-1 | Adding chr19_KI270938v1_alt to in_q
2023-07-08T02:01:47.103263+0000 DEBUG - Process-1 | Adding chrM to in_q
2023-07-08T02:01:47.103387+0000 DEBUG - Process-1 | Adding chrUn_GL000220v1 to in_q
2023-07-08T02:01:47.103511+0000 DEBUG - Process-1 | Adding chrX to in_q
2023-07-08T02:01:47.103630+0000 DEBUG - Process-1 | Adding chrY to in_q
2023-07-08T02:01:47.103761+0000 DEBUG - Process-1 | Parsed transcripts:66
  0%|          | 0/66 [00:00<?, ? Processed References/s]

To Reproduce I ran a bash script based on our linux computing cluster using the docker image quay.io/biocontainers/nanocompore:1.0.4--pyhdfd78af_0. Here's the commad:

cmd="nanocompore sampcomp \
    --file_list1 /storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAR62703/FAR62703_eventalign_collapse.tsv,/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAU62362/FAU62362_eventalign_collapse.tsv \
    --file_list2 /storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAU65190/FAU65190_eventalign_collapse.tsv,/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/PAM12362/PAM12362_eventalign_collapse.tsv,/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAS65798/FAS65798_eventalign_collapse.tsv,/storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/nanocompore_files/nanocompore_output/FAS65713/FAS65713_eventalign_collapse.tsv --label1 ctrl \
    --label2 kd \
    --fasta /storage1/fs1/christophermaher/Active/maherlab/sidizhao/circ_rna/long_read/m6a_methlytion_direct_rna/all-chrs.fa \
    --outpath /storage1/fs1/christophermaher/Active/maherlab/getian/pr1/results/test --progress --overwrite --allow_warnings --log_level debug"

Would you be able to help me? Thank you.

lmulroney commented 1 year ago

Hi @sidizhao,

It looks like you ran nanocompore on a genomic reference instead of a transcriptomic reference. Sometimes Nanocompore can stall when the reference sequences are super long (greater than 50kb), and this is likely the reason that you're experiencing a long execution time. You can either kill the process and start it again and see if it gets through the stall that way (this sometimes works and we don't know why), or start the whole pipeline over again aligning to a transcriptome reference fasta. Given you have all the data together, it might be worth simply restarting it and seeing if that works, but I suspect that redoing the pipeline with a transcriptome reference is better. If you provide SampComp a bed file, it will do an internal liftover from transcriptome reference coordinates to genome reference coordinates.

I hope this helps, Logan

sidizhao commented 1 year ago

Hi,

Thank you for the prompt response. By transcriptomic reference, do you mean only the exonic regions of the fasta file? Or if I were to provide a bed file, what should the bed file contain? Just trying to clarify.

On Tue, Jul 11, 2023 at 04:48 lmulroney @.***> wrote:

Hi @sidizhao https://github.com/sidizhao,

It looks like you ran nanocompore on a genomic reference instead of a transcriptomic reference. Sometimes Nanocompore can stall when the reference sequences are super long (greater than 50kb), and this is likely the reason that you're experiencing a long execution time. You can either kill the process and start it again and see if it gets through the stall that way (this sometimes works and we don't know why), or start the whole pipeline over again aligning to a transcriptome reference fasta. Given you have all the data together, it might be worth simply restarting it and seeing if that works, but I suspect that redoing the pipeline with a transcriptome reference is better. If you provide SampComp a bed file, it will do an internal liftover from transcriptome reference coordinates to genome reference coordinates.

I hope this helps, Logan

— Reply to this email directly, view it on GitHub https://github.com/tleonardi/nanocompore/issues/222#issuecomment-1630513841, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKH54EOAWW3LSH6CN3NBS6LXPUOO5ANCNFSM6AAAAAA2EZ4I7I . You are receiving this because you were mentioned.Message ID: @.***>

lmulroney commented 1 year ago

Hi @sidizhao,

Yes, by transcriptomic reference I mean a reference fasta file of each contiguous transcript isoform with no introns present, and one reference sequence per isoform. This is the download link for the genocde transcriptome reference fasta. https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.transcripts.fa.gz

Alternatively, you can create a reference transcriptome fasta from the reference genome and a gtf file using bedtools get fasta.

And by bed file, I mean a bed file that matches the transcriptome reference fasta file in genomic coordinates. So you can use something like bedparse (https://github.com/tleonardi/bedparse) to convert a gtf file to bed12 format. You can find the gencode reference gtf file on the gencode home page.

Does this make sense? Logan

sidizhao commented 1 year ago

Yes. Thank you. Would it work if I keep the current genomic fasta file but add a bed12 file of only the transcripts? Or do I necessarily need to download the transcripts only fasta?

lmulroney commented 1 year ago

You essentially need to start over from the minimap2 step using the transcriptome reference fasta instead of the genome reference fasta. This will require that you redo eventalign and eventalign collapse as well from this new bam file. Importantly, you do not want to align in splice aware mode when using a transcriptome reference fasta.

You can find more detailed instructions here ( https://doi.org/10.1002/cpz1.683) or here (https://nanocompore.rna.rocks/) if you want a more in depth breakdown of the steps.

Let me know if you have more questions.

Logan

sidizhao commented 1 year ago

Oh wow, I see. That is going to take a while since these Direct RNA-seq files take a long time on nanopolish. I will come back with more questions if it still doesn't work. Thank you so much.

lmulroney commented 1 year ago

You can try using f5c instead of nanopolish. It is a c implementation of nanopolish and is roughly 10 times faster. There are a few flags you need to use that are unique to f5c that are not used by nanopolish. The protocol paper I posted earlier goes through all the necessary differences using f5c compared to nanopolish if you decide to give it a try.

Briefly, you need to use --rna --min-mapq=0 --secondary=yes in addition to all the normal nanopolish commands

But I'm doing this from memory, so double check the help messages to make sure I have the spelling correct!!!

Logan

sidizhao commented 1 year ago

Thank you so much! We’ve been using slow5tools to process the fast5 files first before nanopolish, and it’s been decently fast.

On Tue, Jul 11, 2023 at 13:12 lmulroney @.***> wrote:

You can try using f5c instead of nanopolish. It is a c implementation of nanopolish and is roughly 10 times faster. There are a few flags you need to use that are unique to f5c that are not used by nanopolish. The protocol paper I posted earlier goes through all the necessary differences using f5c compared to nanopolish if you decide to give it a try.

Briefly, you need to use --rna --min-mapq=0 --secondary=yes in addition to all the normal nanopolish commands

But I'm doing this from memory, so double check the help messages to make sure I have the spelling correct!!!

Logan

— Reply to this email directly, view it on GitHub https://github.com/tleonardi/nanocompore/issues/222#issuecomment-1631273698, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKH54EJK2IKBOTGTNDR5B5TXPWJQ3ANCNFSM6AAAAAA2EZ4I7I . You are receiving this because you were mentioned.Message ID: @.***>