tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0

[File Size and Run time] Nanocompore for 4 million reads #174

Closed wososa closed 3 years ago

wososa commented 3 years ago

Describe the bug
I am using Nanocompore to compare the RNA modification profiles of two samples. Each sample has ~4.3 million direct RNA-seq reads. The "eventalign" step has been running for 4 days and has generated a ~1.9 TB file for each sample. I worry that I may not have enough disk space for Nanocompore to finish. Is there a way to reduce the size or speed it up? I am currently using 12 CPUs.

To Reproduce

```shell
~/tools/nanopolish/nanopolish index -s testrun1/sequencing_summary_testrun1.txt -d testrun1_fast5 testrun1.fastq
~/tools/nanopolish/nanopolish eventalign \
    --reads testrun1.fastq \
    --bam testrun1_transcriptome.Aligned.sortedByCoord.out.bam \
    --genome Mouse_transcriptome.fa \
    -t 12 --samples --print-read-names --scale-events > testrun1_eventalign_reads.tsv
```

Expected behavior
Faster runtime and smaller output files.

Additional context Nanocompore is a cute name.

a-slide commented 3 years ago

Hi @wososa, if I understand correctly, your bottleneck is Nanopolish eventalign.

Regarding the size: Nanopolish is well known to produce huge files, but the size is significantly reduced after the eventalign_collapse step. This used to be done with NanopolishComp, but we recently moved the functionality directly into nanocompore (v1.0.2). The new version of eventalign_collapse can read gzipped files directly, so you can compress the nanopolish eventalign output file to save some space. Alternatively, you can pipe nanopolish eventalign directly into eventalign_collapse and skip the huge intermediate file altogether.
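As a sketch of that piping approach (not taken from the thread — the exact `eventalign_collapse` flag spellings are an assumption, so check `nanocompore eventalign_collapse --help` for your version; file names are the reporter's):

```shell
# Hypothetical sketch: stream nanopolish eventalign straight into
# nanocompore eventalign_collapse so the ~TB-scale intermediate TSV
# never touches disk. The -i/-o flag names below are assumptions;
# verify them against `nanocompore eventalign_collapse --help`.
~/tools/nanopolish/nanopolish eventalign \
    --reads testrun1.fastq \
    --bam testrun1_transcriptome.Aligned.sortedByCoord.out.bam \
    --genome Mouse_transcriptome.fa \
    -t 12 --samples --print-read-names --scale-events \
  | nanocompore eventalign_collapse \
      -i /dev/stdin \
      -o testrun1_collapsed
```

This trades the intermediate file for a longer-running pipeline: if either stage fails you restart from scratch, but you never need the ~1.9 TB of scratch space.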

As for the execution time, you can use more threads (if you have the resources). In the future we would like to be able to use f5c instead of nanopolish, but at the moment RNA is not supported (https://github.com/hasindu2008/f5c/issues/73).

Hope it helps

wososa commented 3 years ago

Hi @a-slide ,

Thanks for the answer, and I apologize for my late reply. Nanocompore has been running for a very long time; the last step has been running for 5 days. Should I wait for it to finish?

```
2021-01-25T23:58:51.227901-0500 ERROR - MainProcess | High fraction of invalid kmers (38.95%) for read 8cb14954-dc85-44fe-aa0f-e7cd65055099
2021-01-25T23:58:51.228206-0500 ERROR - MainProcess | High fraction of invalid kmers (23.8%) for read 1e236488-381a-4c29-ba3e-513231becc18
2021-01-25T23:58:54.423011-0500 INFO - MainProcess | References found in index: 61621
2021-01-25T23:58:55.055914-0500 INFO - MainProcess | Filtering out references with low coverage
2021-01-25T23:58:58.913828-0500 INFO - MainProcess | References remaining after reference coverage filtering: 23286
2021-01-25T23:59:04.278296-0500 INFO - MainProcess | Starting data processing
2021-02-06T04:33:46.735469-0500 INFO - Process-3 | All Done. Transcripts processed: 23286
2021-02-06T04:33:48.124521-0500 INFO - MainProcess | Loading SampCompDB
2021-02-06T04:33:49.913571-0500 INFO - MainProcess | Calculate results
```

Thanks, Woody

Stakaitis commented 3 years ago

I have the same question as @wososa about the last nanocompore step: should I wait longer or not? I'm running the nanocompore_pipeline with 64 CPUs per task and the nanocompore step has been running for 28 hours now; the `MainProcess | Calculate results` step alone has been running for almost 12 hours. These are the files the nanocompore step has generated so far:

```
    94431226 Feb 12 00:19 out_sampcomp.log
     1060470 Feb 12 00:19 outSampComp.db.bak
253084675122 Feb 12 00:19 outSampComp.db.dat
     1060470 Feb 12 00:19 outSampComp.db.dir
```

a-slide commented 3 years ago

Hi both, the last step is not multi-processed, so it doesn't matter whether you use 64 CPUs or 1. But it shouldn't take that long, so I assume something went wrong. What would be very useful is 1) the version of nanocompore you are using and 2) a debug log (run sampcomp with --log_level debug). Thanks

Stakaitis commented 3 years ago

Hi @a-slide,

I'm using nanocompore 1.0.3 from pypi.

Here are the first and last 50 lines of the out_sampcomp.log file generated with the default --log_level argument. I'm now running nanocompore sampcomp with --log_level debug and will upload the debug .log file once it has generated more information. head_and_tail_50_of_out_sampcomp.log

Stakaitis commented 3 years ago

A debug .log file after 40 min of nanocompore sampcomp step: 40min_running_debug_out_sampcomp.log

wososa commented 3 years ago

Hi @a-slide ,

Thanks for following up. When I set --nthreads 40, nanocompore actually used hundreds of CPU threads, so I had to stop it and use the default (3 CPU threads).

I am using v1.0.2 from anaconda.

```shell
source ~/anaconda3/etc/profile.d/conda.sh
conda activate nanocompore
nanocompore sampcomp \
    --file_list1 ../testrun1_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv \
    --file_list2 ../testrun2_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv \
    --label1 testrun1 \
    --label2 testrun2 \
    --fasta ../mm10_transcriptome.fa \
    --overwrite \
    --log_level debug \
    --outpath debug
```

Log from debug mode:

```shell
sed '/ERROR/d' debug_output.txt | less
```

```
2021-02-13T10:24:27.600209-0500 WARNING - MainProcess | Running SampComp
2021-02-13T10:24:27.600777-0500 INFO - MainProcess | Checking and initialising SampComp
2021-02-13T10:24:27.600952-0500 DEBUG - MainProcess | package_name: nanocompore
2021-02-13T10:24:27.601094-0500 DEBUG - MainProcess | package_version: 1.0.2
2021-02-13T10:24:27.601227-0500 DEBUG - MainProcess | timestamp: 2021-02-13 10:24:27.601218
2021-02-13T10:24:27.601357-0500 DEBUG - MainProcess | progress: False
2021-02-13T10:24:27.601482-0500 DEBUG - MainProcess | nthreads: 3
2021-02-13T10:24:27.601612-0500 DEBUG - MainProcess | exclude_ref_id: []
2021-02-13T10:24:27.601736-0500 DEBUG - MainProcess | select_ref_id: []
2021-02-13T10:24:27.601930-0500 DEBUG - MainProcess | max_invalid_kmers_freq: 0.1
2021-02-13T10:24:27.602065-0500 DEBUG - MainProcess | downsample_high_coverage: 5000
2021-02-13T10:24:27.602187-0500 DEBUG - MainProcess | min_ref_length: 100
2021-02-13T10:24:27.602307-0500 DEBUG - MainProcess | min_coverage: 30
2021-02-13T10:24:27.602426-0500 DEBUG - MainProcess | sequence_context_weights: uniform
2021-02-13T10:24:27.602548-0500 DEBUG - MainProcess | sequence_context: 0
2021-02-13T10:24:27.602673-0500 DEBUG - MainProcess | allow_warnings: False
2021-02-13T10:24:27.602813-0500 DEBUG - MainProcess | anova: False
2021-02-13T10:24:27.602940-0500 DEBUG - MainProcess | logit: True
2021-02-13T10:24:27.603067-0500 DEBUG - MainProcess | comparison_methods: GMM,KS
2021-02-13T10:24:27.603194-0500 DEBUG - MainProcess | overwrite: True
2021-02-13T10:24:27.603318-0500 DEBUG - MainProcess | outprefix: out
2021-02-13T10:24:27.603499-0500 DEBUG - MainProcess | outpath: /grid/krainer/home/klin/reads/PromethION/nanocompore/debug
2021-02-13T10:24:27.603705-0500 DEBUG - MainProcess | fasta_fn: ../mm10_transcriptome.fa
2021-02-13T10:24:27.603851-0500 INFO - MainProcess | Only 1 replicate found for condition testrun1
2021-02-13T10:24:27.604019-0500 INFO - MainProcess | This is not recommended. The statistics will be calculated with the logit method
2021-02-13T10:24:27.604147-0500 INFO - MainProcess | Only 1 replicate found for condition testrun2
2021-02-13T10:24:27.604264-0500 INFO - MainProcess | This is not recommended. The statistics will be calculated with the logit method
2021-02-13T10:24:27.605121-0500 DEBUG - MainProcess | OrderedDict([('testrun1', {'testrun1_1': '../testrun1_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv'}), ('testrun2', {'testrun2_1': '../testrun2_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv'})])
2021-02-13T10:24:28.511439-0500 INFO - MainProcess | Reading eventalign index files
```

wososa commented 3 years ago

Hi @a-slide , A more detailed log.txt after 24 hours. log.txt

a-slide commented 3 years ago

OK, thanks @wososa and @Stakaitis. Tom and I will have a closer look, but we have the feeling that you both have much larger datasets than Nanocompore was initially designed for. As a temporary fix, I would suggest downsampling your datasets using --downsample_high_coverage 1000. I know it is not perfect, but it should at least lower the memory and CPU load.

@wososa We thought the CPU over-subscription was fixed, but apparently not on every platform.

Thanks for helping us get to the bottom of the issue

tleonardi commented 3 years ago

Hi @wososa and @Stakaitis, thanks for sending your log files; I've now had a better look. In both cases, the problem is the very high number of transcripts being processed. @Stakaitis, you are running nanocompore with the --min_coverage option set to 1; however, with fewer than ~30 reads per transcript there isn't enough statistical power to detect modifications, so you should set a higher filtering threshold (i.e. --min_coverage 30, which is the default). That way, Nanocompore has fewer transcripts to process and hence completes faster.
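To illustrate why low-coverage transcripts are not worth keeping, here is a hedged, self-contained simulation (not Nanocompore code; the effect sizes and the 50% modification rate are made up for illustration). It estimates how often a two-sample KS test, one of the comparison methods named in the debug log above, reaches p < 0.05 when comparing simulated kmer intensities at different coverages:

```python
# Hypothetical power sketch: simulate per-kmer current intensities for two
# conditions, where half the reads in the "test" sample carry a 1-SD mean
# shift (a made-up effect size), and estimate KS-test detection power.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

def detection_power(coverage, n_trials=500, mod_fraction=0.5, shift=1.0):
    """Fraction of simulated transcripts where ks_2samp gives p < 0.05."""
    hits = 0
    for _ in range(n_trials):
        ctrl = rng.normal(0.0, 1.0, coverage)          # unmodified sample
        n_mod = int(coverage * mod_fraction)
        test = np.concatenate([
            rng.normal(shift, 1.0, n_mod),             # modified reads
            rng.normal(0.0, 1.0, coverage - n_mod),    # unmodified reads
        ])
        if ks_2samp(ctrl, test).pvalue < 0.05:
            hits += 1
    return hits / n_trials

print(f"power at  5 reads: {detection_power(5):.2f}")
print(f"power at 30 reads: {detection_power(30):.2f}")
```

Under these toy assumptions, power at 5 reads stays close to the false-positive rate while 30 reads gives a usable detection rate, which is consistent with the rationale for the default threshold; transcripts below it mostly add runtime, not signal.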

@wososa, you are already using the default --min_coverage 30 option, but you still have a very large dataset (~7mln reads per sample and 23286 transcripts with coverage >30; I'm assuming it was sequenced on a PromethION). We have never analyzed such a large dataset, and I'm afraid several parts of nanocompore are not optimised for this scenario. We are working on a new release with several technical improvements to the way the data are stored and processed, but it will be a few months before it is ready. In the meantime, I have two suggestions: upgrade to nanocompore v1.0.3, where we have improved multithreading, and downsample the dataset as @a-slide suggested (--downsample_high_coverage 1000). Our benchmarks show that 1000 reads per transcript are enough to reliably call modifications, so I don't think downsampling would be a major problem.
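Concretely, the two suggested mitigations only change flags on the sampcomp invocation posted earlier in this thread (paths and labels are the reporter's; this is a sketch rather than a tested command, and --file_list1/--file_list2, --min_coverage, and --downsample_high_coverage are the option names visible in the debug log above):

```shell
# Sketch: same sampcomp call as before, but capping each transcript at
# 1000 reads and keeping the default 30-read coverage floor.
nanocompore sampcomp \
    --file_list1 ../testrun1_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv \
    --file_list2 ../testrun2_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv \
    --label1 testrun1 --label2 testrun2 \
    --fasta ../mm10_transcriptome.fa \
    --min_coverage 30 \
    --downsample_high_coverage 1000 \
    --outpath sampcomp_downsampled
```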

Thanks to you both for your patience and your helpful debugging information! tom

tleonardi commented 3 years ago

@Stakaitis if --min_coverage 30 doesn't solve your problem, feel free to reopen this issue!

wososa commented 3 years ago

@a-slide @tleonardi ,

Luckily, my nanocompore run finished. I am now trying to visualize the results, but I have been stuck at the Calculate results step for 24 hours. It is probably again due to the number of reads, so I am reopening this thread:

```
Python 3.6.12 |Anaconda, Inc.| (default, Sep  8 2020, 23:10:56)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nanocompore.SampCompDB import SampCompDB, jhelp
>>> db=SampCompDB(db_fn="outSampComp.db",fasta_fn="mm10_transcriptome.fa")
2021-02-20 09:50:12.564 | INFO     | nanocompore.SampCompDB:__init__:55 - Loading SampCompDB
2021-02-20 09:50:12.873 | DEBUG    | nanocompore.SampCompDB:__init__:62 -       Reading Metadata
2021-02-20 09:50:12.888 | DEBUG    | nanocompore.SampCompDB:__init__:68 -       Loading list of reference ids
2021-02-20 09:50:12.965 | DEBUG    | nanocompore.SampCompDB:__init__:82 - Checking files and arg values
2021-02-20 09:50:13.011 | INFO     | nanocompore.SampCompDB:__init__:100 - Calculate results
```

Thanks, Woody