speed running - GitHub Issues

koujiaodahan commented 4 years ago

Hi, I'm running the software to assemble a human genome. It has been running for one day and is still going, so how can I speed it up? Generally speaking, how much memory does each thread need? If I have sufficient memory, can I set a higher thread count? My machine has 64 cores and 500 GB of memory. Here is my script:

~/backup_data/anaconda3/haslr/bin/haslr.py -t 8 -o ~/USER/lizhichao/Assembly/outdir/Assemblyoutput -g 3g -l ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_ONT.fastq.gz -x nanopore -s ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_1.fq.gz ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_2.fq.gz && \
echo "haslr finished"

jelber2 commented 4 years ago

Change -t 8 to -t 64, perhaps.
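A rough sketch of how the thread count could be read from the machine instead of being hard-coded. The file names and output directory below are placeholders, not the real paths from the original post. The flags are the ones already used in that command. The command is only printed, not run.

```shell
# Sketch only: derive the thread count from the machine.
# nproc is GNU coreutils; getconf is used as a fallback where nproc is missing.
threads="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"

# Placeholder names (hypothetical); substitute your own files.
outdir="haslr_out_t${threads}"
long_reads="long_reads.fastq.gz"
short_1="short_1.fq.gz"
short_2="short_2.fq.gz"

# Same flags as in the original command; only -t and -o change.
# The command is printed for review rather than executed.
echo "haslr.py -t $threads -o $outdir -g 3g -l $long_reads -x nanopore -s $short_1 $short_2"
```

Using a fresh output directory for each run keeps a second run from colliding with one that is still in progress.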

koujiaodahan commented 4 years ago

Thanks. I have started a second run with 55 threads without stopping the 8-thread run. How long do you think both runs will take?

jelber2 commented 4 years ago

Did you change the output directory? I have no idea how long it might take. It probably depends on the coverage of the long and short reads.

koujiaodahan commented 4 years ago

Sure, I set a new output directory.

koujiaodahan commented 4 years ago

It has been running Minia for over 24 hours. Is that normal?

jelber2 commented 4 years ago

Minia is very fast, but genome size and coverage influence its runtime, as probably do the choice of k-mer length and other similar settings.

koujiaodahan commented 4 years ago

So, are there any recommended parameters for running a human genome assembly?

jelber2 commented 4 years ago

Are you trying out the assembler with someone else's data or do you have a new human genome assembly that you would like to make with your own data? I would think that it would have finished by now (~5 days running). Again, you haven't specified the coverage of the Illumina or I guess Oxford Nanopore data that you are using. You can also read the paper describing HASLR for perhaps more information on the program.
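If the coverage is not known, it can be estimated from the read files themselves. The function below is a minimal sketch, not part of HASLR, and the name estimate_coverage is made up for illustration. It assumes a gzipped FASTQ with standard 4-line records and a genome size given in base pairs.

```shell
# estimate_coverage FASTQ_GZ GENOME_SIZE_BP
# Sums the lengths of the sequence lines (line 2 of every 4-line record)
# and divides the total by the genome size to get an approximate depth.
estimate_coverage() {
    gzip -cd "$1" |
        awk -v g="$2" 'NR % 4 == 2 { bases += length($0) }
                       END { printf "%.1f\n", bases / g }'
}

# Example call (file name is a placeholder), assuming a 3 Gbp genome:
#   estimate_coverage long_reads.fastq.gz 3000000000
```

For paired-end data, the function would be run on each file and the two values added together.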

koujiaodahan commented 4 years ago

Sorry, I'm trying to assemble a human genome. The coverage of both the short reads and the long reads is 120x.

jelber2 commented 4 years ago

I would recommend you try either GraphAligner (https://github.com/maickrau/GraphAligner) or Ratatosk (https://github.com/DecodeGenetics/Ratatosk) to error correct your Nanopore reads with your Illumina reads, then assemble with Flye (https://github.com/fenderglass/Flye) using the --nano-corr option. Ratatosk even has a faster reference-based method for correcting the reads (I haven't used this method, so I don't know the details). For Flye I really don't think you need 120x Nanopore coverage, especially if you can correct the reads. See here for running Ratatosk or here for running GraphAligner.

Edit: I guess you could use 120x Nanopore reads for a Human assembly (https://github.com/fenderglass/Flye/blob/3ee5b3390a5f88c36d0869d0382c75aba3b1f5cc/README.md#flye-benchmarks), although these data come from CHM13 (homozygous cell line). Also note the 4000 CPU hours (divide 4000 by number of available cores and you get approximately how many wall hours the assembly would take).
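The division mentioned above, as a small sketch. It assumes perfect scaling across cores, which real assemblers rarely achieve, so the result is best read as a lower bound. The helper name wall_hours is made up for illustration.

```shell
# wall_hours CPU_HOURS CORES
# Naive wall-clock estimate: total CPU hours divided by the number of cores.
wall_hours() {
    awk -v cpu="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", cpu / cores }'
}

# 4000 CPU hours on the 64-core machine from the original post.
wall_hours 4000 64   # prints 62.5
```

With only 8 threads, the same estimate comes to 500 hours, which is why the thread count matters so much here.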

koujiaodahan commented 4 years ago

Thanks, jelber2. So HASLR is not advised? Why?

jelber2 commented 4 years ago

In my experience, HASLR will generate very good statistics (N50, etc) for assembly using raw long reads and accurate short reads, but the error rate (indels and substitutions) of the final assembly is similar to the error rate of the long reads and not the short reads. One can improve the error rates by using long reads corrected by the short reads, and using the corrected long reads as input, but then the assembly statistics suffer. This is based off of simulation of course, and simulations are sometimes useful but can never fully capture the intricacies of real data.

haghshenas commented 4 years ago

Hi @koujiaodahan and thanks for trying HASLR. I'm surprised that Minia is taking so long to finish. In my experience, on human short-read datasets with about 40x coverage, it takes about 5 hours to finish. Are you sure that the Minia assembly was the step that took a long time to finish? If yes, one solution could be subsampling the short reads to about 40-50x coverage. You can use the fastutils command that comes with HASLR for that purpose. So assuming you have a paired-end dataset, you do the following:

fastutils interleave -q sr_1.fastq sr_2.fastq | fastutils subsample -q -g 3g -d 40 > sr_40x.fastq
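For a sense of scale, here is the back-of-the-envelope arithmetic behind subsampling to 40x. It reads -g as the genome size and -d as the target depth, as the command above suggests. This is only a sketch of the numbers, not a description of how fastutils works internally, and the helper names are made up for illustration.

```shell
# keep_fraction CURRENT_DEPTH TARGET_DEPTH
# Fraction of reads to keep to go from the current depth to the target depth.
keep_fraction() {
    awk -v cur="$1" -v tgt="$2" \
        'BEGIN { f = tgt / cur; if (f > 1) f = 1; printf "%.2f\n", f }'
}

# target_bases GENOME_SIZE_BP TARGET_DEPTH
# Total number of bases that corresponds to the target depth.
target_bases() {
    awk -v g="$1" -v d="$2" 'BEGIN { printf "%.0f\n", g * d }'
}

keep_fraction 120 40          # 120x short reads down to 40x
target_bases 3000000000 40    # 3 Gbp genome at 40x
```

So going from 120x to 40x keeps roughly one read pair in three.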

With regard to the error rate of the final assembly that was raised by @jelber2, if you eventually want to polish your assembly, our results show that polished HASLR assemblies are as accurate as polished assemblies from other tools.

koujiaodahan commented 4 years ago

Yeah, I agree that the coverage is too high, so I downsampled and got an error, which I reported in #20. Also, I want to know whether your polishing method means running wtdbg2.pl after running HASLR?

vpc-ccg / haslr

speed running #19