svm-zhang / AGOUTI

Annotated Genome Optimization Using Transcriptome Information
MIT License

support of multiple bam files? #9

Closed scotty323 closed 6 years ago

scotty323 commented 7 years ago

Hi Zhang,

I am able to use AGOUTI with one BAM file. How can I use multiple BAM files simultaneously?

Tao

svm-zhang commented 7 years ago

Hello @Tao,

Thanks for using AGOUTI.

AGOUTI does not yet support reading multiple BAM files simultaneously. I guess you will have to first concatenate your BAMs into a single file.

Also, may I ask how large each of your BAMs is?

Simo
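
One way to combine several BAMs into a single file is `samtools merge`. The sketch below only builds and runs that command; the function names and file names are hypothetical, and it assumes samtools is installed and on `PATH`. Whether AGOUTI expects coordinate-sorted or name-sorted input is not stated in this thread, so the `name_sorted` switch is left to the user.

```python
import subprocess


def build_merge_command(out_bam, in_bams, extra_threads=0, name_sorted=False):
    """Build a `samtools merge` command line as a list of arguments.

    Hypothetical helper for illustration; it is not part of AGOUTI.
    """
    if not in_bams:
        raise ValueError("at least one input BAM is required")
    cmd = ["samtools", "merge"]
    if name_sorted:
        # -n: inputs are sorted by read name rather than by coordinate
        cmd.append("-n")
    if extra_threads > 0:
        # -@: number of additional compression threads
        cmd += ["-@", str(extra_threads)]
    cmd.append(out_bam)
    cmd.extend(in_bams)
    return cmd


def merge_bams(out_bam, in_bams, **kwargs):
    """Run the merge; raises CalledProcessError if samtools exits non-zero."""
    subprocess.run(build_merge_command(out_bam, in_bams, **kwargs), check=True)
```

For example, `merge_bams("merged.bam", ["s1.bam", "s2.bam"], extra_threads=3)` would write `merged.bam`. All inputs should share the same sort order, and samtools will not overwrite an existing output file unless told to.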

scotty323 commented 7 years ago

I have 12 BAM files, about 3 GB each on average. How can I increase the thread count and memory when running on the merged BAM file?

2017-04-11 09:41:28,500 - INFO - AGOUTI_DENOISE PROGRESS - [BEGIN] Denoising joining pairs
2017-04-11 09:41:49,176 - INFO - AGOUTI_DENOISE PROGRESS - Succeeded
2017-04-11 09:41:49,177 - INFO - AGOUTI_DENOISE PROGRESS - Denoise took in 0.34 min CPU time
2017-04-11 09:41:49,177 - INFO - AGOUTI_DENOISE PROGRESS - 613 contig pairs filtered for spanning across >1 gene models
2017-04-11 09:41:49,177 - INFO - AGOUTI_DENOISE PROGRESS - 39 contig pairs filtered for not being one of the four combinations
2017-04-11 09:41:49,177 - INFO - AGOUTI_DENOISE PROGRESS - 1526 contig pairs filtered for less support
2017-04-11 09:41:49,177 - INFO - AGOUTI_DENOISE PROGRESS - 9 contig pairs for scaffolding
2017-04-11 09:41:49,178 - INFO - AGOUTI_SCAFFOLDING PROGRESS - Building graph from joining reads pairs
2017-04-11 09:41:49,179 - INFO - AGOUTI_SCAFFOLDING PROGRESS - Build graph took 0.0000 min CPU time
2017-04-11 09:41:49,179 - INFO - AGOUTI_SCAFFOLDING PROGRESS - 16 vertices in the graph
2017-04-11 09:41:49,179 - INFO - AGOUTI_SCAFFOLDING PROGRESS - Simplifying graph
2017-04-11 09:41:49,179 - INFO - AGOUTI_SCAFFOLDING PROGRESS - 0 Edges removed due to insufficient supports
2017-04-11 09:41:49,179 - INFO - AGOUTI_SCAFFOLDING PROGRESS - Start graph walk
2017-04-11 09:41:49,179 - INFO - AGOUTI_SCAFFOLDING PROGRESS - number of visited nodes: 16
2017-04-11 09:41:49,180 - INFO - AGOUTI_SCAFFOLDING PROGRESS - Scaffolding took 0.0000 min CPU time
2017-04-11 09:41:49,180 - INFO - AGOUTI_SCAFFOLDING PROGRESS - Graph Reconciliation
2017-04-11 09:41:49,180 - INFO - AGOUTI_SCAFFOLDING PROGRESS - Reconciliation took 0.0000 min CPU time
2017-04-11 09:41:49,180 - INFO - AGOUTI_SCAFFOLDING PROGRESS - Report scaffolding paths
2017-04-11 09:41:49,181 - INFO - AGOUTI_SCAFFOLDING PROGRESS - Visualize graph in DOT
2017-04-11 09:41:49,204 - INFO - AGOUTI_UPDATE PROGRESS - [BEGIN] Updating gene models
2017-04-11 09:41:49,247 - INFO - AGOUTI_UPDATE PROGRESS - Finalizing sequences
2017-04-11 09:41:55,456 - INFO - AGOUTI_UPDATE PROGRESS - Outputting updated Gene Moddels
2017-04-11 09:41:56,316 - INFO - AGOUTI_UPDATE PROGRESS - Summarizing AGOUTI gene paths
2017-04-11 09:41:56,317 - INFO - AGOUTI_UPDATE PROGRESS - -----------Summary-----------
2017-04-11 09:41:56,317 - INFO - AGOUTI_UPDATE PROGRESS - number of contigs scaffoled: 15
2017-04-11 09:41:56,317 - INFO - AGOUTI_UPDATE PROGRESS - number of scaffolds: 7
2017-04-11 09:41:56,317 - INFO - AGOUTI_UPDATE PROGRESS - number of contigs in the final assembly: 3326
2017-04-11 09:41:56,318 - INFO - AGOUTI_UPDATE PROGRESS - Final assembly N50: 60718603
2017-04-11 09:41:56,318 - INFO - AGOUTI_UPDATE PROGRESS - Final number of genes: 26688
2017-04-11 09:41:56,318 - INFO - AGOUTI_UPDATE PROGRESS - Succeeded
2017-04-11 09:41:56,318 - INFO - PARSE_ARGS PROGRESS - Peak memory use: 1.00000 GB

svm-zhang commented 7 years ago

AGOUTI can currently use only a single thread for reading. As for memory, may I ask what species this is? And are you running on your local computer or on a cluster?

scotty323 commented 7 years ago

The species is sacred lotus, and the assembled draft genome (with genetic map) is about 1 G.

It is a desktop server:

[lzc@localhost ~]$ free -lh
              total        used        free      shared  buff/cache   available
Mem:           251G        1.9G         93G        326M        156G        249G
Low:           251G        158G         93G
High:            0B          0B          0B
Swap:          4.0G        191M        3.8G

scotty323 commented 7 years ago

Other information about the server:

top - 21:37:19 up 1 day, 17:25, 2 users, load average: 0.00, 0.04, 0.05
Tasks: 568 total, 1 running, 489 sleeping, 0 stopped, 78 zombie
%Cpu(s): 0.1 us, 0.0 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 26392755+total, 97769840 free, 2010016 used, 16414768+buff/cache
KiB Swap: 4194300 total, 3998236 free, 196064 used. 26111033+avail Mem

  PID USER  PR  NI   VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
19514 lzc   20   0 158232 2672 1552 R  1.0  0.0 0:00.19 top
   78 root  20   0      0    0    0 S  0.3  0.0 0:14.63 rcuos/19
    1 root  20   0 196232 7792 2396 S  0.0  0.0 1:36.35 systemd
    2 root  20   0      0    0    0 S  0.0  0.0 2:08.29 kthreadd
    3 root  20   0      0    0    0 S  0.0  0.0 4:39.81 ksoftirqd/0
    5 root   0 -20      0    0    0 S  0.0  0.0 0:00.00 kworker/0:0H
    6 root  20   0      0    0    0 S  0.0  0.0 0:30.80 kworker/u96:0
    8 root  rt   0      0    0    0 S  0.0  0.0 1:25.26 migration/0
    9 root  20   0      0    0    0 S  0.0  0.0 0:00.00 rcu_bh

scotty323 commented 7 years ago

Another question: how can I make sure the paired-end read connections do not come from transposable elements or other repeats?

So for each connection between contigs, does AGOUTI use only uniquely mapped paired-end reads?

svm-zhang commented 7 years ago

I believe your desktop server is more than capable of running AGOUTI on a BAM file of >= 36 GB.

Could you please go ahead and give it a try? If it runs very slowly, let me see if I can come up with a quick patch to support reading BAMs in parallel.

svm-zhang commented 7 years ago

Reads from repetitive parts of a genome are not expected to map uniquely. Even if they do, I think you could use mapping quality to control for them. To be honest, I haven't looked at this closely yet.

Currently, yes. Only uniquely mapped paired-end reads are allowed.
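
As a rough illustration of controlling repeats by mapping quality, the sketch below filters SAM text records by the MAPQ column. It is not AGOUTI's own filter; the function names and the default threshold of 20 are assumptions, and how well MAPQ reflects unique mapping depends on the aligner used.

```python
def passes_mapq(sam_line, min_mapq):
    """Return True if a SAM alignment record is mapped with MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    is_unmapped = bool(flag & 0x4)
    return (not is_unmapped) and mapq >= min_mapq


def filter_sam_lines(lines, min_mapq=20):
    """Yield header lines unchanged and alignment lines that pass the filter.

    The threshold of 20 is an arbitrary illustrative default.
    """
    for line in lines:
        if line.startswith("@") or passes_mapq(line, min_mapq):
            yield line
```

The same filtering can be done before running AGOUTI with `samtools view -q`, which skips alignments whose MAPQ is below the given value. Note that filtering read by read can leave one mate of a pair without its partner.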

scotty323 commented 7 years ago

Ok. thanks!

svm-zhang commented 7 years ago

Hello @scotty323,

I have implemented a beta version of AGOUTI that can take multiple BAM files. You can use the -t argument to specify how many BAM files you want to read at the same time. Reading each BAM invokes one process for samtools and one for the AGOUTI worker that reads the BAM. You can simply provide multiple BAM files after the -bam argument, separated by single spaces.

Could you please pull down the "multibam" branch and give it a try? Let me know how it goes.

Simo
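
The reading pattern described above (several BAMs read at once, each through its own samtools process) can be sketched roughly as follows. This is not the code in the "multibam" branch: all names are hypothetical, worker threads stand in for AGOUTI's worker processes, and each worker merely counts records instead of extracting joining pairs. It assumes samtools is on `PATH`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def samtools_view_cmd(bam_path):
    # `samtools view` streams alignments as SAM text (no header by default).
    return ["samtools", "view", bam_path]


def count_alignments(cmd):
    """Run one reader command and count the lines it writes to stdout."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        n_records = sum(1 for _ in proc.stdout)
    # Leaving the `with` block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError("reader failed: %s" % " ".join(cmd))
    return n_records


def read_bams(bam_paths, n_parallel, make_cmd=samtools_view_cmd):
    """Read up to n_parallel BAM files at once; return {path: record count}."""
    cmds = [make_cmd(path) for path in bam_paths]
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        counts = list(pool.map(count_alignments, cmds))
    return dict(zip(bam_paths, counts))
```

Here `n_parallel` plays the role that -t plays in the description above: with 12 BAMs and `n_parallel=4`, at most four samtools processes run at any time. The `make_cmd` parameter exists only so the reader command can be swapped out, for example when testing without samtools.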

svm-zhang commented 6 years ago

Closing this for now. Please reopen if the issue persists.