pughlab / bamgineer

Bamgineer: Introduction of simulated allele-specific copy number variants into exome and targeted sequence data sets
Apache License 2.0

Thread error running bamgineer #10

Open pcgen1 opened 5 years ago

pcgen1 commented 5 years ago

I am getting an error running the bamgineer tool. It seems to be related to the multiprocessing module. I also tried an older version of the multiprocessing module (0.70.4, as suggested on online forums for this Python error; it seems to be a common one). Still no luck getting bamgineer to work past it. Could you suggest a solution? Please find the error log below:

generating phased bed _ filtering bed file columns for amp4AABB47974300_tmp2.bed _
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mnt/DataDisk/NGS_tools/bamgineer/src/helpers/handlers.py", line 76, in receive
    record = self.queue.get(True, self.polltime)
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 135, in get
    res = self._recv()
TypeError: __init__() takes exactly 2 arguments (1 given)

suluxan commented 5 years ago

Hey, could you post your config.cfg file? Are you working locally or on a cluster? There is a Dockerfile to help you get up and running.

pcgen1 commented 5 years ago

Please find the config.cfg pasted below. I am working on a cloud instance. I am not currently using Docker for this tool, as it gave separate issues earlier (hard to describe them all here). I am running bamgineer locally on the cloud instance (all dependencies are installed locally).

[SOFTWARE]
java =/usr/bin/java
gatk =/mnt/DataDisk/Bamgineer/Jar/GenomeAnalysisTK.jar
java_path =/usr/bin/java
beagle_path =/mnt/DataDisk/Bamgineer/beagle/beagle.28Sep18.793.jar
samtools_path =/usr/local/bin/samtools
vcftools_path =/usr/local/bin/vcftools
bedtools_path =/usr/local/bin/bedtools
sambamba_path =/usr/local/bin/sambamba
picard_path =/mnt/DataDisk/Bamgineer/Jar/picard.jar

[REFERENCE]
reference_path =/mnt/DataDisk/Bamgineer/human_g1k_v37_decoy.chr.fasta
vcf_path =/mnt/DataDisk/VCFs/variants_haplotype_caller_C12878W.noIndels.vcf.recode.HET.noX_Y.phased.chr.vcf.gz
exons_path =/mnt/DataDisk/Resources/Beds/Regions.chr21.bed

[RESULTS]
results_path =/mnt/DataDisk/Bamgineer
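
A quick way to rule out path problems in a config like the one above is to parse it and check that every referenced file exists. The following is only a minimal sketch and not part of bamgineer: the helper name `missing_paths` is hypothetical, and it assumes the file is standard INI with one `key = value` pair per line and that every value is a filesystem path.

```python
import configparser
import os


def missing_paths(cfg_text, exists=os.path.exists):
    """Return (section, key, path) for every config value that does not exist.

    `exists` is injectable so the check can be tested without touching disk.
    """
    # interpolation=None keeps any literal '%' in a path from being expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    missing = []
    for section in parser.sections():
        for key, value in parser.items(section):
            path = value.strip()
            if not exists(path):
                missing.append((section, key, path))
    return missing


if __name__ == "__main__":
    # Hypothetical usage: point this at your own config.cfg.
    with open("config.cfg") as handle:
        for section, key, path in missing_paths(handle.read()):
            print("missing: [%s] %s = %s" % (section, key, path))
```

An empty result only shows that the paths exist; it says nothing about tool versions.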

suluxan commented 5 years ago

What version of the multiprocessing package gave you the error? The version of multiprocessing we have on our cluster is 0.70a1. From the documentation, it looks like the latest version (0.70.7) is a fork of 0.70a1 (https://pypi.org/project/multiprocess/0.70.7)
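
To answer a version question like this one, the installed versions can be read programmatically. This is a minimal sketch that assumes Python 3.8 or newer (for `importlib.metadata`), whereas the thread itself is on Python 2.7; the helper name `installed_version` is hypothetical. On Python 3 the standard-library `multiprocessing` module is not a separately installed distribution, so it would report `None` here.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distributions mentioned in this thread.
    for name in ("multiprocess", "dill", "pathos", "pandas"):
        print(name, installed_version(name))
```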

suluxan commented 5 years ago

Hey, I've updated multiprocessing to now use multiprocess 0.70.7 (pip install multiprocess==0.70.7). Please pull from the latest version of bamgineer and let me know if you have any issues with it.

pcgen1 commented 5 years ago

Sure, will let you know. Thank you!

pcgen1 commented 5 years ago

Hi suluxan, the program is running fine now, but it has been running for almost 3 hours with a small input bam containing only chr21 and chr22 regions, a 'splitbam' directory containing chr21.bam, chr21.byname.bam, chr22.bam and chr22.byname.bam, and a cnv file containing only 1 amp (cn=4) for 1 region of chr21. The script gets up to the step of creating a chr21_roiamp4AABB47974300.bam file under "tmpbams" but seems to be taking a long time to create the final simulated bam. Could you help me see whether something is going wrong here? Please see the command line, bed and logs pasted below. The config file is the same as posted earlier in this thread.

(Note: I give bamgineer a phased vcf consisting only of chr21 phased variants. The phased vcf was created by running the beagle tool ahead of bamgineer, due to some issues we faced earlier while running beagle as part of the bamgineer workflow; not necessary to discuss at the moment.)

Command line: simulate.py -inbam ~/DataDisk/VCFs/C12878W.21_and_22.bam -outbam ~/DataDisk/Bamgineer/C12878W.21_and_22.bamgineer.bam -cnv_bed ~/DataDisk/VCFs/cnv_of_interest.bed -config ~/DataDisk/Bamgineer/config.cfg -splitbamdir ~/DataDisk/VCFs/splitBam > ~/DataDisk/Bamgineer/C12878W.21_22.bamgineer.log 2>&1 &

cnv bed file: chr21 47974300 47974590 AABB 4

Logs:

a) C12878W.21_22.bamgineer.log
/mnt/DataDisk/Bamgineer
generating phased bed _ filtering bed file columns for amp4AABB47974300_tmp2.bed

b) debug.log
pipeline started!
--- Initializing input files ---
--- initialization complete ---

Do you expect simulate.py to take this much time with such a small bam? If yes, does a multithreading parameter exist that could make simulate.py run faster on a single instance? I did not see such a parameter in the "help" section.

suluxan commented 5 years ago

Yeah, the previous steps of Bamgineer v1 for phasing were not that clear; it seems Beagle needs population data to phase correctly. Is that how you generated your VCF? I have been running Bamgineer v2 with properly phased VCFs (from 10x) and was working on a change to make it much faster (to only use "PASS" variants) but I am working on the benchmarking. I will push that change now and you can let me know if it helps.

It should not take that long to get the ROI bam especially considering how small the cnv is. Although bamgineer v2 is capable of such focal alterations, I would recommend a couple Kb in order to get a decent amount of reads in the ROI bam.
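
The size of the region in question can be checked directly from the cnv bed line posted earlier. The sketch below is illustrative only: the column meanings (chromosome, start, end, allelic genotype, copy number) are inferred from the single example line in this thread, the length is computed as `end - start` on the assumption of BED-style half-open coordinates, and the 2000 bp threshold is just one reading of "a couple Kb". None of the names come from bamgineer.

```python
from typing import NamedTuple


class CnvRegion(NamedTuple):
    chrom: str
    start: int
    end: int
    genotype: str      # e.g. "AABB"; meaning assumed from the thread
    copy_number: int

    @property
    def length(self):
        # Assumes half-open BED coordinates; add 1 for inclusive coordinates.
        return self.end - self.start


def parse_cnv_line(line):
    """Parse one whitespace-separated cnv bed line into a CnvRegion."""
    chrom, start, end, genotype, copy_number = line.split()
    return CnvRegion(chrom, int(start), int(end), genotype, int(copy_number))


def is_too_focal(region, min_length=2000):
    """True if the region is shorter than the (assumed) recommended size."""
    return region.length < min_length


if __name__ == "__main__":
    region = parse_cnv_line("chr21 47974300 47974590 AABB 4")
    print(region.length, is_too_focal(region))
```

Under these assumptions the posted region spans 290 bases, well under a couple of kilobases.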

suluxan commented 5 years ago

Regarding the multithreading comment, once we update the pysam/samtools versions we will be able to take advantage of the multithreading. The current samtools version that we use (1.2) does not support multithreading.

Also, try pulling from the latest version now and let me know if you have the same problem.

pcgen1 commented 5 years ago

suluxan, I am getting ROI bam (in tmpbam folder) no problem, but not getting the final bam in the finalbam folder. I believe that the finalbam folder would contain the bam simulated with the CNVs...am I right?

suluxan commented 5 years ago

What are the other files in the tmpbams directory?

pcgen1 commented 5 years ago

Just one: chr21_roiamp4AABB47974300.bam

pcgen1 commented 5 years ago

That tmp bam has around 282 reads

suluxan commented 5 years ago

Try the latest version I just pushed, the ROI should generate much faster.

pcgen1 commented 5 years ago

Suluxan, ROI is indeed getting generated faster. The problem is that the python script is still running. And I see no final bam generated in finalbam folder. I believe that final bam should be the one that actually contains the simulated cnv... Am I right?

pcgen1 commented 5 years ago

Let me try the new version anyways..

pcgen1 commented 5 years ago

Suluxan, the new version generates the ROI bam at the same speed as the previous version, but gives back the multiprocessing module error which you already fixed last Friday. And moreover, the issue still remains: The script is still running and the final bam not being generated.

See the traceback for the multiprocessing error below:

/mnt/DataDisk/Bamgineer
generating phased bed _ filtering bed file columns for amp4AABB47974300_tmp2.bed _
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mnt/DataDisk/NGS_tools/bamgineer/src/helpers/handlers.py", line 76, in receive
    record = self.queue.get(True, self.polltime)
  File "/usr/local/lib/python2.7/dist-packages/multiprocess-0.70.7-py2.7-linux-x86_64.egg/multiprocess/queues.py", line 138, in get
    res = self._recv()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/dill/dill.py", line 299, in loads
    return load(file)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/dill/dill.py", line 288, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
TypeError: __init__() takes exactly 2 arguments (1 given)
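
The last frames of this traceback show pickle rebuilding an object with `func(*args)` and the constructor rejecting the arguments. The thread never identifies which object fails, so the following is only a sketch of one common way this kind of error arises: an exception class whose `__init__` requires more arguments than it forwards to `Exception`. The class and function names are hypothetical, and the example is written for Python 3 (the thread is on 2.7, where the wording of the `TypeError` differs).

```python
class BadError(Exception):
    """Requires two arguments but forwards only one to Exception."""

    def __init__(self, message, code):
        super().__init__(message)        # self.args becomes (message,)
        self.code = code


class GoodError(Exception):
    """Forwards every required argument, so it can be rebuilt from args."""

    def __init__(self, message, code):
        super().__init__(message, code)  # self.args becomes (message, code)
        self.code = code


def rebuild(exc):
    """Re-create exc the way pickle's load_reduce step does: func(*args)."""
    func, args = exc.__reduce__()[:2]
    return func(*args)


def survives_rebuild(exc):
    """True if the object can be reconstructed from its reduce arguments."""
    try:
        rebuild(exc)
        return True
    except TypeError:
        return False


if __name__ == "__main__":
    print("GoodError:", survives_rebuild(GoodError("boom", 2)))
    print("BadError:", survives_rebuild(BadError("boom", 2)))
```

If the failing object here is an exception raised in a worker, the real underlying error may be hidden behind this `TypeError`.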

pcgen1 commented 5 years ago

One more point: the samtools version that bamgineer is using on my instance is 0.1.18. This was kept consistent with what you mentioned in the example config file. Do you think updating it to 1.2 might make a difference speed-wise (given that you already mentioned that 1.2 is slow)? Regardless, I think I should be consistent with version 1.2 just to compare apples to apples. Let me do that, not with the new version of bamgineer but with the old one (because of the multiprocessing issue that I mentioned a minute ago).

pcgen1 commented 5 years ago

OK, so I am rerunning bamgineer with samtools version 1.2. The ROI bam got generated in the "tmpbams" directory in a second. The script is still running. I will wait to see if it generates the final bam by the end of the day. Please note that the input bam still contains only two chromosomes (21 and 22), and the cnv bed has only one region from chr21 in it (same as before). The phased vcf contains only chr21 variants (indeed, passed ones) (same as before). I have the -splitbamdir option activated (same as before) (note: the splitbam dir contains bams for chr21 and chr22; the naming of these bams matches what you specified in your manual). Also, note that this is the version of bamgineer I pulled last Friday after you fixed the multiprocessing module error.

suluxan commented 5 years ago

Okay, couple of things:

I presume the tool stops running due to the starred points above.

suluxan commented 5 years ago

A lot of the dependency issues were supposed to be solved through the Dockerfile... any reasons why it initially failed? Considering you are on a cloud environment it would be the optimal route to go.

Also, I will get a bamgineer image to the dockerhub by tonight or tomorrow so it will be easy to just pull from there.

pcgen1 commented 5 years ago

May I know which version of bamUtil do you prefer?

suluxan commented 5 years ago

Please check the dockerfile (bamgineer/docker-example/Dockerfile) for install instructions and versions. We have tested bamgineer with bamUtil/1.0.14. I am working on getting the image to a docker repo as well as updating the documentation and I will let you know when those are available. Thanks.

pcgen1 commented 5 years ago

OK, I was looking at config.cfg under the bamgineer/docker-example/inputs folder for versioning info. Thanks for correcting me. I am indeed using bamUtil/1.0.14. Glad to know you recommend the same. At this point, I have the appropriate versions of all tools set up on my cloud instance. I will also test the docker container once you have it up on the docker repo. We use the singularity engine on our instance, so I would need to convert your docker container to singularity. That is what I did earlier too, but the issue I faced with the docker container seemed to be less related to its compatibility with singularity and more related to the internal (default) environment in the container itself. Singularity does not make significant changes to the default environment in docker containers (based on my experience converting docker containers to singularity ones and using them with the singularity engine). They usually run well through the singularity engine. Let's see how the new docker container (that you will be uploading soon to the docker repo) performs. Please let me know when it is ready.

pcgen1 commented 5 years ago

FYI: I was using a docker container from this account earlier: https://hub.docker.com/r/virenar/bamgineer. It doesn't look like your account.

pcgen1 commented 5 years ago

Suluxan, does bamgineer delete the tmp bams in the tmpbam dir after the execution completes? I see that the execution completed (no python script running in the "top" output), but no final bam was generated. Also, does the bedtool.log get deleted as well? FYI, I updated my cnv.bed, exons.bed and phased vcfs to include entries for two chromosomes, i.e. 21 and 22, instead of just one (i.e. 21). The script execution ended with the following lines in the log file, but no final bam was created.

/mnt/DataDisk/Bamgineer generating phased bed _ filtering bed file columns for amp4AABB47974300_tmp2.bed __ filtering bed file columns for gainAAB18300720_tmp2.bed ___

Please note that am using -splitBamDir option with the following files in my splitBam dir: chr21.bam chr21.bam.bai chr21.byname.bam chr22.bam chr22.bam.bai chr22.byname.bam
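
The split-bam layout described above (per chromosome: `chrN.bam`, `chrN.bam.bai`, `chrN.byname.bam`) can be verified before launching a run. This is a minimal sketch that assumes exactly that naming, taken from the file list in this comment; the helper names are hypothetical and not part of bamgineer.

```python
import os


def expected_split_files(chroms):
    """File names expected per chromosome, following the listing in this thread."""
    files = []
    for chrom in chroms:
        files += [
            "%s.bam" % chrom,
            "%s.bam.bai" % chrom,
            "%s.byname.bam" % chrom,
        ]
    return files


def missing_split_files(chroms, present):
    """Return the expected file names that are not in `present`."""
    present = set(present)
    return [name for name in expected_split_files(chroms) if name not in present]


if __name__ == "__main__":
    # Hypothetical usage: replace "splitBam" with your own split-bam directory.
    print(missing_split_files(["chr21", "chr22"], os.listdir("splitBam")))
```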

pcgen1 commented 5 years ago

Which versions of pathos and pandas do you recommend? It is not clear from the Dockerfile. Also, is specifying -cancertype necessary? Could I set -cancertype to None? Currently, I am not using the -cancertype argument on the command line at all. We focus on germline analysis at the moment.

suluxan commented 5 years ago

That docker container was not from us. It is from a user. I did not remove tmpbams for debugging purposes. Your output should look something like this:

generating phased bed _ filtering bed file columns for amp4AAAB30227447_tmp2.bed __
extracting roi bams
splitting original bam into hap1 and hap2
re-pairing hap1 bam reads
removing repaired duplicates
re-pairing hap2 bam reads
extracting non-roi bams
removing repaired duplicates
removing hap1 merged normal duplicates
removing hap2 merged normal duplicates
removing merged duplicates near breakpoints
___

I am updating the documentation, pathos is no longer necessary since we have updated multiprocessing to multiprocess. For pandas, I am on 0.20.2 but it should not matter. The image should solve all dependency issues.

The "-cancertype" is not necessary, it just organizes the output bam directories into a cancer type directory.
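
The expected output quoted in this comment can be used to see how far a stalled run got. The sketch below simply searches a log for each stage message in order and reports the furthest one found. The stage list is one reading of the flattened output in this thread, with the repeated "removing repaired duplicates" message listed once, and the function name is hypothetical.

```python
# Stage messages, in the order they appear in the expected output above.
EXPECTED_STAGES = [
    "generating phased bed",
    "filtering bed file columns",
    "extracting roi bams",
    "splitting original bam into hap1 and hap2",
    "re-pairing hap1 bam reads",
    "removing repaired duplicates",
    "re-pairing hap2 bam reads",
    "extracting non-roi bams",
    "removing hap1 merged normal duplicates",
    "removing hap2 merged normal duplicates",
    "removing merged duplicates near breakpoints",
]


def last_stage_reached(log_text):
    """Return (index, stage) of the furthest expected stage in the log, or None."""
    last = None
    for index, stage in enumerate(EXPECTED_STAGES):
        if stage in log_text:
            last = (index, stage)
    return last


if __name__ == "__main__":
    log = ("generating phased bed _ filtering bed file columns for "
           "amp4AABB47974300_tmp2.bed __ filtering bed file columns for "
           "gainAAB18300720_tmp2.bed ___")
    print(last_stage_reached(log))
```

Applied to the log posted earlier in the thread, this reports that the run stopped after the bed filtering step and never reached "extracting roi bams".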

pcgen1 commented 5 years ago

Thanks so much suluxan. I would try out the container..:)

pcgen1 commented 5 years ago

Couldn't find your docker image on Docker Hub. Sorry, I thought you had already uploaded it. Any ETA you could give would be great.

suluxan commented 5 years ago

Ah sorry I have been working on other things. At the latest I will have it up for you by tomorrow. Will let you know as soon as I do; thanks!

pcgen1 commented 5 years ago

Thanks!

pcgen1 commented 5 years ago

Hi, suluxan, could you paste the command line from your most recent bamgineer run? Thanks.

suluxan commented 5 years ago

The docker image is available at suluxan/bamgineer. You can use singularity to build it with:

singularity build bamgineer.simg docker://suluxan/bamgineer:initial

The tools in the configfile in bamgineer/docker-example/inputs are linked to the image itself so they require no changes. Just mount or move your files into the container and point to them in the config file and the python script and run!

pcgen1 commented 5 years ago

Sure, thanks suluxan!

pcgen1 commented 5 years ago

It worked, suluxan! Thank you.

Note: use --sandbox on the singularity build command line to be able to modify files inside the image/container directory (for example, the config that you are talking about). I always do that when I want to see what is inside a container and/or modify configs therein, like yours.
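
For scripted setups, the two build variants mentioned in this thread (a plain image and a writable `--sandbox` directory) can be assembled as argument lists. This is only a sketch: the function name and the target names are hypothetical, the image reference is the one given above, and nothing is executed unless the result is passed to something like `subprocess.run`.

```python
def singularity_build_cmd(target,
                          source="docker://suluxan/bamgineer:initial",
                          sandbox=False):
    """Return the argv list for `singularity build`, optionally with --sandbox."""
    cmd = ["singularity", "build"]
    if sandbox:
        # A sandbox build produces a writable directory instead of an image file.
        cmd.append("--sandbox")
    cmd += [target, source]
    return cmd


if __name__ == "__main__":
    print(" ".join(singularity_build_cmd("bamgineer.simg")))
    print(" ".join(singularity_build_cmd("bamgineer_sandbox", sandbox=True)))
```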

pcgen1 commented 5 years ago

Hi suluxan, does the exon bed need to be 0-based or 1-based?

suluxan commented 5 years ago

1-based but it is a whole genome start and end coordinate i.e. chr21 1 48129895 for hg19. The exons.bed name convention was kept from the previous version.
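
A file in the shape described here (one line per chromosome, starting at 1 and ending at the chromosome length) can be derived from a FASTA index. The sketch below assumes the standard `.fai` layout produced by `samtools faidx`, where the first two tab-separated columns are the sequence name and its length, and it assumes tab-separated output; the function names are hypothetical.

```python
def lengths_from_fai(fai_text):
    """Map sequence name -> length from the text of a .fai index."""
    lengths = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths


def whole_chromosome_bed(lengths):
    """Return 1-based, whole-chromosome lines such as 'chr21<TAB>1<TAB>48129895'."""
    return ["%s\t1\t%d" % (chrom, length) for chrom, length in lengths.items()]


if __name__ == "__main__":
    # The chr21 length is the hg19 value quoted in this thread.
    fai = "chr21\t48129895\t6\t60\t61\n"
    print("\n".join(whole_chromosome_bed(lengths_from_fai(fai))))
```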

pcgen1 commented 5 years ago

awesome..thanks!

pcgen1 commented 5 years ago

Hi suluxan, it seems bamgineer requires the "chr" prefix for chromosome names in bams, beds, etc. For example, if chromosomes are named just "1", "2", "3", etc., bamgineer does not move forward. Could you fix that for us so that we do not have to worry about converting our bams to match the "chr" naming convention? We use the GATK/Broad/NCBI reference genome in our pipeline as opposed to the UCSC ones, so we do not have "chr" prefixed to our chromosome names. Converting the bams later to match those chromosome names is a pain in the neck.

suluxan commented 5 years ago

Hey, Sorry for getting back so late, I've been away the past few weeks. I believe this has to do with pysam because there aren't any restrictions in the bamgineer code to using the chr naming conventions. I will investigate some more and get back to you about this. Thanks.
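
Until the cause is confirmed, a mismatch between the two naming conventions can at least be detected up front. The sketch below is a generic illustration and not bamgineer code: the helper names are hypothetical, and it only handles the plain "chr" prefix, not special cases such as "chrM" versus "MT".

```python
def add_chr_prefix(name):
    """'21' -> 'chr21'; names that already have the prefix are returned unchanged."""
    return name if name.startswith("chr") else "chr" + name


def strip_chr_prefix(name):
    """'chr21' -> '21'; names without the prefix are returned unchanged."""
    return name[3:] if name.startswith("chr") else name


def naming_mismatch(bam_contigs, bed_contigs):
    """True if one input uses 'chr'-prefixed names and the other does not."""
    bam_has_chr = any(c.startswith("chr") for c in bam_contigs)
    bed_has_chr = any(c.startswith("chr") for c in bed_contigs)
    return bam_has_chr != bed_has_chr


if __name__ == "__main__":
    print(naming_mismatch(["21", "22"], ["chr21"]))
    print([add_chr_prefix(c) for c in ["21", "22"]])
```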

pcgen1 commented 5 years ago

Thanks. Please let me know..