jbloomlab / dms_tools2

software for the analysis and visualization of deep mutational scanning data
GNU General Public License v3.0

Doud2016 example, error in dms2_bcsubamp #29

Closed grhuynh closed 4 years ago

grhuynh commented 4 years ago

I'm trying to run the Doud2016 Jupyter notebook and ran into two issues with dms2_batch_bcsubamp under the "Align the deep sequencing data" section.

First, when running within the Jupyter notebook, no error was printed even though dms2_batch_bcsubamp was failing. Second, when I ran the dms2_batch_bcsubamp call directly from the command line, I got the following error in the log files for all the samples:

INFO - Read refseq of 1698 codons from notebooks/Doud2016/data/WSN-HA.fasta
ERROR - Terminating dms2_bcsubamp with ERROR
Traceback (most recent call last):
  File "/data/anaconda/envs/dms/bin/dms2_bcsubamp", line 130, in main
    (refseqstart, refseqend, r1start, r2start) = map(int, s.split(","))
ValueError: invalid literal for int() with base 10: '37 286'

I'm not sure how to interpret this error, since I don't think I made any edits to the jupyter notebook. Any ideas on how to interpret this error? Thanks!
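The ValueError can be reproduced outside dms_tools2. What follows is a minimal sketch, assuming that two alignment specs reached the parser fused into one string; the unpacking line is copied from the traceback above, and the variable names `good`, `fused`, and `message` are illustrative only:

```python
# The unpacking line is copied from the traceback; the inputs are the first
# two alignment specs from the failing command.
good = "1,285,36,37"  # one spec: four comma-separated integers
(refseqstart, refseqend, r1start, r2start) = map(int, good.split(","))
print(refseqstart, refseqend, r1start, r2start)

# Assumption: two specs arrive joined by a space instead of as separate values.
fused = "1,285,36,37 286,570,31,32"
try:
    (refseqstart, refseqend, r1start, r2start) = map(int, fused.split(","))
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

Splitting the fused string on commas leaves '37 286' as a single field, which int() rejects, and that matches the message in the log.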

jbloom commented 4 years ago

What is the full text of the command you typed at the prompt that gave you the second error? Basically, can you provide a minimal working example of what fails, such as a ZIP file with the input / output / bash command?

grhuynh commented 4 years ago

(dms) gracehuynh@IDRI-ms:/data/home/gracehuynh$ dms2_batch_bcsubamp --batchfile notebooks/Doud2016/results/codoncounts/batch.csv --refseq notebooks/Doud2016/data/WSN-HA.fasta --alignspecs '1,285,36,37 286,570,31,32 571,855,37,32 856,1140,31,36, 1141,1425,29,33 1426,1698,40,43' --outdir notebooks/Doud2016/results/codoncounts --summaryprefix summary --R1trim 200 --R2trim 170 --fastqdir notebooks/Doud2016/results/FASTQ_files/ --ncpus -1 --use_existing 'yes'
INFO:dms2_batch_bcsubamp:Beginning execution of dms2_batch_bcsubamp in directory /data/home/gracehuynh

INFO:dms2_batch_bcsubamp:Progress is being logged to notebooks/Doud2016/results/codoncounts/summary.log
INFO:dms2_batch_bcsubamp:Version information:
    Time and date: Tue Aug 20 21:48:53 2019
    Platform: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid
    Python version: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) [GCC 7.3.0]
    dms_tools2 version: 2.5.0
    Bio version: 1.74
    HTSeq version: 0.11.2
    pandas version: 0.25.0
    numpy version: 1.16.4
    IPython version: 7.7.0
    jupyter version: 1.0.0
    matplotlib version: 3.1.1
    plotnine version: 0.5.1
    natsort version: 6.0.0
    pystan version: 2.16.0.0
    scipy version: 1.2.2
    seaborn version: 0.9.0
    phydmslib version: 2.3.3
    statsmodels version: 0.10.1
    rpy2 cannot be imported
    regex version: 2.5.33
    umi_tools version: 1.0.0

INFO:dms2_batch_bcsubamp:Parsed the following arguments:
    outdir = notebooks/Doud2016/results/codoncounts
    ncpus = -1
    use_existing = yes
    refseq = notebooks/Doud2016/data/WSN-HA.fasta
    alignspecs = ['1,285,36,37 286,570,31,32 571,855,37,32 856,1140,31,36, 1141,1425,29,33 1426,1698,40,43']
    bclen = 8
    fastqdir = notebooks/Doud2016/results/FASTQ_files/
    R2 = None
    R1trim = [200]
    R2trim = [170]
    bclen2 = None
    chartype = codon
    maxmuts = 4
    minq = 15
    minreads = 2
    minfraccall = 0.95
    minconcur = 0.75
    sitemask = None
    purgeread = 0
    purgebc = 0
    bcinfo = False
    batchfile = notebooks/Doud2016/results/codoncounts/batch.csv
    summaryprefix = summary

INFO:dms2_batch_bcsubamp:Parsing sample info from notebooks/Doud2016/results/codoncounts/batch.csv
INFO:dms2_batch_bcsubamp:Read the following sample information:
name,R1
mutDNA-1,mutDNA-1_R1.fastq.gz
mutDNA-2,mutDNA-2_R1.fastq.gz
mutDNA-3,mutDNA-3_R1.fastq.gz
mutvirus-1,mutvirus-1_R1.fastq.gz
mutvirus-2,mutvirus-2_R1.fastq.gz
mutvirus-3,mutvirus-3_R1.fastq.gz
wtDNA,wtDNA_R1.fastq.gz
wtvirus,wtvirus_R1.fastq.gz

INFO:dms2_batch_bcsubamp:Running dms2_bcsubamp on all samples using 4 CPUs...
INFO:dms2_batch_bcsubamp:Completed runs of dms2_bcsubamp.

ERROR:dms2_batch_bcsubamp:Terminating dms2_batch_bcsubamp with ERROR
Traceback (most recent call last):
  File "/data/anaconda/envs/dms/bin/dms2_batch_bcsubamp", line 152, in main
    '\n'.join(logfiles.values)))
AssertionError: Did not create all these files:
notebooks/Doud2016/results/codoncounts/mutDNA-1_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutDNA-2_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutDNA-3_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutvirus-1_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutvirus-2_codoncounts.csv
notebooks/Doud2016/results/codoncounts/mutvirus-3_codoncounts.csv
notebooks/Doud2016/results/codoncounts/wtDNA_codoncounts.csv
notebooks/Doud2016/results/codoncounts/wtvirus_codoncounts.csv

Look in following log files for details of what went wrong:
notebooks/Doud2016/results/codoncounts/mutDNA-1.log
notebooks/Doud2016/results/codoncounts/mutDNA-2.log
notebooks/Doud2016/results/codoncounts/mutDNA-3.log
notebooks/Doud2016/results/codoncounts/mutvirus-1.log
notebooks/Doud2016/results/codoncounts/mutvirus-2.log
notebooks/Doud2016/results/codoncounts/mutvirus-3.log
notebooks/Doud2016/results/codoncounts/wtDNA.log
notebooks/Doud2016/results/codoncounts/wtvirus.log

jbloom commented 4 years ago

I can't troubleshoot without having access to the actual input / output files. Do you want to make a minimal example, such as with just one or two samples and clipped small FASTQ files, and then send that with the exact commands you ran? Without being able to try to reproduce what you are running, I can't determine if it is some bug in the program or just some issue with your installation / computer.

grhuynh commented 4 years ago

Sure. I'm actually running the Doud2016 example exactly as provided. I figured out that the error occurs because the subamplicon alignment specs shouldn't be wrapped in single quotation marks on the command line. With the quotes removed, it is running now. Thanks for your help on this!
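The effect of the quotes can be illustrated with Python's standard `shlex.split`, which follows POSIX shell quoting rules and so approximates how bash tokenizes the command line. This is only a sketch using the first two specs; it does not run dms_tools2, and the variable names are made up for the illustration:

```python
import shlex

# Shortened versions of the command-line fragment, with and without quotes.
quoted = "--alignspecs '1,285,36,37 286,570,31,32'"
unquoted = "--alignspecs 1,285,36,37 286,570,31,32"

quoted_args = shlex.split(quoted)
unquoted_args = shlex.split(unquoted)

print(quoted_args)    # both specs arrive as a single argument
print(unquoted_args)  # each spec arrives as its own argument
```

The single-argument case is consistent with the "Parsed the following arguments" log above, where alignspecs is a list with only one element.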

Second question, do you have a general sense of how much compute is needed? How many CPUs did your team use and how long should it take for the dms2_batch_bcsubamp call in the Doud2016 jupyter notebook? Is it realistic to think this can be run on 4 cores on a virtual machine, or do I need to run this on a cluster? Thanks!

jbloom commented 4 years ago

Great, so are the quotes a bug in the Jupyter notebook that I should fix?

It will not take that long on a four-CPU machine. Downloading the FASTQ files from the SRA will take a while, but the rest will probably take less than an hour. Note that it does require quite a bit of RAM.

grhuynh commented 4 years ago

I'm not sure about the quotes - I wasn't ever able to get it to run from the Jupyter notebook. When I ran it directly from the command line I used the values without any quotes, like this:

alignspecs = 1,285,36,37 286,570,31,32 571,855,37,32 856,1140,31,36 1141,1425,29,33 1426,1698,40,43
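One further difference between the two commands: the originally pasted command has a trailing comma after the fourth spec (856,1140,31,36,), which is absent from the working values. Whether that comma alone would have caused a failure is not established in this thread. A pre-flight check along the following lines could flag it; `check_alignspecs` is a hypothetical helper, not part of dms_tools2, and it assumes each spec must be exactly four comma-separated integers, as the unpacking in the traceback suggests:

```python
def check_alignspecs(specs):
    """Return (index, spec, problem) tuples for malformed specs.

    Hypothetical helper, not part of dms_tools2. Assumes each spec is four
    comma-separated integers, matching the unpacking shown in the traceback.
    """
    problems = []
    for i, spec in enumerate(specs):
        fields = spec.split(",")
        if len(fields) != 4:
            problems.append((i, spec, f"expected 4 fields, found {len(fields)}"))
            continue
        try:
            [int(field) for field in fields]
        except ValueError as err:
            problems.append((i, spec, str(err)))
    return problems


# Specs as first pasted (quotes removed) and as used in the working run.
original = ("1,285,36,37 286,570,31,32 571,855,37,32 856,1140,31,36, "
            "1141,1425,29,33 1426,1698,40,43").split()
corrected = ("1,285,36,37 286,570,31,32 571,855,37,32 856,1140,31,36 "
             "1141,1425,29,33 1426,1698,40,43").split()

print(check_alignspecs(original))
print(check_alignspecs(corrected))
```

The first call reports the fourth spec, because the trailing comma produces an empty fifth field; the second call reports nothing.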

Also, fyi when it completed I did see several warnings (sample below), which might be due to my own matplotlib installation, but just wanted to put that out there. Thanks for your help!

/data/anaconda/envs/dms/lib/python3.6/site-packages/plotnine/scales/scale.py:93: MatplotlibDeprecationWarning: The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  if cbook.iterable(self.breaks) and cbook.iterable(self.labels):
/data/anaconda/envs/dms/lib/python3.6/site-packages/plotnine/utils.py:553: MatplotlibDeprecationWarning: The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  return cbook.iterable(var) and not is_string(var)

jbloom commented 4 years ago

OK, thanks.

I'm just going to close this as the source of the problem isn't clear.

The deprecation warnings are from plotnine, not dms_tools2. Probably if you upgrade to the newest plotnine they will go away (do pip install plotnine --upgrade --upgrade-strategy only-if-needed).