kbaseattic / assembly

An extensible framework for genome assembly.
MIT License

implement rolling QC (RQC) pipeline #50

Closed sebhtml closed 10 years ago

sebhtml commented 10 years ago

rolling QC (RQC) pipeline

sebhtml commented 10 years ago

./scripts/add-comp.pl -t /space2/seb/tmp -d /space2/seb/bin jgi_rqc

sebhtml commented 10 years ago

    [seb@sal assembly]# ./scripts/add-comp.pl -t /space2/seb/tmp -d /space2/seb/bin jgi_rqc
    Installing jgi_rqc...
    Cloning into 'jgi-rqc-pipeline'...
    remote: Counting objects: 3033, done.
    remote: Compressing objects: 100% (1510/1510), done.
    remote: Total 3033 (delta 1480), reused 3033 (delta 1480)
    Receiving objects: 100% (3033/3033), 160.73 MiB | 5.10 MiB/s, done.
    Resolving deltas: 100% (1480/1480), done.

sebhtml commented 10 years ago

next: identify the entry point for rqc

sebhtml commented 10 years ago

mkdir -p destination ; rm -rf tmp; ./scripts/add-comp.pl -t tmp -d destination jgi_rqc

sebhtml commented 10 years ago

mkdir -p destination ; rm -rf tmp; ./scripts/add-comp.pl -t tmp -d $(pwd)/destination jgi_rqc

sebhtml commented 10 years ago

    [seb@sal readqc]# ls
    lib  readqc.py  readqc_report.py  tools
    [seb@sal readqc]# ./readqc.py
    Traceback (most recent call last):
      File "./readqc.py", line 70, in <module>
        from readqc_utils import *
      File "./lib/readqc_utils.py", line 22, in <module>
        from db_access import *
      File "./../lib/db_access.py", line 13, in <module>
        import MySQLdb
    ImportError: No module named MySQLdb

sebhtml commented 10 years ago

http://stackoverflow.com/questions/22252397/importerror-no-module-named-mysqldb
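One way a pipeline could tolerate a missing MySQLdb module is to defer the failure until a database is actually needed. The sketch below is an assumption about how that might look, not the upstream code; `require_mysql` is a hypothetical helper.

```python
# Hypothetical sketch: treat MySQLdb as optional at import time.
try:
    import MySQLdb  # provided by the MySQL-python / mysqlclient packages
except ImportError:
    MySQLdb = None


def require_mysql():
    """Return the MySQLdb module, or raise a clear error if it is missing."""
    if MySQLdb is None:
        raise RuntimeError(
            "MySQLdb is not installed; database-backed steps are unavailable"
        )
    return MySQLdb
```

With this pattern, steps that never touch the database can still run on a machine without the module.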

sebhtml commented 10 years ago

./destination/jgi_rqc/readqc/readqc.py --fastq ~/dropbox/GPIC.1424-1.1371.fastq --output-path output-1 --kmer 63

sebhtml commented 10 years ago

`-bash---python---sh---perl---cat

sebhtml commented 10 years ago

it is running now...

sebhtml commented 10 years ago

[seb@sal assembly]# ls output-1/ readqc.log readqc_stats.tmp readqc_status.log subsample uniqueness

sebhtml commented 10 years ago

    [seb@sal assembly]# grep -i fail output-1/readqc.log
    os_utility.py :31732 2014-05-30 16:33:15,734 INFO: cmd: set -e; cat /home/seb/dropbox/GPIC.1424-1.1371.fastq | Failed to find 'cplusmersampler' installation.
    os_utility.py :31732 2014-05-30 16:33:15,742 INFO: Return values: exitCode=127, stdOut=, stdErr=/bin/sh: 1: Failed: not found
    readqc_utils.py:31732 2014-05-30 16:33:15,742 ERROR: - fail to sample unique 20 mers.
    readqc.py :31732 2014-05-30 16:33:15,742 INFO: 2_unique_mers_sampling failed.
    readqc.py :31732 2014-05-30 16:33:15,743 INFO: Status 2_unique_mers_sampling failed

What is cplusmersampler?
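The log shows the lookup's error message being spliced into the shell command (`cat ... | Failed to find ...`), which /bin/sh then tries to run as a program named `Failed`. A minimal sketch of a lookup that raises instead, assuming Python 3 (`shutil.which` is not available in Python 2.7); `find_tool` and `ToolNotFoundError` are hypothetical names, not part of the upstream code.

```python
import shutil


class ToolNotFoundError(RuntimeError):
    """Raised when a required executable cannot be located."""


def find_tool(name, extra_dirs=()):
    """Return the path of `name`, searching extra_dirs first and then PATH.

    Raising here keeps an error message from ever being used as a command.
    """
    for directory in extra_dirs:
        path = shutil.which(name, path=directory)
        if path:
            return path
    path = shutil.which(name)
    if path is None:
        raise ToolNotFoundError("Failed to find %r installation." % name)
    return path
```

A caller would then stop at the lookup with a clear exception rather than at a later step with exit code 127.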

sebhtml commented 10 years ago

The asset (private) is now available here (executable, no source code):

https://bitbucket.org/sebhtml/jgi-assets/src

sebhtml commented 10 years ago

To install:

seb@bigmem:~/kbase-stuff/assembly$ rm -rf destination; mkdir -p destination ; rm -rf tmp; ./scripts/add-comp.pl -t tmp -d $(pwd)/destination jgi_rqc

To use:

./destination/jgi_rqc/readqc/readqc.py --fastq ~/dropbox/GPIC.1424-1.1371.fastq --output-path output-1 --kmer 63

sebhtml commented 10 years ago

Success:

    seb@bigmem:~/kbase-stuff/assembly$ ./destination/jgi_rqc/readqc/readqc.py --fastq ~/dropbox/GPIC.1424-1.1371.fastq --output-path output-1 --kmer 63
    Started readqc pipeline, writing log to: output-1/readqc.log
    seb@bigmem:~/kbase-stuff/assembly$ find output-1/
    output-1/
    output-1/uniqueness
    output-1/readqc_status.log
    output-1/readqc_stats.tmp
    output-1/readqc.log
    output-1/subsample
    output-1/subsample/GPIC.1424-1.1371.s0.01.fastq
    output-1/subsample/first_subsampled.txt
    output-1/subsample/GPIC.1424-1.1371.stats

sebhtml commented 10 years ago

https://github.com/sebhtml/assembly/commit/6226952f40dc14882e13d699e54484dc1cac608e

sebhtml commented 10 years ago

The pipeline still does not find its own bundled executable:

    seb@bigmem:~/kbase-stuff/assembly$ grep cplusmersampler output-1/readqc.log
    os_utility.py :887 2014-06-11 17:46:59,609 INFO: cmd: set -e; cat /home/seb/dropbox/GPIC.1424-1.1371.fastq | Failed to find 'cplusmersampler' installation.

    seb@bigmem:~/kbase-stuff/assembly$ find destination/|grep cplusmersampler
    destination/jgi_rqc/readqc/tools/cplusmersampler

    seb@bigmem:~/kbase-stuff/assembly$ file destination/jgi_rqc/readqc/tools/cplusmersampler
    destination/jgi_rqc/readqc/tools/cplusmersampler: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.8, not stripped

Will continue this tomorrow.

sebhtml commented 10 years ago

OK...

Setting the PATH does not work.

    seb@bigmem:~/kbase-stuff/assembly$ export PATH=$(pwd)/destination/jgi_rqc/assets/:$PATH

    seb@bigmem:~/kbase-stuff/assembly$ which cplusmersampler
    /home/seb/kbase-stuff/assembly/destination/jgi_rqc/assets//cplusmersampler

    seb@bigmem:~/kbase-stuff/assembly$ ./destination/jgi_rqc/readqc/readqc.py --fastq ~/dropbox/GPIC.1424-1.1371.fastq --output-path output-1 --kmer 63

    os_utility.py :9059 2014-06-13 16:01:13,766 INFO: cmd: set -e; cat /home/seb/dropbox/GPIC.1424-1.1371.fastq | Failed to find 'cplusmersampler' installation.

sebhtml commented 10 years ago

There are some tests that ship with the product:

    seb@bigmem:~/kbase-stuff/assembly$ destination/jgi_rqc/lib/os_utility.py &> log

    ..
    Ran 2 tests in 0.015s

    OK
    Failed to find 'blast' installation.
    Failed to find 'blast' installation.
    Failed to find 'blast' installation.
    Failed to find 'blast' installation.
    Failed to find 'blast+' installation.
    Failed to find 'agrep' installation.
    Failed to find 'tagdust' installation.
    Failed to find 'gnuplot' installation.
    Failed to find 'bwa' installation.
    Failed to find 'fastq_to_fasta_qual' installation.
    Failed to find 'cat' installation.
    Failed to find 'bzcat' installation.
    Failed to find 'zcat' installation.
    Failed to find 'head' installation.
    Failed to find 'tail' installation.
    Failed to find 'perl' installation.
    Failed to find 'grep' installation.
    Failed to find 'nawk' installation.
    Failed to find 'xxx' installation.
    Failed to find 'fastqTrimmer' installation.
    Failed to find 'duk' installation.
    Failed to find 'fastqQhist' installation.
    Failed to find 'histo_parse.pl' installation.
    Failed to find 'rm' installation.
    -f
    Failed to find 'mkdir' installation.
    -p
    Failed to find 'cplusmersampler' installation.
    Failed to find 'fq2fa.pl' installation.
    Failed to find 'GCcontent.pl' installation.
    Failed to find 'histogram2.pl' installation.
    Failed to find 'histo_parse.pl' installation.
    Failed to find 'checkIllQualLANLfq.sh' installation.

I think it is broken.

levinas commented 10 years ago

It seems getToolPath() in lib/os_utility2.py is used for getting the path, and there are many hardcoded paths where the function looks for the executable.

Looking at readqc/lib/readqc_constants.py, I feel we will have many more dependencies in the form of hardcoded reference databases (/global/dna/shared/rqc/ref_databases/*).
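A common way to make such paths deployable is to let an environment variable override the built-in default. The sketch below assumes a hypothetical variable name, `RQC_REF_DATABASES`, and a hypothetical helper `ref_db_dir`; neither comes from the upstream code. The default is the directory mentioned above.

```python
import os

# Default taken from the hardcoded location mentioned in this thread.
DEFAULT_REF_DB_DIR = "/global/dna/shared/rqc/ref_databases"


def ref_db_dir(environ=None):
    """Return the reference database directory.

    RQC_REF_DATABASES is a hypothetical override, not an upstream setting.
    """
    environ = os.environ if environ is None else environ
    return environ.get("RQC_REF_DATABASES", DEFAULT_REF_DB_DIR)
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.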

On Jun 13, 2014, at 11:03 AM, Sébastien Boisvert wrote: "Setting the PATH does not work."

sebhtml commented 10 years ago

This should be fixed upstream in my opinion.

levinas commented 10 years ago

I agree. It needs to be deployable.

On Jun 13, 2014, at 11:22 AM, Sébastien Boisvert wrote: "This should be fixed upstream in my opinion."

sebhtml commented 10 years ago

The product needs this dependency:

http://modules.sourceforge.net/

sebhtml commented 10 years ago

Instructions: http://nickgeoghegan.net/linux/installing-environment-modules

sebhtml commented 10 years ago

This Python code is not compatible with Ubuntu because:

Python's Popen has an option to run the command through the shell, and the default shell is '/bin/sh'. On Fedora / CentOS / RHEL, /bin/sh is bash (/bin/sh -> bash). On Ubuntu, it is dash (/bin/sh -> dash).

The problem is that Modules is not compatible with dash.

Workaround:

try to use this:

Popen(['/bin/bash', '-c', args[0], args[1], ...])

command = ['/bin/bash', '-c', 'module load cplusmersampler && which cplusmersampler']

But to have access to module, it is required to have access to /packages/modules/3.2.9-1/Modules/3.2.9/init/bash

I'll patch the code so that module is not a requirement...
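A minimal sketch of the bash workaround, assuming Python 3 and that /bin/bash exists on the machine; `run_in_bash` is a hypothetical helper. Passing `executable="/bin/bash"` together with `shell=True` makes `subprocess` use bash instead of /bin/sh, so the behaviour should not depend on whether /bin/sh points to bash or dash.

```python
import subprocess


def run_in_bash(command):
    """Run a shell command string under bash; return (exit_code, stdout, stderr)."""
    process = subprocess.Popen(
        command,
        shell=True,
        executable="/bin/bash",  # override the default /bin/sh
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
    )
    stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr
```

For example, `run_in_bash("module load cplusmersampler && which cplusmersampler")` would run the command from the workaround, but as noted above the `module` function is only defined once the Modules init script for bash has been sourced.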

sebhtml commented 10 years ago

The code is looking for 'command not found', but dash just says 'not found'.

    seb@bigmem:~/kbase-stuff/assembly$ dash -c command-12345678
    dash: 1: command-12345678: not found
    seb@bigmem:~/kbase-stuff/assembly$ bash -c command-12345678
    bash: command-12345678: command not found
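Because the wording differs between shells, a check based on the exit status is less fragile than matching the message text (both dash and bash exit with status 127 for a missing command, as in the earlier log). A minimal sketch, assuming Python 3; `command_exists` is a hypothetical helper, not the upstream code. It relies on the POSIX `command -v` builtin.

```python
import subprocess


def command_exists(name, shell="/bin/sh"):
    """Return True if `name` resolves to a command in the given shell.

    Only the exit status is inspected, so the result does not depend on
    the wording of the shell's error message.
    """
    result = subprocess.run(
        # After the -c script, the next argument is $0 and `name` becomes $1.
        [shell, "-c", 'command -v "$1" >/dev/null 2>&1', shell, name],
    )
    return result.returncode == 0
```

Passing the name as a positional argument rather than interpolating it into the script also avoids quoting problems.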

sebhtml commented 10 years ago

running test.

sebhtml commented 10 years ago

Still does not work after patching...

It is in my PATH:

    seb@bigmem:~/kbase-stuff/assembly$ which cplusmersampler
    /home/seb/kbase-stuff/assembly/destination/jgi_rqc/assets//cplusmersampler

The unit test finds the executable:

    seb@bigmem:~/kbase-stuff/assembly$ destination/jgi_rqc/lib/os_utility.py | grep cplus

    ..
    Ran 2 tests in 0.016s

    OK
    /home/seb/kbase-stuff/assembly/destination/jgi_rqc/readqc/tools/cplusmersampler

    seb@bigmem:~/kbase-stuff/assembly$ grep cplus output-1/readqc.log
    os_utility.py :21452 2014-06-13 17:19:09,884 INFO: cmd: set -e; cat /home/seb/dropbox/GPIC.1424-1.1371.fastq | Failed to find 'cplusmersampler' installation.

sebhtml commented 10 years ago

Oh, there are 2 copies of the os_utility Python code:

    seb@bigmem:~/kbase-stuff/assembly$ find destination/|grep os_utility|grep py$
    destination/jgi_rqc/lib/os_utility2.py
    destination/jgi_rqc/lib/os_utility.py

    seb@bigmem:~/kbase-stuff/assembly$ sha1sum destination/jgi_rqc/lib/os_utility.py destination/jgi_rqc/lib/os_utility2.py
    b62fb21da0f2f1e0589b920cc238b8e7fd9f34ad destination/jgi_rqc/lib/os_utility.py
    8e557f60a306577e7369c7252ba1ca7e8d0538e6 destination/jgi_rqc/lib/os_utility2.py

I suppose that readqc uses os_utility2.py while the unit test uses os_utility.py.
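The same comparison can be scripted, for example to flag diverging copies of a module during installation. A minimal sketch, assuming Python 3; `sha1_of` and `files_differ` are hypothetical helpers.

```python
import hashlib


def sha1_of(path):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_differ(path_a, path_b):
    """Return True when the two files have different contents."""
    return sha1_of(path_a) != sha1_of(path_b)
```

Calling `files_differ` on the two os_utility files would report that they differ, matching the sha1sum output.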

sebhtml commented 10 years ago

Yup ;-).

I need to patch os_utility2 too.

    seb@bigmem:~/kbase-stuff/assembly$ grep os_utility2 destination/* -R|grep -v Binary
    destination/jgi_rqc/readqc/readqc.py: 20130415 5.0.8: Cleanup; os_utility2.py;
    destination/jgi_rqc/readqc/lib/readqc_constants.py:from os_utility2 import getToolPath
    destination/jgi_rqc/lib/rqc_constants.py:from os_utility2 import getToolPath
    destination/jgi_rqc/lib/rqc_constants.py:from os_utility2 import getToolPath

sebhtml commented 10 years ago

    seb@bigmem:~/kbase-stuff/assembly$ destination/jgi_rqc/lib/os_utility2.py | grep cplus

    Ran 0 tests in 0.000s

    OK
    Failed to find 'cplusmersampler' installation.

sebhtml commented 10 years ago

cplusmersampler now works.

New error: Failed to find 'gnuplot' installation

sebhtml commented 10 years ago

15 steps:

    ## 1. fast_subsample_fastq_sequences
    ## 2. write_unique_20_mers
    ## 3. generate read GC histograms: illumina_read_gc
    ## 4. read_quality_stats
    ## 5. write_base_quality_stats
    ## 6. illumina_count_q_score
    ## 7. illumina_calculate_average_quality
    ## 8. illumina_find_common_motifs
    ## 9. illumina_run_bwa
    ## 10. illumina_run_tagdust
    ## 11. illumina_detect_read_contam
    ## 12. illumina_sciclone_analysis
    ## 13. illumina_read_megablast
    ## 14. multiplex_statistics
    ## 15. end_of_read_illumina_adapter_check

sebhtml commented 10 years ago

It fails at step 9:

    Traceback (most recent call last):
      File "/usr/lib/python2.7/logging/__init__.py", line 846, in emit
        msg = self.format(record)
      File "/usr/lib/python2.7/logging/__init__.py", line 723, in format
        return fmt.format(record)
      File "/usr/lib/python2.7/logging/__init__.py", line 464, in format
        record.message = record.getMessage()
      File "/usr/lib/python2.7/logging/__init__.py", line 328, in getMessage
        msg = msg % self.args
    TypeError: not all arguments converted during string formatting
    Logged from file readqc.py, line 625
    chmod: cannot access `output-2/log': No such file or directory
    Traceback (most recent call last):
      File "./destination/jgi_rqc/readqc/readqc.py", line 1495, in <module>
        status = do_illumina_run_bwa(first_sub_fastq, log)
      File "./destination/jgi_rqc/readqc/readqc.py", line 685, in do_illumina_run_bwa
        ret, bwa_file = illumina_run_bwa(fastq, log)
      File "./destination/jgi_rqc/readqc/lib/readqc_utils.py", line 1360, in illumina_run_bwa
        chmod(log_path, "0755")
      File "./destination/jgi_rqc/readqc/../lib/os_utility.py", line 223, in chmod
        run(["chmod"]+opts.split()+[mode]+path, dryRun=dryRun)
      File "./destination/jgi_rqc/readqc/../lib/os_utility.py", line 66, in run
        raise CalledProcessError(returncode=returncode, cmd=str(popenargs))
    subprocess.CalledProcessError: Command '(['chmod', '0755', 'output-2/log'],)' returned non-zero exit status 1
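The first traceback is what Python's logging module reports when a log call passes more arguments than its message has placeholders. The call at readqc.py line 625 is not shown in this thread, so the sketch below uses made-up messages to reproduce the same TypeError, assuming Python 3; `format_record` is a hypothetical helper.

```python
import logging


def format_record(message, *args):
    """Build a LogRecord the way a logger call would and format its message."""
    record = logging.LogRecord(
        name="readqc", level=logging.INFO, pathname="example.py", lineno=0,
        msg=message, args=args, exc_info=None,
    )
    return record.getMessage()  # performs `message % args`


# One placeholder, one argument: formats cleanly.
print(format_record("running %s", "bwa"))

# No placeholder but one argument: raises the TypeError seen in the traceback.
try:
    format_record("running bwa", "output-2")
except TypeError as error:
    print("TypeError:", error)
```

The usual fix is to give every argument a matching `%s` placeholder in the message.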

sebhtml commented 10 years ago

The tool creates the directory output-2/output-2/log/, but then uses output-2/log/
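One way such a doubled path can arise is when a program changes into the output directory and then joins the same relative path again. This is an assumption: the thread does not show the responsible code. The sketch below reproduces that pattern in a temporary directory and shows that resolving the path to an absolute one first avoids it. It assumes Python 3, and all names are hypothetical.

```python
import os
import tempfile


def make_log_dir(output_path):
    """Create <output_path>/log after changing into output_path.

    With a relative output_path the join below is applied a second time.
    """
    os.makedirs(output_path, exist_ok=True)
    os.chdir(output_path)
    log_dir = os.path.join(output_path, "log")
    os.makedirs(log_dir, exist_ok=True)
    return os.path.abspath(log_dir)


start = os.getcwd()
work = os.path.realpath(tempfile.mkdtemp())
try:
    os.chdir(work)
    doubled = make_log_dir("output-2")  # relative path
    os.chdir(work)
    fixed = make_log_dir(os.path.abspath("output-3"))  # absolute path
finally:
    os.chdir(start)

print(os.path.relpath(doubled, work))  # output-2/output-2/log
print(os.path.relpath(fixed, work))    # output-3/log
```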

sebhtml commented 10 years ago

Test result after adding 3rd patch:

Crashing at 12_illumina_sciclone_analysis

sebhtml commented 10 years ago

It is trying to connect to a MySQL server:

    ^CTraceback (most recent call last):
      File "./destination/jgi_rqc/readqc/readqc.py", line 1544, in <module>
        status = do_illumina_sciclone_analysis(first_sub_fastq, fastq, log, lib_name=lib_name, is_rna=is_rna)
      File "./destination/jgi_rqc/readqc/readqc.py", line 889, in do_illumina_sciclone_analysis
        ret, strandedness_output_log, DNA_COUNT_FILE, RNA_COUNT_FILE, outResultDict = illumina_sciclone_analysis(origFastq, is_pe, log, lib_name=lib_name, is_rna=is_rna)
      File "./destination/jgi_rqc/readqc/lib/readqc_utils.py", line 1632, in illumina_sciclone_analysis
        x, y, libName, is_rna = get_lib_info(seq_unit_name, log)
      File "./destination/jgi_rqc/readqc/lib/readqc_utils.py", line 3001, in get_lib_info
        db = db_connect(db_server = "##############", db_name = "rqc")
      File "./destination/jgi_rqc/readqc/../lib/db_access.py", line 52, in db_connect
        db = MySQLdb.connect(host = db_server, user = db_user, passwd = db_pwd, db = db_name, charset = "utf8", use_unicode = True)
    KeyboardInterrupt
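The run had to be interrupted with Ctrl-C while it was waiting on the database connection. A minimal sketch of a reachability pre-check with a timeout, assuming Python 3 and the default MySQL port 3306; `db_reachable` is a hypothetical helper, and the host name is masked in the traceback, so none is filled in here.

```python
import socket


def db_reachable(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A step that depends on the database could call this first and skip itself, or fail with a clear message, instead of blocking.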

sebhtml commented 10 years ago

The upstream code has too many dependencies and is not properly documented.