heathsc / gemBS

gemBS is a bioinformatics pipeline designed for high throughput analysis of DNA methylation from Whole Genome Bisulfite Sequencing data (WGBS).
GNU General Public License v3.0
32 stars 21 forks source link

gemBS index fails at 50% #61

Closed paul-sud closed 5 years ago

paul-sud commented 5 years ago

Hello,

I am trying to create a gemBS index for GRCh38, specifically this fasta: https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/ . However, it always fails at exactly 50%. I didn't have this issue indexing the test data from http://statgen.cnag.cat/GEMBS/UserGuide/_build/html/example.html . The configuration I used here is very similar. Unfortunately the traceback isn't very informative, the relevant logs are shown below.

I should add, I don't believe it to be a resource issue. I ran it with 8 cores, 12 GB RAM, and 200 GB disk space, and post mortem didn't reveal any concerning usage levels.

: 
: Command index started at 2019-08-26 21:01:56.662514
: 
2019-08-26 21:01:56,689 DEBUG: Using bundled binary : /root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gemBS_cat
2019-08-26 21:01:56,690 DEBUG: Using binary from PATH: /usr/bin/pigz
2019-08-26 21:01:56,690 INFO: Starting:
    /root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gemBS_cat reference/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz reference/lambda.fa.gz | /usr/bin/pigz
2019-08-26 21:01:56,690 DEBUG: Starting subprocess
2019-08-26 21:01:56,693 DEBUG: File output detected, opening output stream to indexes/GRCh38_no_alt_analysis_set_GCA_000001405.15.BS_gemBS.tmp.gz
2019-08-26 21:01:56,693 DEBUG: Setting process input to parent output
2019-08-26 21:01:56,693 DEBUG: Starting subprocess
2019-08-26 21:03:48,488 DEBUG: Process '/usr/bin/pigz' finished with 0
2019-08-26 21:03:48,488 DEBUG: Process '/root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gemBS_cat' finished with 0
: Creating index
2019-08-26 21:03:48,489 DEBUG: Using bundled binary : /root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gem-indexer
2019-08-26 21:03:48,489 INFO: Starting:
    /root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gem-indexer -i indexes/GRCh38_no_alt_analysis_set_GCA_000001405.15.BS_gemBS.tmp.gz -o indexes/GRCh38_no_alt_analysis_set_GCA_000001405.15.BS -b --tmp-folder indexes/ -s 4 -t 8
2019-08-26 21:03:48,489 DEBUG: Setting process log file to indexes/gem_indexer_GRCh38_no_alt_analysis_set_GCA_000001405.15.BS.gem.err
2019-08-26 21:03:48,489 DEBUG: Starting subprocess
2019-08-26 21:36:56,369 DEBUG: Process '/root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gem-indexer' finished with -9
2019-08-26 21:36:56,731 ERROR: Process '/root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gem-indexer' finished with -9
2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:03:48 -- [Inspecting MultiFASTA]
2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:04:18 --  100% ... done [29.907 s]
2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:04:18 -- Inspected text 12399884961 characters (index_complement=yes). Requesting 11825 MB (encoded text)
2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:04:18 -- [Reading MultiFASTA]
2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:04:19 --  100000000 bases parsed
----------------------------[omitted for brevity]---------------------------------
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:06 --  3100000000 bases parsed
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:06 -- Total 3144255742 bases parsed ...done [48.525 s]
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:06 -- [Generating Bisulfite-Text (C2T & G2A)]
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:17 --  100% ... done [10.268 s]
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:17 -- [Generating Text (explicit Reverse-Complement)]
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:34 --  100% ... done [17.645 s]
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:34 -- [Generating BWT Forward-Text]
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:34 -- [Building-BWT::Counting K-mers]
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:06:05 --  100% ... done [30.960 s]
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:06:05 -- [Building-BWT::Generating SA-Positions]
2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:06:06 --    1%
----------------------------[omitted for brevity]---------------------------------
2019-08-26 21:36:56,802 ERROR: 2019/8/26 21:18:41 --  100% ... done [755.087 s]
2019-08-26 21:36:56,802 ERROR: 2019/8/26 21:18:41 -- [Building-BWT::Sorting SA]
2019-08-26 21:36:56,802 ERROR: 2019/8/26 21:20:42 --    0%
2019-08-26 21:36:56,802 ERROR: 2019/8/26 21:21:12 --    1%
----------------------------[omitted for brevity]---------------------------------
2019-08-26 21:36:56,804 ERROR: 2019/8/26 21:36:50 --   50%
Traceback (most recent call last):
  File "/root/.local/bin/gemBS", line 11, in <module>
    load_entry_point('gemBS==3.2.2', 'console_scripts', 'gemBS')()
  File "/root/.local/lib/python3.5/site-packages/gemBS/commands.py", line 157, in gemBS_main
    instances[args.command].run(args)
  File "/root/.local/lib/python3.5/site-packages/gemBS/production.py", line 175, in run
    ret = index(fasta_input, index_name, extra_fasta_files=extra_fasta_files, threads=self.threads, sampling_rate=args.sampling_rate, tmpDir=os.path.dirname(index_name))
  File "/root/.local/lib/python3.5/site-packages/gemBS/__init__.py", line 590, in index
    raise ValueError("Error while executing the Bisulphite gem-indexer")
ValueError: Error while executing the Bisulphite gem-indexer

The commands I used are the following:

gemBS --loglevel debug prepare -c /cromwell_root/wgbs-pipeline/gembs_test_data/ENCSR750CPN_test_run/ENCSR750CPN_gembs_fixed.conf \
              -t /cromwell_root/wgbs-pipeline/gembs_test_data/ENCSR750CPN_test_run/ENCSR750CPN_metadata.csv \
              -o gemBS.json \
              --no-db
gemBS --loglevel debug -j gemBS.json index

The .conf file for prepare looked like this, largely copied from the example:

reference = reference/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
extra_references = reference/lambda.fa.gz
index_dir = indexes

base = .
sequence_dir = ${base}/fastq/@SAMPLE
bam_dir = ${base}/mapping/@BARCODE
bcf_dir = ${base}/calls/@BARCODE
extract_dir = ${base}/extract/@BARCODE
report_dir = ${base}/report

threads = 8
jobs = 3

[mapping]
underconversion_sequence = chrL

include IHEC_standard.conf

The resulting gemBS.json from prepare looked like this:

{
    sampleData: {
        ENCSR750CPN: {
        sample_name: "ENCSR750CPN",
        sample_barcode: "sample_ENCSR750CPN"
        }
    },
    contigs: { },
    config: {
        mapping: {
            underconversion_sequence: "chrL",
            non_stranded: "False",
            remove_individual_bams: "True"
        },
        DEFAULT: {
            bcf_dir: "./calls/@BARCODE",
            threads: "8",
            gembs_dbfile: "file:gemBS?mode=memory&cache=shared",
            bam_dir: "./mapping/@BARCODE",
            extra_references: "reference/lambda.fa.gz",
            extract_dir: "./extract/@BARCODE",
            reference: "reference/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz",
            base: ".",
            sequence_dir: "./fastq/@SAMPLE",
            nonbs_flag: false,
            index_dir: "indexes",
            jobs: "3",
            report_dir: "./report"
        },
        index: {
            sampling_rate: "4"
        },
        report: { },
        calling: {
            keep_improper_pairs: "False",
            qual_threshold: "13",
            left_trim: "5",
            keep_duplicates: "False",
            remove_individual_bcfs: "True",
            haploid: "False",
            right_trim: "0",
            omit_contigs: [
            "chrL"
            ],
            contig_pool_limit: "25000000",
            conversion: "auto",
            mapq_threshold: "10",
            reference_bias: "2"
        },
        extract: {
            strand_specific: "True",
            make_bedmethyl: "True",
            make_cpg: "True",
            make_non_cpg: "True",
            make_bigwig: "True",
            phred_threshold: "10"
        }
    }
}
heathsc commented 5 years ago

This will be a RAM problem. A human BS index for GEM is about 32GB so you will need at least that (realistically you should have at least 48GB). With GEM we traded index size for speed: the larger index is what gives the speed advantage for GEM.

Simon

On Wed, 4 Sep 2019, 19:25 Paul Sud, notifications@github.com wrote:

Hello,

I am trying to create a gemBS index for GRCh38, specifically this fasta: https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/ . However, it always fails at exactly 50%. I didn't have this issue indexing the test data from http://statgen.cnag.cat/GEMBS/UserGuide/_build/html/example.html . The configuration I used here is very similar. Unfortunately the traceback isn't very informative, the relevant logs are shown below.

I should add, I don't believe it to be a resource issue. I ran it with 8 cores, 12 GB RAM, and 200 GB disk space, and post mortem didn't reveal any concerning usage levels.

: : Command index started at 2019-08-26 21:01:56.662514 : 2019-08-26 21:01:56,689 DEBUG: Using bundled binary : /root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gemBS_cat 2019-08-26 21:01:56,690 DEBUG: Using binary from PATH: /usr/bin/pigz 2019-08-26 21:01:56,690 INFO: Starting: /root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gemBS_cat reference/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz reference/lambda.fa.gz | /usr/bin/pigz 2019-08-26 21:01:56,690 DEBUG: Starting subprocess 2019-08-26 21:01:56,693 DEBUG: File output detected, opening output stream to indexes/GRCh38_no_alt_analysis_set_GCA_000001405.15.BS_gemBS.tmp.gz 2019-08-26 21:01:56,693 DEBUG: Setting process input to parent output 2019-08-26 21:01:56,693 DEBUG: Starting subprocess 2019-08-26 21:03:48,488 DEBUG: Process '/usr/bin/pigz' finished with 0 2019-08-26 21:03:48,488 DEBUG: Process '/root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gemBS_cat' finished with 0 : Creating index 2019-08-26 21:03:48,489 DEBUG: Using bundled binary : /root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gem-indexer 2019-08-26 21:03:48,489 INFO: Starting: /root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gem-indexer -i indexes/GRCh38_no_alt_analysis_set_GCA_000001405.15.BS_gemBS.tmp.gz -o indexes/GRCh38_no_alt_analysis_set_GCA_000001405.15.BS -b --tmp-folder indexes/ -s 4 -t 8 2019-08-26 21:03:48,489 DEBUG: Setting process log file to indexes/gem_indexer_GRCh38_no_alt_analysis_set_GCA_000001405.15.BS.gem.err 2019-08-26 21:03:48,489 DEBUG: Starting subprocess 2019-08-26 21:36:56,369 DEBUG: Process '/root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gem-indexer' finished with -9 2019-08-26 21:36:56,731 ERROR: Process '/root/.local/lib/python3.5/site-packages/gemBS/gemBSbinaries/gem-indexer' finished with -9 2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:03:48 -- [Inspecting MultiFASTA] 2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:04:18 -- 100% ... done [29.907 s] 2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:04:18 -- Inspected text 12399884961 characters (index_complement=yes). Requesting 11825 MB (encoded text) 2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:04:18 -- [Reading MultiFASTA] 2019-08-26 21:36:56,798 ERROR: 2019/8/26 21:04:19 -- 100000000 bases parsed ----------------------------[omitted for brevity]--------------------------------- 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:06 -- 3100000000 bases parsed 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:06 -- Total 3144255742 bases parsed ...done [48.525 s] 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:06 -- [Generating Bisulfite-Text (C2T & G2A)] 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:17 -- 100% ... done [10.268 s] 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:17 -- [Generating Text (explicit Reverse-Complement)] 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:34 -- 100% ... done [17.645 s] 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:34 -- [Generating BWT Forward-Text] 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:05:34 -- [Building-BWT::Counting K-mers] 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:06:05 -- 100% ... done [30.960 s] 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:06:05 -- [Building-BWT::Generating SA-Positions] 2019-08-26 21:36:56,799 ERROR: 2019/8/26 21:06:06 -- 1% ----------------------------[omitted for brevity]--------------------------------- 2019-08-26 21:36:56,802 ERROR: 2019/8/26 21:18:41 -- 100% ... done [755.087 s] 2019-08-26 21:36:56,802 ERROR: 2019/8/26 21:18:41 -- [Building-BWT::Sorting SA] 2019-08-26 21:36:56,802 ERROR: 2019/8/26 21:20:42 -- 0% 2019-08-26 21:36:56,802 ERROR: 2019/8/26 21:21:12 -- 1% ----------------------------[omitted for brevity]--------------------------------- 2019-08-26 21:36:56,804 ERROR: 2019/8/26 21:36:50 -- 50% Traceback (most recent call last): File "/root/.local/bin/gemBS", line 11, in load_entry_point('gemBS==3.2.2', 'console_scripts', 'gemBS')() File "/root/.local/lib/python3.5/site-packages/gemBS/commands.py", line 157, in gemBS_main instances[args.command].run(args) File "/root/.local/lib/python3.5/site-packages/gemBS/production.py", line 175, in run ret = index(fasta_input, index_name, extra_fasta_files=extra_fasta_files, threads=self.threads, sampling_rate=args.sampling_rate, tmpDir=os.path.dirname(index_name)) File "/root/.local/lib/python3.5/site-packages/gemBS/init.py", line 590, in index raise ValueError("Error while executing the Bisulphite gem-indexer") ValueError: Error while executing the Bisulphite gem-indexer

The commands I used are the following:

gemBS --loglevel debug prepare -c /cromwell_root/wgbs-pipeline/gembs_test_data/ENCSR750CPN_test_run/ENCSR750CPN_gembs_fixed.conf \ -t /cromwell_root/wgbs-pipeline/gembs_test_data/ENCSR750CPN_test_run/ENCSR750CPN_metadata.csv \ -o gemBS.json \ --no-db gemBS --loglevel debug -j gemBS.json index

The .conf file for prepare looked like this, largely copied from the example:

reference = reference/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz extra_references = reference/lambda.fa.gz index_dir = indexes

base = . sequence_dir = ${base}/fastq/@SAMPLE bam_dir = ${base}/mapping/@BARCODE bcf_dir = ${base}/calls/@BARCODE extract_dir = ${base}/extract/@BARCODE report_dir = ${base}/report

threads = 8 jobs = 3

[mapping] underconversion_sequence = chrL

include IHEC_standard.conf

The resulting gemBS.json from prepare looked like this:

{ sampleData: { ENCSR750CPN: { sample_name: "ENCSR750CPN", sample_barcode: "sample_ENCSR750CPN" } }, contigs: { }, config: { mapping: { underconversion_sequence: "chrL", non_stranded: "False", remove_individual_bams: "True" }, DEFAULT: { bcf_dir: "./calls/@BARCODE", threads: "8", gembs_dbfile: "file:gemBS?mode=memory&cache=shared", bam_dir: "./mapping/@BARCODE", extra_references: "reference/lambda.fa.gz", extract_dir: "./extract/@BARCODE", reference: "reference/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz", base: ".", sequence_dir: "./fastq/@SAMPLE", nonbs_flag: false, index_dir: "indexes", jobs: "3", report_dir: "./report" }, index: { sampling_rate: "4" }, report: { }, calling: { keep_improper_pairs: "False", qual_threshold: "13", left_trim: "5", keep_duplicates: "False", remove_individual_bcfs: "True", haploid: "False", right_trim: "0", omit_contigs: [ "chrL" ], contig_pool_limit: "25000000", conversion: "auto", mapq_threshold: "10", reference_bias: "2" }, extract: { strand_specific: "True", make_bedmethyl: "True", make_cpg: "True", make_non_cpg: "True", make_bigwig: "True", phred_threshold: "10" } } }

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/61?email_source=notifications&email_token=AAY4654DD2VWHLT7BQQHPSDQH7VPVA5CNFSM4ITUQ3IKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HJJ72IQ, or mute the thread https://github.com/notifications/unsubscribe-auth/AAY4653KXBAJVTM5FTMERKTQH7VPVANCNFSM4ITUQ3IA .

paul-sud commented 5 years ago

Thanks for the quick response Simon. I'll try again with 48GB RAM

paul-sud commented 5 years ago

Indexing succeeded with the above amount of RAM. Thanks again!