mikecroucher opened this issue 8 years ago
Hi Mike,
What does the final ../config/tdp43_project.yaml file look like? What we usually do is download the template file with wget if you need to change something, change the template and then run the templating like:
bcbio_nextgen.py -w template changed-project-template.yaml path-to-a-csv-file-with-metadata.csv sample1.fq sample2.fq etc
The documentation has some more information.
It's because the file it's attempting to load is empty. Here's info on Sandeep's original (my copy is identical).
ls /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/Data/R1/CytTrc_Control/ -l
total 4396072
-rwxrwxrwx 1 md4zsa md 0 Jan 8 11:24 31-SHF7_141009_L004_R1.fastq.gz
-rwxrwxrwx 1 md4zsa md 2269179215 Jan 7 17:09 32-SHF8_141009_L004_R1.fastq.gz
-rwxrwxrwx 1 md4zsa md 2214725239 Jan 7 17:09 33-SHF9_141009_L004_R1.fastq.gz
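The first file in that listing is 0 bytes, which matches the empty-file explanation above. A pre-flight check along these lines (a hypothetical helper, not part of bcbio) would catch truncated or failed transfers before the pipeline starts:

```python
import os

def find_empty_inputs(paths):
    """Return the subset of input files that are zero bytes.

    A failed or truncated transfer often leaves a 0-byte fastq.gz
    behind, which downstream parsers then fail on in confusing ways.
    """
    return [p for p in paths if os.path.getsize(p) == 0]
```

Running this over the input FASTQ list before launching bcbio makes the problem obvious up front instead of surfacing as a parse error mid-run.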
Hi Mike, Rory,
I tried running the automatic config setup with the following:
bcbio_nextgen.py -w template $work_dir/tdp43_project/config/tdp43_project-template.yaml $work_dir/tdp43_project.csv ${tdp43_r1[@]} ${tdp43_r2[@]}
where tdp43_project.csv is the metadata file. This gives the IndexError below:
/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))
Traceback (most recent call last):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a//tools/bin/bcbio_nextgen.py", line 4, in <module>
__import__('pkg_resources').run_script('bcbio-nextgen==0.9.6a0', 'bcbio_nextgen.py')
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 745, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 1670, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 220, in <module>
setup_info = workflow.setup(kwargs["workflow"], kwargs.pop("inputs"))
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/__init__.py", line 12, in setup
return workflow.setup(args)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/template.py", line 391, in setup
project_name, metadata, global_vars, md_file = _pname_and_metadata(args.metadata)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/template.py", line 258, in _pname_and_metadata
md, global_vars = _parse_metadata(in_handle)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/template.py", line 229, in _parse_metadata
for sinfo in (x for x in reader if not x[0].startswith("#")):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/template.py", line 229, in <genexpr>
for sinfo in (x for x in reader if not x[0].startswith("#")):
IndexError: list index out of range
This happens even after specifying a full filename array (bash), a single element from the array, or an explicit file path.
I also get the same error as Mike does if I run the same command on the $work_dir/tdp43_project directory instead of the .csv metadata file.
Here's what tdp43_project.csv looks like:
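For context, the IndexError can be reproduced outside bcbio: Python's csv.reader yields an empty list for a blank line, so indexing row[0] fails exactly as in the traceback. A minimal sketch (the variable names are illustrative):

```python
import csv
import io

# Note the trailing blank line, as with whitespace at the end of a file.
raw = "samplename,description\nS1,Control\n\n"

# A blank line comes back from csv.reader as an empty list, so
# row[0] raises IndexError -- the same failure seen in the traceback.
rows = list(csv.reader(io.StringIO(raw)))

# Skipping empty rows before indexing avoids the crash:
clean = [row for row in rows if row and not row[0].startswith("#")]
```

The fix on the bcbio side amounts to filtering out empty rows before touching row[0].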
samplename,description,phenotype,batch
31-SHF7_141009_L004,CytTrc_Control,Control,1
32-SHF8_141009_L004,CytTrc_Control,Control,2
33-SHF9_141009_L004,CytTrc_Control,Control,3
34-SHF10_141009_L004,CytTrc_Q331K,Q331K,1
35-SHF11_141009_L004,CytTrc_Q331K,Q331K,2
36-SHF12_141009_L004,CytTrc_Q331K,Q331K,3
13-SHF13_141009_L001,GRASPS_TRL_Control,Control,1
14-SHF14_141009_L001,GRASPS_TRL_Control,Control,2
15-SHF15_141009_L001,GRASPS_TRL_Control,Control,3
22-SHF22_141009_L002,GRASPS_TRL_GFPHigh,GFPHigh,1
23-SHF23_141009_L002,GRASPS_TRL_GFPHigh,GFPHigh,2
24-SHF24_141009_L002,GRASPS_TRL_GFPHigh,GFPHigh,3
19-SHF19_141009_L001,GRASPS_TRL_GFPLow,GFPLow,1
20-SHF20_141009_L002,GRASPS_TRL_GFPLow,GFPLow,2
21-SHF21_141009_L002,GRASPS_TRL_GFPLow,GFPLow,3
16-SHF16_141009_L001,GRASPS_TRL_Q331K,Q331K,1
17-SHF17_141009_L001,GRASPS_TRL_Q331K,Q331K,2
37-SHF18_141009_L002,GRASPS_TRL_Q331K,Q331K,3
HAQ-10,Other_HAQ,HAQ,1
HAQ-11,Other_HAQ,HAQ,1
HAQ-12,Other_HAQ,HAQ,1
HAQ-13,Other_HAQ,HAQ,1
HAQ-14,Other_HAQ,HAQ,1
HAQ-15,Other_HAQ,HAQ,1
HAQ-16,Other_HAQ,HAQ,1
HAQ-1,Other_HAQ,HAQ,1
HAQ-2,Other_HAQ,HAQ,1
HAQ-3,Other_HAQ,HAQ,1
HAQ-4,Other_HAQ,HAQ,1
HAQ-5,Other_HAQ,HAQ,1
HAQ-6,Other_HAQ,HAQ,1
HAQ-7A,Other_HAQ,HAQ,1
HAQ-8,Other_HAQ,HAQ,1
HAQ-9,Other_HAQ,HAQ,1
25-SHF1_141009_L003,WCT_Control,Control,1
26-SHF2_141009_L003,WCT_Control,Control,2
27-SHF3_141009_L003,WCT_Control,Control,3
28-SHF4_141009_L003,WCT_Q331K,Q331K,1
29-SHF5_141009_L003,WCT_Q331K,Q331K,2
30-SHF6_141009_L003,WCT_Q331K,Q331K,3
Hi Mike and Sandeep,
Is there whitespace at the bottom of the file?
I think so, yes. Will that interfere with the parsing?
Indeed, that was causing the error. I removed the whitespace and reran it, only to be stumped by this one:
/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))
[2016-01-11T14:50Z] Resource requests: picard; memory: 3.50; cores: 1
[2016-01-11T14:50Z] Configuring 1 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2016-01-11T14:50Z] run local -- checkpoint passed: trimming
[2016-01-11T14:50Z] Timing: organize samples
[2016-01-11T14:50Z] multiprocessing: organize_samples
[2016-01-11T14:50Z] Using input YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-11T14:50Z] Checking sample YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
Traceback (most recent call last):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a//tools/bin/bcbio_nextgen.py", line 4, in <module>
__import__('pkg_resources').run_script('bcbio-nextgen==0.9.6a0', 'bcbio_nextgen.py')
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 745, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 1670, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 226, in <module>
main(**kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 43, in main
run_main(**kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
fc_dir, run_info_yaml)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 233, in rnaseqpipeline
samples = rnaseq_prep_samples(config, run_info_yaml, parallel, dirs, samples)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 357, in rnaseq_prep_samples
[x[0]["description"] for x in samples]]])
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
return run_multicore(fn, items, config, parallel=parallel)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 804, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 662, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 570, in _dispatch
job = ImmediateComputeBatch(batch)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 183, in __init__
self.results = batch()
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
return apply(f, *args, **kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 239, in organize_samples
return run_info.organize(*args)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 44, in organize
run_details = _run_info_from_yaml(dirs, run_info_yaml, config, sample_names)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 677, in _run_info_from_yaml
_check_sample_config(run_details, run_info_yaml, config)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 532, in _check_sample_config
_check_for_duplicates(items, "description")
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 293, in _check_for_duplicates
"Problem found in these samples: %s" % (attr, dups, descrs))
ValueError: Duplicate 'description' found in input sample configuration.
Required to be unique for a project: ['CytTrc_Control', 'CytTrc_Q331K', 'GRASPS_TRL_Control', 'GRASPS_TRL_GFPHigh', 'GRASPS_TRL_GFPLow', 'GRASPS_TRL_Q331K', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'WCT_Control', 'WCT_Q331K']
Problem found in these samples: ['CytTrc_Control', 'CytTrc_Control', 'CytTrc_Control', 'CytTrc_Q331K', 'CytTrc_Q331K', 'CytTrc_Q331K', 'GRASPS_TRL_Control', 'GRASPS_TRL_Control', 'GRASPS_TRL_Control', 'GRASPS_TRL_GFPHigh', 'GRASPS_TRL_GFPHigh', 'GRASPS_TRL_GFPHigh', 'GRASPS_TRL_GFPLow', 'GRASPS_TRL_GFPLow', 'GRASPS_TRL_GFPLow', 'GRASPS_TRL_Q331K', 'GRASPS_TRL_Q331K', 'GRASPS_TRL_Q331K', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'WCT_Control', 'WCT_Control', 'WCT_Control', 'WCT_Q331K', 'WCT_Q331K', 'WCT_Q331K']
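The uniqueness check that fails here boils down to counting descriptions; a rough equivalent (not bcbio's actual code):

```python
from collections import Counter

def find_duplicates(descriptions):
    """Return, sorted, the descriptions that occur more than once."""
    counts = Counter(descriptions)
    return sorted(d for d, n in counts.items() if n > 1)

# With replicates sharing one description, the check trips:
descriptions = ["CytTrc_Control"] * 3 + ["WCT_Q331K"] * 3 + ["Other_HAQ"]
# find_duplicates(descriptions) -> ['CytTrc_Control', 'WCT_Q331K']
```

Any description appearing more than once in the metadata CSV triggers the ValueError above, which is why replicates need either unique descriptions or merged inputs.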
So, back to basics: if I have replicates, should I be using the 1-sample-multiple-files configuration?
Hi Sandeep,
Are
31-SHF7_141009_L004,CytTrc_Control,Control,1
32-SHF8_141009_L004,CytTrc_Control,Control,2
33-SHF9_141009_L004,CytTrc_Control,Control,3
those the same sample or different samples?
http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration has some more information about what each field is. samplename is the stem of the filename and description is a human-readable description of the sample. description needs to be different if the samples are different.
If they are the same sample, using the 1-sample-multiple-files configuration will work. You can also cat the files together into one file yourself; either will work.
I fixed the whitespace issue so it will skip those lines now instead of failing.
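Catting the gzipped FASTQ files yourself, as mentioned above, is a simple byte-level append: concatenated gzip members form a valid gzip stream. A sketch (hypothetical helper name, not bcbio's code):

```python
import shutil

def cat_fastq_gz(inputs, output):
    """Concatenate gzipped FASTQ files by raw byte append.

    Concatenated gzip members are themselves a valid gzip stream,
    so no decompression round-trip is needed.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as fh:
                shutil.copyfileobj(fh, out)
```

This is equivalent to `cat a.fastq.gz b.fastq.gz > all.fastq.gz` in the shell, which is all the 1-sample-multiple-files option does behind the scenes.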
Hi Rory,
31-SHF7_141009_L004,CytTrc_Control,Control,1
32-SHF8_141009_L004,CytTrc_Control,Control,2
33-SHF9_141009_L004,CytTrc_Control,Control,3
These are replicates of each other, but by definition they are unique samples, aren't they? In either case, should I use unique batch IDs or the same one? Thanks for fixing the whitespace issue.
Hi Sandeep,
I'm not sure; it depends on what you want to do with the samples. If they are technical replicates or biological replicates that you want to analyze separately, they would need unique descriptions in order to be treated separately. I see you have them labelled as coming from three different batches, so maybe labelling them CytTrc_Control_Batch1, etc. would make sense if you want to keep the batches separate. If they are a single sample spread across multiple lanes of sequencing then they'd need to either be combined by catting the FASTQ files together or by using the 1-sample-multiple-files configuration. The 1-sample-multiple-files option just cats the files together for you; it isn't doing anything fancy.
Hi Rory,
I merged all 3 sample files into one (I did it manually), so that the metadata file looks like this:
samplename,description,phenotype,batch
all_CytTrc_Control,CytTrc_Control,Control,1
all_CytTrc_Q331K,CytTrc_Q331K,Q331K,1
all_GRASPS_TRL,GRASPS_TRL_Control,Control,1
all_GRASPS_TRL,GRASPS_TRL_GFPHigh,GFPHigh,1
all_GRASPS_TRL,GRASPS_TRL_GFPLow,GFPLow,1
all_GRASPS_TRL,GRASPS_TRL_Q331K,Q331K,1
all_WCT_Control,WCT_Control,Control,1
all_WCT_Q331K,WCT_Q331K,Q331K,1
The pipeline now starts, but returns an index-not-found error with the following message:
/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))
[2016-01-11T20:13Z] Resource requests: picard; memory: 3.50; cores: 1
[2016-01-11T20:13Z] Configuring 1 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2016-01-11T20:13Z] Timing: organize samples
[2016-01-11T20:13Z] multiprocessing: organize_samples
[2016-01-11T20:13Z] Using input YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-11T20:13Z] Checking sample YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-11T20:13Z] Testing minimum versions of installed programs
[2016-01-11T20:13Z] multiprocessing: prepare_sample
[2016-01-11T20:13Z] Preparing CytTrc_Control
[2016-01-11T20:13Z] Preparing CytTrc_Q331K
[2016-01-11T20:13Z] Preparing GRASPS_TRL_Control
[2016-01-11T20:13Z] Preparing GRASPS_TRL_GFPHigh
[2016-01-11T20:13Z] Preparing GRASPS_TRL_GFPLow
[2016-01-11T20:13Z] Preparing GRASPS_TRL_Q331K
[2016-01-11T20:13Z] Preparing WCT_Control
[2016-01-11T20:13Z] Preparing WCT_Q331K
[2016-01-11T20:13Z] Resource requests: hisat2, picard; memory: 2.00, 3.50; cores: 16, 1
[2016-01-11T20:13Z] Configuring 1 jobs to run, using 1 cores each with 8.00g of memory reserved for each job
[2016-01-11T20:13Z] Timing: alignment
[2016-01-11T20:13Z] multiprocessing: disambiguate_split
[2016-01-11T20:13Z] multiprocessing: process_alignment
[2016-01-11T20:13Z] Aligning lane 1_2016-01-11_tdp43_project with hisat2 aligner
[2016-01-11T20:13Z] Aligning /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/Data/R1/CytTrc_Control/all_CytTrc_Control_R1.gz and /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/Data/R2/CytTrc_Control/all_CytTrc_Control_R2.gz with hisat2.
[2016-01-11T20:13Z] Could not locate a HISAT2 index corresponding to basename "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2/hg38"
[2016-01-11T20:13Z] Error: Encountered internal HISAT exception (#1)
[2016-01-11T20:13Z] Command: /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/tools/bin/../../anaconda/bin/hisat2-align-s --wrapper basic-0 -x /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2/hg38 -p 1 --phred33 --rg-id 1 --rg PL:illumina --rg PU:1_2016-01-11_tdp43_project --rg SM:CytTrc_Control --known-splicesite-infile /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/rnaseq/ref-transcripts-splicesites.txt -1 /tmp/3225.inpipe1 -2 /tmp/3225.inpipe2
[2016-01-11T20:13Z] (ERR): hisat2-align exited with value 1
Is bcbio unable to find the indices for hisat2?
Hi @ssamberkar,
Sorry for the problems-- it does look like it cannot find the indices. Are they installed?
ls -l /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2
should have the indices in it.
Hi guys
How's this?
ls -l /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2
total 16532
-rw-r--r-- 1 fe1mpc app-admins 7942745 Jan 7 08:50 hg38.exons
-rw-r--r-- 1 fe1mpc app-admins 8900759 Jan 7 08:50 hg38.splicesites
Installed via the command
bcbio_nextgen.py upgrade --genomes hg38 --aligners hisat2
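The listing above contains only annotation files (hg38.exons, hg38.splicesites), not the index shards themselves. Assuming the standard HISAT2 naming (`<basename>.1.ht2` through `<basename>.8.ht2`, or `.ht2l` for large genomes), a quick sanity check might look like:

```python
import glob
import os

def hisat2_index_present(basename):
    """Check whether HISAT2 index shards exist for an index basename.

    A complete HISAT2 index is a set of files named
    <basename>.N.ht2 (or .ht2l for large genomes); the .exons and
    .splicesites annotation files alone are not an index.
    """
    return bool(glob.glob(basename + ".*.ht2") or
                glob.glob(basename + ".*.ht2l"))
```

Pointing this at the `.../genomes/Hsapiens/hg38/hisat2/hg38` basename from the error message would confirm whether the download actually produced an index.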
Hi Mike,
Strange. It looks like the hisat2 index did not get installed, could you try installing them with:
bcbio_nextgen.py upgrade --data --genomes hg38 --aligners hisat2
Done with the script:
#!/bin/bash
#$ -l rmem=32G -l mem=32G
#$ -P radiant
module load apps/gcc/5.2/bcbio/0.9.6a
bcbio_nextgen.py upgrade --data --genomes hg38 --aligners hisat2
stderr:
INFO: Reading default fabricrc.txt
DBG [config.py]: Using config file /data/fe1mpc/tmpbcbio-install/cloudbiolinux/cloudbio/../config/fabricrc.txt
INFO: Distribution __auto__
INFO: Get local environment
INFO: ScientificLinux setup
DBG [distribution.py]: NixPkgs: Ignored
INFO: Now, testing connection to host...
INFO: Connection to host appears to work!
DBG [utils.py]: Expand paths
INFO: List of genomes to get (from the config file at '{'install_liftover': False, 'genome_indexes': ['hisat2', 'bwa', 'bowtie2', 'rtg'], 'genomes': [{'rnaseq': True, 'validation': ['platinum-genome-NA12878', 'giab-NA12878-remap', 'giab-NA12878-crossmap', 'dream-syn4-crossmap', 'dream-syn3-crossmap'], 'name': 'Human (hg38) full', 'dbkey': 'hg38', 'annotations': ['dbsnp', 'clinvar', 'mills_indels', '1000g_snps', '1000g_indels', '1000g_omni_snps', 'hapmap_snps', 'coverage', 'prioritize']}], 'install_uniref': False}'): Human (hg38) full
stdout:
Upgrading bcbio-nextgen data files
Setting up virtual machine
[localhost] local: echo $HOME
[localhost] local: uname -m
bcbio-nextgen data upgrade complete.
Upgrade completed successfully.
ls -l /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2
total 16532
-rw-r--r-- 1 fe1mpc app-admins 7942745 Jan 12 07:10 hg38.exons
-rw-r--r-- 1 fe1mpc app-admins 8900759 Jan 12 07:10 hg38.splicesites
Same two files but with a new date
Hi Mike, Rory
This may sound stupid, but what if we place symlinks to hg38 indices created or downloaded outside of bcbio at this location? Intuitively, if it is "only" a case of missing files, hisat2 should simply pick up files matching the index basename at this location, shouldn't it?
Although that might solve our immediate issue, I feel that it's papering over the cracks.
I firmly believe that we should do this via bcbio's mechanisms. That's the best long term outcome.
I agree. If we have 3-4 different aligners and bcbio throws the same error for each of them, copying those files every time is an extra circus.
Hi Mike and Sandeep,
Sorry, could you remove the partially-installed hisat2 directory and retry the install of the hisat2 index? All this step is doing is downloading a file and unzipping it into that directory.
Best,
Rory
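As a sketch of what that install step amounts to, assuming the downloaded archive is a plain tarball (e.g. a *.tar.xz), the unpack side is nothing fancier than an extract into the genome directory:

```python
import tarfile

def unpack_index(archive, dest_dir):
    """Unpack a downloaded index tarball into dest_dir.

    tarfile auto-detects the compression (gz, bz2, xz) in "r" mode,
    so the same call handles *.tar.xz archives.
    """
    with tarfile.open(archive) as tar:
        tar.extractall(dest_dir)
```

If that directory ends up half-populated after a failed download, removing it and retrying (as Rory suggests) is the clean way to recover.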
Hi Rory,
If this fails, will the workaround I posted above work? hisat2 already provides precompiled indices, and if we place them outside the bcbio install, we won't need to change them for quite some time unless need be. As much as we should avoid such practice, it might at least get bcbio running.
Is bcbio building the index during install, or is it just downloading the precompiled ones from the hisat2 page? Thanks, Sandeep
Hi Sandeep,
It is downloading an index, but not the one from the hisat2 site; it is downloading indices that we built. It is better to use the bcbio-nextgen installed indices so we know that the annotation matches up with the index.
For doing gene expression runs, using GRCh37 is fine, the hg38 support is pretty new and we haven't fully validated it.
So in short, if I use GRCh37 and hg38, the analysis should proceed normally? I believe GRCh37 is already in place, @mikecroucher could you please confirm?
Hi Sandeep,
You'll need to pick one of the genomes to use for the project, either GRCh37 or hg38, both won't work.
I can install them both though, right? The bioinformatics peeps can then choose.
Yup-- you can install however many genomes you want and then choose which one to use by editing the template file.
Hi Rory,
I want to use hg38, but I didn't yet know whether it had been validated. I'll of course choose only one of them. ;-) Good to know this before kicking anything off. I saw hg19 and hg38 folders; GRCh37 still needs to be installed.
We have done much more work using hg38 for variant calling where it seems to be good to use, but there is a similar problem where some of the downstream tools don't fully support hg38 yet.
I didn't install GRCh37 because it wasn't asked for. I'll get it added.
Rory...could you help me with the exact upgrade command to use please?
bcbio_nextgen.py upgrade --data --genomes GRCh37 --aligners bwa --aligners star
should get you going.
Right, it seems strange that it hasn't gone mainstream yet; it is already two years old. @mikecroucher -- Sorry about the confusion Mike, I was under the impression that hg38 availability had been validated. Since Rory has confirmed otherwise, we will wait for their nod.
Hi Rory, We now have GRCh37 installed, but bcbio throws a new error:
/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))
[2016-01-12T19:28Z] Resource requests: picard; memory: 3.50; cores: 1
[2016-01-12T19:28Z] Configuring 8 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2016-01-12T19:28Z] run local -- checkpoint passed: trimming
[2016-01-12T19:28Z] Timing: organize samples
[2016-01-12T19:28Z] multiprocessing: organize_samples
[2016-01-12T19:28Z] Using input YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-12T19:28Z] Checking sample YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-12T19:28Z] Downloading GRCh37 samtools from AWS
Traceback (most recent call last):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a//tools/bin/bcbio_nextgen.py", line 4, in <module>
__import__('pkg_resources').run_script('bcbio-nextgen==0.9.6a0', 'bcbio_nextgen.py')
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 745, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 1670, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 226, in <module>
main(**kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 43, in main
run_main(**kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
fc_dir, run_info_yaml)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 233, in rnaseqpipeline
samples = rnaseq_prep_samples(config, run_info_yaml, parallel, dirs, samples)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 357, in rnaseq_prep_samples
[x[0]["description"] for x in samples]]])
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
return run_multicore(fn, items, config, parallel=parallel)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 804, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 662, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 570, in _dispatch
job = ImmediateComputeBatch(batch)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 183, in __init__
self.results = batch()
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
return apply(f, *args, **kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 239, in organize_samples
return run_info.organize(*args)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 58, in organize
item = add_reference_resources(item)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 145, in add_reference_resources
data["reference"] = genome.get_refs(data["genome_build"], aligner, data["dirs"]["galaxy"], data)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/genome.py", line 199, in get_refs
galaxy_config, data)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/genome.py", line 151, in _get_ref_from_galaxy_loc
cur_ref = download_prepped_genome(genome_build, data, name, need_remap)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/genome.py", line 261, in download_prepped_genome
raise ValueError("Could not find reference genome file %s %s" % (genome_build, name))
ValueError: Could not find reference genome file GRCh37 samtools
Suspect it is now searching for GRCh37.fasta?
Steady on there, Sandeep... it's still building.
Ouch, sorry. I just saw that there were a GRCh37 folder and files, and I kicked it off.
No worries. Stuck on STAR for the last 35 minutes or so...still going:
Resolving s3.amazonaws.com... 54.231.15.16
Connecting to s3.amazonaws.com|54.231.15.16|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-01-12 19:02:15 ERROR 404: Not Found.
Warning: local() encountered an error (return code 8) while executing 'wget --continue --no-check-certificate -O GRCh37-star.tar.xz 'https://s3.amazonaws.com/biodata/genomes/GRCh37-star.tar.xz''
INFO: Genome preparation method s3 failed, trying next
INFO: Preparing genome GRCh37 with index star
It should take less than a couple hours to do the STAR index. If it is still stalling out we can open up an issue on the rna-star repository (https://github.com/alexdobin/STAR) and try to figure out what is going on with Alex. We'll need the Log.out file that STAR generates and some information about your system to figure out what is going on.
Yep. Stuck on STAR.
Out.log sent to Rory and Brad.
Thanks to Rory and Brad, I think we may be in business. The modifications I've made to the install are reflected in
http://rcg.group.shef.ac.uk/iceberg/software/apps/bcbio.html
Sandeep - could you have a try please?
Surely Mike, I just fired off our first analysis after a quick sanity check. Hopefully it will run through just fine. Fingers crossed. One parting query for Rory: since you mentioned hisat2 will soft-clip adapters, should the cutadapt step be excluded from the template for future runs? Will running cutadapt with hisat2 impact the results for the worse? Thanks all for the teamwork. Big thumbs up!!
Hi Sandeep,
That's right-- there is not much need to clip adapters with any aligner other than Tophat2 if you are just looking at gene expression. Clipping can help with transcriptome assembly, so if you are doing that it might help to switch it on.
The current script craps out on Sandeep's example dataset