mikecroucher opened this issue 8 years ago
Hi Mike,
What does the final ../config/tdp43_project.yaml file look like? What we usually do is download the template file with wget if you need to change something, change the template and then run the templating like:
bcbio_nextgen.py -w template changed-project-template.yaml path-to-a-csv-file-with-metadata.csv sample1.fq sample2.fq etc
The documentation has some more information.
It's because the file it's attempting to load is empty. Here's info on Sandeep's original (my copy is identical).
ls /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/Data/R1/CytTrc_Control/ -l
total 4396072
-rwxrwxrwx 1 md4zsa md 0 Jan 8 11:24 31-SHF7_141009_L004_R1.fastq.gz
-rwxrwxrwx 1 md4zsa md 2269179215 Jan 7 17:09 32-SHF8_141009_L004_R1.fastq.gz
-rwxrwxrwx 1 md4zsa md 2214725239 Jan 7 17:09 33-SHF9_141009_L004_R1.fastq.gz
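The first file in that listing is 0 bytes, which matches the empty-file explanation above. A pre-flight check along these lines (a hypothetical helper, not part of bcbio) would catch truncated or failed transfers before the pipeline starts:

```python
import os

def find_empty_inputs(paths):
    """Return the subset of input files that are zero bytes.

    A failed or truncated transfer often leaves a 0-byte fastq.gz
    behind, which downstream parsers then fail on in confusing ways.
    """
    return [p for p in paths if os.path.getsize(p) == 0]
```

Running this over the input FASTQ list before launching bcbio makes the problem obvious up front instead of surfacing as a parse error mid-run.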
Hi Mike, Rory,
I tried running the automatic config setup with the following:
bcbio_nextgen.py -w template $work_dir/tdp43_project/config/tdp43_project-template.yaml $work_dir/tdp43_project.csv ${tdp43_r1[@]} ${tdp43_r2[@]}
where tdp43_project.csv is the metadata file. This gives the IndexError below:
/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))
Traceback (most recent call last):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a//tools/bin/bcbio_nextgen.py", line 4, in <module>
__import__('pkg_resources').run_script('bcbio-nextgen==0.9.6a0', 'bcbio_nextgen.py')
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 745, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 1670, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 220, in <module>
setup_info = workflow.setup(kwargs["workflow"], kwargs.pop("inputs"))
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/__init__.py", line 12, in setup
return workflow.setup(args)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/template.py", line 391, in setup
project_name, metadata, global_vars, md_file = _pname_and_metadata(args.metadata)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/template.py", line 258, in _pname_and_metadata
md, global_vars = _parse_metadata(in_handle)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/template.py", line 229, in _parse_metadata
for sinfo in (x for x in reader if not x[0].startswith("#")):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/workflow/template.py", line 229, in <genexpr>
for sinfo in (x for x in reader if not x[0].startswith("#")):
IndexError: list index out of range
This happens even after specifying a full filename array (bash), a single element from the array, or an explicit file path.
I also get the same error as Mike does if I run the same command on the $work_dir/tdp43_project directory instead of the .csv metadata file.
Here's what tdp43_project.csv looks like:
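For context, the IndexError can be reproduced outside bcbio: Python's csv.reader yields an empty list for a blank line, so indexing row[0] fails exactly as in the traceback. A minimal sketch (the variable names are illustrative):

```python
import csv
import io

# Note the trailing blank line, as with whitespace at the end of a file.
raw = "samplename,description\nS1,Control\n\n"

# A blank line comes back from csv.reader as an empty list, so
# row[0] raises IndexError -- the same failure seen in the traceback.
rows = list(csv.reader(io.StringIO(raw)))

# Skipping empty rows before indexing avoids the crash:
clean = [row for row in rows if row and not row[0].startswith("#")]
```

The fix on the bcbio side amounts to filtering out empty rows before touching row[0].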
samplename,description,phenotype,batch
31-SHF7_141009_L004,CytTrc_Control,Control,1
32-SHF8_141009_L004,CytTrc_Control,Control,2
33-SHF9_141009_L004,CytTrc_Control,Control,3
34-SHF10_141009_L004,CytTrc_Q331K,Q331K,1
35-SHF11_141009_L004,CytTrc_Q331K,Q331K,2
36-SHF12_141009_L004,CytTrc_Q331K,Q331K,3
13-SHF13_141009_L001,GRASPS_TRL_Control,Control,1
14-SHF14_141009_L001,GRASPS_TRL_Control,Control,2
15-SHF15_141009_L001,GRASPS_TRL_Control,Control,3
22-SHF22_141009_L002,GRASPS_TRL_GFPHigh,GFPHigh,1
23-SHF23_141009_L002,GRASPS_TRL_GFPHigh,GFPHigh,2
24-SHF24_141009_L002,GRASPS_TRL_GFPHigh,GFPHigh,3
19-SHF19_141009_L001,GRASPS_TRL_GFPLow,GFPLow,1
20-SHF20_141009_L002,GRASPS_TRL_GFPLow,GFPLow,2
21-SHF21_141009_L002,GRASPS_TRL_GFPLow,GFPLow,3
16-SHF16_141009_L001,GRASPS_TRL_Q331K,Q331K,1
17-SHF17_141009_L001,GRASPS_TRL_Q331K,Q331K,2
37-SHF18_141009_L002,GRASPS_TRL_Q331K,Q331K,3
HAQ-10,Other_HAQ,HAQ,1
HAQ-11,Other_HAQ,HAQ,1
HAQ-12,Other_HAQ,HAQ,1
HAQ-13,Other_HAQ,HAQ,1
HAQ-14,Other_HAQ,HAQ,1
HAQ-15,Other_HAQ,HAQ,1
HAQ-16,Other_HAQ,HAQ,1
HAQ-1,Other_HAQ,HAQ,1
HAQ-2,Other_HAQ,HAQ,1
HAQ-3,Other_HAQ,HAQ,1
HAQ-4,Other_HAQ,HAQ,1
HAQ-5,Other_HAQ,HAQ,1
HAQ-6,Other_HAQ,HAQ,1
HAQ-7A,Other_HAQ,HAQ,1
HAQ-8,Other_HAQ,HAQ,1
HAQ-9,Other_HAQ,HAQ,1
25-SHF1_141009_L003,WCT_Control,Control,1
26-SHF2_141009_L003,WCT_Control,Control,2
27-SHF3_141009_L003,WCT_Control,Control,3
28-SHF4_141009_L003,WCT_Q331K,Q331K,1
29-SHF5_141009_L003,WCT_Q331K,Q331K,2
30-SHF6_141009_L003,WCT_Q331K,Q331K,3
Hi Mike and Sandeep,
Is there whitespace at the bottom of the file?
I think so, yes. Will that interfere with the parsing?
Indeed, that was causing the error. I removed the whitespace and reran it, only to be stumped by this one:
/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))
[2016-01-11T14:50Z] Resource requests: picard; memory: 3.50; cores: 1
[2016-01-11T14:50Z] Configuring 1 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2016-01-11T14:50Z] run local -- checkpoint passed: trimming
[2016-01-11T14:50Z] Timing: organize samples
[2016-01-11T14:50Z] multiprocessing: organize_samples
[2016-01-11T14:50Z] Using input YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-11T14:50Z] Checking sample YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
Traceback (most recent call last):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a//tools/bin/bcbio_nextgen.py", line 4, in <module>
__import__('pkg_resources').run_script('bcbio-nextgen==0.9.6a0', 'bcbio_nextgen.py')
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 745, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 1670, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 226, in <module>
main(**kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 43, in main
run_main(**kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
fc_dir, run_info_yaml)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 233, in rnaseqpipeline
samples = rnaseq_prep_samples(config, run_info_yaml, parallel, dirs, samples)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 357, in rnaseq_prep_samples
[x[0]["description"] for x in samples]]])
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
return run_multicore(fn, items, config, parallel=parallel)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 804, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 662, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 570, in _dispatch
job = ImmediateComputeBatch(batch)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 183, in __init__
self.results = batch()
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
return apply(f, *args, **kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 239, in organize_samples
return run_info.organize(*args)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 44, in organize
run_details = _run_info_from_yaml(dirs, run_info_yaml, config, sample_names)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 677, in _run_info_from_yaml
_check_sample_config(run_details, run_info_yaml, config)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 532, in _check_sample_config
_check_for_duplicates(items, "description")
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 293, in _check_for_duplicates
"Problem found in these samples: %s" % (attr, dups, descrs))
ValueError: Duplicate 'description' found in input sample configuration.
Required to be unique for a project: ['CytTrc_Control', 'CytTrc_Q331K', 'GRASPS_TRL_Control', 'GRASPS_TRL_GFPHigh', 'GRASPS_TRL_GFPLow', 'GRASPS_TRL_Q331K', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'WCT_Control', 'WCT_Q331K']
Problem found in these samples: ['CytTrc_Control', 'CytTrc_Control', 'CytTrc_Control', 'CytTrc_Q331K', 'CytTrc_Q331K', 'CytTrc_Q331K', 'GRASPS_TRL_Control', 'GRASPS_TRL_Control', 'GRASPS_TRL_Control', 'GRASPS_TRL_GFPHigh', 'GRASPS_TRL_GFPHigh', 'GRASPS_TRL_GFPHigh', 'GRASPS_TRL_GFPLow', 'GRASPS_TRL_GFPLow', 'GRASPS_TRL_GFPLow', 'GRASPS_TRL_Q331K', 'GRASPS_TRL_Q331K', 'GRASPS_TRL_Q331K', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'Other_HAQ', 'WCT_Control', 'WCT_Control', 'WCT_Control', 'WCT_Q331K', 'WCT_Q331K', 'WCT_Q331K']
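The uniqueness check that fails here boils down to counting descriptions; a rough equivalent (not bcbio's actual code):

```python
from collections import Counter

def find_duplicates(descriptions):
    """Return, sorted, the descriptions that occur more than once."""
    counts = Counter(descriptions)
    return sorted(d for d, n in counts.items() if n > 1)

# With replicates sharing one description, the check trips:
descriptions = ["CytTrc_Control"] * 3 + ["WCT_Q331K"] * 3 + ["Other_HAQ"]
# find_duplicates(descriptions) -> ['CytTrc_Control', 'WCT_Q331K']
```

Any description appearing more than once in the metadata CSV triggers the ValueError above, which is why replicates need either unique descriptions or merged inputs.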
So, back to basics: if I have replicates, should I be using the 1-sample-multiple-files configuration?
Hi Sandeep,
Are
31-SHF7_141009_L004,CytTrc_Control,Control,1
32-SHF8_141009_L004,CytTrc_Control,Control,2
33-SHF9_141009_L004,CytTrc_Control,Control,3
those the same sample or different samples?
http://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration has some more information about what each field is. samplename is the stem of the filename and description is a human-readable description of the sample. description needs to be different if the samples are different.
If they are the same sample, using the 1-sample-multiple-files configuration will work. You can also cat the files together into one file yourself; either will work.
I fixed the whitespace issue so it will skip those lines now instead of failing.
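Catting the gzipped FASTQ files yourself, as mentioned above, is a simple byte-level append: concatenated gzip members form a valid gzip stream. A sketch (hypothetical helper name, not bcbio's code):

```python
import shutil

def cat_fastq_gz(inputs, output):
    """Concatenate gzipped FASTQ files by raw byte append.

    Concatenated gzip members are themselves a valid gzip stream,
    so no decompression round-trip is needed.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as fh:
                shutil.copyfileobj(fh, out)
```

This is equivalent to `cat a.fastq.gz b.fastq.gz > all.fastq.gz` in the shell, which is all the 1-sample-multiple-files option does behind the scenes.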
Hi Rory,
31-SHF7_141009_L004,CytTrc_Control,Control,1
32-SHF8_141009_L004,CytTrc_Control,Control,2
33-SHF9_141009_L004,CytTrc_Control,Control,3
These are replicates of each other, but by definition they are unique samples, aren't they? In either case, should I use unique batch IDs or the same one? Thanks for fixing the whitespace issue.
Hi Sandeep,
I'm not sure; it depends on what you want to do with the samples. If they are technical replicates or biological replicates that you want to analyze separately, they would need unique descriptions in order to be treated separately. I see you have them labelled as coming from three different batches, so maybe labelling them CytTrc_Control_Batch1, etc. would make sense if you want to keep the batches separate. If they are a single sample spread across multiple lanes of sequencing then they'd need to either be combined by catting the FASTQ files together or by using the 1-sample-multiple-files configuration. The 1-sample-multiple-files option just cats the files together for you; it isn't doing anything fancy.
Hi Rory,
I merged all 3 sample files into one (I did it manually), so that the metadata file looks like this:
samplename,description,phenotype,batch
all_CytTrc_Control,CytTrc_Control,Control,1
all_CytTrc_Q331K,CytTrc_Q331K,Q331K,1
all_GRASPS_TRL,GRASPS_TRL_Control,Control,1
all_GRASPS_TRL,GRASPS_TRL_GFPHigh,GFPHigh,1
all_GRASPS_TRL,GRASPS_TRL_GFPLow,GFPLow,1
all_GRASPS_TRL,GRASPS_TRL_Q331K,Q331K,1
all_WCT_Control,WCT_Control,Control,1
all_WCT_Q331K,WCT_Q331K,Q331K,1
The pipeline now starts, but returns an index-not-found error with the following message:
/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))
[2016-01-11T20:13Z] Resource requests: picard; memory: 3.50; cores: 1
[2016-01-11T20:13Z] Configuring 1 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2016-01-11T20:13Z] Timing: organize samples
[2016-01-11T20:13Z] multiprocessing: organize_samples
[2016-01-11T20:13Z] Using input YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-11T20:13Z] Checking sample YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-11T20:13Z] Testing minimum versions of installed programs
[2016-01-11T20:13Z] multiprocessing: prepare_sample
[2016-01-11T20:13Z] Preparing CytTrc_Control
[2016-01-11T20:13Z] Preparing CytTrc_Q331K
[2016-01-11T20:13Z] Preparing GRASPS_TRL_Control
[2016-01-11T20:13Z] Preparing GRASPS_TRL_GFPHigh
[2016-01-11T20:13Z] Preparing GRASPS_TRL_GFPLow
[2016-01-11T20:13Z] Preparing GRASPS_TRL_Q331K
[2016-01-11T20:13Z] Preparing WCT_Control
[2016-01-11T20:13Z] Preparing WCT_Q331K
[2016-01-11T20:13Z] Resource requests: hisat2, picard; memory: 2.00, 3.50; cores: 16, 1
[2016-01-11T20:13Z] Configuring 1 jobs to run, using 1 cores each with 8.00g of memory reserved for each job
[2016-01-11T20:13Z] Timing: alignment
[2016-01-11T20:13Z] multiprocessing: disambiguate_split
[2016-01-11T20:13Z] multiprocessing: process_alignment
[2016-01-11T20:13Z] Aligning lane 1_2016-01-11_tdp43_project with hisat2 aligner
[2016-01-11T20:13Z] Aligning /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/Data/R1/CytTrc_Control/all_CytTrc_Control_R1.gz and /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/Data/R2/CytTrc_Control/all_CytTrc_Control_R2.gz with hisat2.
[2016-01-11T20:13Z] Could not locate a HISAT2 index corresponding to basename "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2/hg38"
[2016-01-11T20:13Z] Error: Encountered internal HISAT exception (#1)
[2016-01-11T20:13Z] Command: /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/tools/bin/../../anaconda/bin/hisat2-align-s --wrapper basic-0 -x /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2/hg38 -p 1 --phred33 --rg-id 1 --rg PL:illumina --rg PU:1_2016-01-11_tdp43_project --rg SM:CytTrc_Control --known-splicesite-infile /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/rnaseq/ref-transcripts-splicesites.txt -1 /tmp/3225.inpipe1 -2 /tmp/3225.inpipe2
[2016-01-11T20:13Z] (ERR): hisat2-align exited with value 1
Is bcbio unable to find the indices for hisat2?
Hi @ssamberkar,
Sorry for the problems-- it does look like it cannot find the indices. Are they installed?
ls -l /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2
should have the indices in it.
Hi guys
How's this?
ls -l /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2
total 16532
-rw-r--r-- 1 fe1mpc app-admins 7942745 Jan 7 08:50 hg38.exons
-rw-r--r-- 1 fe1mpc app-admins 8900759 Jan 7 08:50 hg38.splicesites
Installed via the command
bcbio_nextgen.py upgrade --genomes hg38 --aligners hisat2
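The listing above contains only annotation files (hg38.exons, hg38.splicesites), not the index shards themselves. Assuming the standard HISAT2 naming (`<basename>.1.ht2` through `<basename>.8.ht2`, or `.ht2l` for large genomes), a quick sanity check might look like:

```python
import glob
import os

def hisat2_index_present(basename):
    """Check whether HISAT2 index shards exist for an index basename.

    A complete HISAT2 index is a set of files named
    <basename>.N.ht2 (or .ht2l for large genomes); the .exons and
    .splicesites annotation files alone are not an index.
    """
    return bool(glob.glob(basename + ".*.ht2") or
                glob.glob(basename + ".*.ht2l"))
```

Pointing this at the `.../genomes/Hsapiens/hg38/hisat2/hg38` basename from the error message would confirm whether the download actually produced an index.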
Hi Mike,
Strange. It looks like the hisat2 index did not get installed, could you try installing them with:
bcbio_nextgen.py upgrade --data --genomes hg38 --aligners hisat2
Done with the script:
#!/bin/bash
#$ -l rmem=32G -l mem=32G
#$ -P radiant
module load apps/gcc/5.2/bcbio/0.9.6a
bcbio_nextgen.py upgrade --data --genomes hg38 --aligners hisat2
stderr:
INFO: Reading default fabricrc.txt
DBG [config.py]: Using config file /data/fe1mpc/tmpbcbio-install/cloudbiolinux/cloudbio/../config/fabricrc.txt
INFO: Distribution __auto__
INFO: Get local environment
INFO: ScientificLinux setup
DBG [distribution.py]: NixPkgs: Ignored
INFO: Now, testing connection to host...
INFO: Connection to host appears to work!
DBG [utils.py]: Expand paths
INFO: List of genomes to get (from the config file at '{'install_liftover': False, 'genome_indexes': ['hisat2', 'bwa', 'bowtie2', 'rtg'], 'genomes': [{'rnaseq': True, 'validation': ['platinum-genome-NA12878', 'giab-NA12878-remap', 'giab-NA12878-crossmap', 'dream-syn4-crossmap', 'dream-syn3-crossmap'], 'name': 'Human (hg38) full', 'dbkey': 'hg38', 'annotations': ['dbsnp', 'clinvar', 'mills_indels', '1000g_snps', '1000g_indels', '1000g_omni_snps', 'hapmap_snps', 'coverage', 'prioritize']}], 'install_uniref': False}'): Human (hg38) full
stdout:
Upgrading bcbio-nextgen data files
Setting up virtual machine
[localhost] local: echo $HOME
[localhost] local: uname -m
bcbio-nextgen data upgrade complete.
Upgrade completed successfully.
ls -l /usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/genomes/Hsapiens/hg38/hisat2
total 16532
-rw-r--r-- 1 fe1mpc app-admins 7942745 Jan 12 07:10 hg38.exons
-rw-r--r-- 1 fe1mpc app-admins 8900759 Jan 12 07:10 hg38.splicesites
Same two files but with a new date
Hi Mike, Rory
This may sound stupid, but what if we place symlinks to hg38 indices created or downloaded outside of bcbio at this location? Intuitively, if it is "only" a case of missing files, hisat2 should simply pick up files matching the index basename at this location, shouldn't it?
Although that might solve our immediate issue, I feel that it's papering over the cracks.
I firmly believe that we should do this via bcbio's mechanisms. That's the best long term outcome.
I agree. If we have 3-4 different aligners and bcbio throws the same error for each of them, copying those files every time is an extra circus.
Hi Mike and Sandeep,
Sorry, could you remove the partially-installed hisat2 directory and retry the install of the hisat2 index? All this step is doing is downloading a file and unzipping it into that directory.
Best,
Rory
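As a sketch of what that install step amounts to, assuming the downloaded archive is a plain tarball (e.g. a *.tar.xz), the unpack side is nothing fancier than an extract into the genome directory:

```python
import tarfile

def unpack_index(archive, dest_dir):
    """Unpack a downloaded index tarball into dest_dir.

    tarfile auto-detects the compression (gz, bz2, xz) in "r" mode,
    so the same call handles *.tar.xz archives.
    """
    with tarfile.open(archive) as tar:
        tar.extractall(dest_dir)
```

If that directory ends up half-populated after a failed download, removing it and retrying (as Rory suggests) is the clean way to recover.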
Hi Rory,
If this fails, will the workaround I posted above work? hisat2 already provides precompiled indices, and if we place them outside the bcbio install, we won't need to change them for quite some time unless need be. As much as we should avoid such practice, it might at least get bcbio running.
Is bcbio building the index during install, or is it just downloading the precompiled ones from the hisat2 page? Thanks, Sandeep
Hi Sandeep,
It is downloading an index, but not the one from the hisat2 site; it is downloading indices that we built. It is better to use the bcbio-nextgen installed indices so we know that the annotation matches up with the index.
For doing gene expression runs, using GRCh37 is fine, the hg38 support is pretty new and we haven't fully validated it.
So in short, if I use GRCh37 and hg38, the analysis should proceed normally? I believe GRCh37 is already in place, @mikecroucher could you please confirm?
Hi Sandeep,
You'll need to pick one of the genomes to use for the project, either GRCh37 or hg38, both won't work.
I can install them both though, right? The bioinformatics peeps can then choose.
Yup-- you can install however many genomes you want and then choose which one to use by editing the template file.
Hi Rory,
I want to use hg38, but I didn't yet know whether it had been validated. I'll of course choose only one of them. ;-) Good to know this before kicking anything off. I saw hg19 and hg38 folders; GRCh37 still needs to be installed.
We have done much more work using hg38 for variant calling where it seems to be good to use, but there is a similar problem where some of the downstream tools don't fully support hg38 yet.
I didn't install GRCh37 because it wasn't asked for. I'll get it added.
Rory...could you help me with the exact upgrade command to use please?
bcbio_nextgen.py upgrade --data --genomes GRCh37 --aligners bwa --aligners star
should get you going.
Right, it seems strange that it hasn't gone mainstream yet; it is already two years old. @mikecroucher -- Sorry about the confusion Mike, I was under the impression that hg38 availability had been validated. Since Rory has confirmed otherwise, we will wait for their nod.
Hi Rory, We now have GRCh37 installed, but bcbio throws a new error:
/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))
[2016-01-12T19:28Z] Resource requests: picard; memory: 3.50; cores: 1
[2016-01-12T19:28Z] Configuring 8 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2016-01-12T19:28Z] run local -- checkpoint passed: trimming
[2016-01-12T19:28Z] Timing: organize samples
[2016-01-12T19:28Z] multiprocessing: organize_samples
[2016-01-12T19:28Z] Using input YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-12T19:28Z] Checking sample YAML configuration: /shared/hidelab2/user/md4zsa/Work/TDP_Omics_Study/tdp43_project/config/tdp43_project.yaml
[2016-01-12T19:28Z] Downloading GRCh37 samtools from AWS
Traceback (most recent call last):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a//tools/bin/bcbio_nextgen.py", line 4, in <module>
__import__('pkg_resources').run_script('bcbio-nextgen==0.9.6a0', 'bcbio_nextgen.py')
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 745, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/setuptools-19.1.1-py2.7.egg/pkg_resources/__init__.py", line 1670, in run_script
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 226, in <module>
main(**kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio_nextgen-0.9.6a0-py2.7.egg-info/scripts/bcbio_nextgen.py", line 43, in main
run_main(**kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 39, in run_main
fc_dir, run_info_yaml)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 233, in rnaseqpipeline
samples = rnaseq_prep_samples(config, run_info_yaml, parallel, dirs, samples)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 357, in rnaseq_prep_samples
[x[0]["description"] for x in samples]]])
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
return run_multicore(fn, items, config, parallel=parallel)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
for data in joblib.Parallel(parallel["num_jobs"])(joblib.delayed(fn)(x) for x in items):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 804, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 662, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 570, in _dispatch
job = ImmediateComputeBatch(batch)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 183, in __init__
self.results = batch()
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 51, in wrapper
return apply(f, *args, **kwargs)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 239, in organize_samples
return run_info.organize(*args)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 58, in organize
item = add_reference_resources(item)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 145, in add_reference_resources
data["reference"] = genome.get_refs(data["genome_build"], aligner, data["dirs"]["galaxy"], data)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/genome.py", line 199, in get_refs
galaxy_config, data)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/genome.py", line 151, in _get_ref_from_galaxy_loc
cur_ref = download_prepped_genome(genome_build, data, name, need_remap)
File "/usr/local/packages6/apps/gcc/5.2/bcbio/0.9.6a/anaconda/lib/python2.7/site-packages/bcbio/pipeline/genome.py", line 261, in download_prepped_genome
raise ValueError("Could not find reference genome file %s %s" % (genome_build, name))
ValueError: Could not find reference genome file GRCh37 samtools
Suspect it is now searching for GRCh37.fasta?
Steady on there, Sandeep... it's still building.
Ouch, sorry. I just saw that there were a GRCh37 folder and files, and I kicked it off.
No worries. Stuck on STAR for the last 35 minutes or so...still going:
Resolving s3.amazonaws.com... 54.231.15.16
Connecting to s3.amazonaws.com|54.231.15.16|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-01-12 19:02:15 ERROR 404: Not Found.
Warning: local() encountered an error (return code 8) while executing 'wget --continue --no-check-certificate -O GRCh37-star.tar.xz 'https://s3.amazonaws.com/biodata/genomes/GRCh37-star.tar.xz''
INFO: Genome preparation method s3 failed, trying next
INFO: Preparing genome GRCh37 with index star
It should take less than a couple hours to do the STAR index. If it is still stalling out we can open up an issue on the rna-star repository (https://github.com/alexdobin/STAR) and try to figure out what is going on with Alex. We'll need the Log.out file that STAR generates and some information about your system to figure out what is going on.
Yep. Stuck on STAR.
Out.log sent to Rory and Brad.
Thanks to Rory and Brad, I think we may be in business. The modifications I've made to the install are reflected in
http://rcg.group.shef.ac.uk/iceberg/software/apps/bcbio.html
Sandeep - could you have a try please?
Surely Mike, I just fired off our first analysis after a quick sanity check. Hopefully it will run through just fine. Fingers crossed. One parting query for Rory: since you mentioned hisat2 will soft-clip adapters, should the cutadapt step be excluded from the template for future runs? Will running cutadapt with hisat2 impact the results for the worse? Thanks all for the teamwork. Big thumbs up!!
Hi Sandeep,
That's right-- there is not much need to clip adapters with any aligner other than Tophat2 if you are just looking at gene expression. Clipping can help with transcriptome assembly, so if you are doing that it might help to switch it on.
The current script craps out on Sandeep's example dataset