uec / Issue.Tracker

Automatically exported from code.google.com/p/usc-epigenome-center

Merge TCGA libraries #359

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I would like to have the statistics on this for the meeting.  Can we start a 
merge today?  I am attaching a spreadsheet with all the lanes to merge (by TCGA 
name, since maybe some samples have multiple libraries).

Original issue reported on code.google.com by benb...@gmail.com on 20 Nov 2012 at 6:40

GoogleCodeExporter commented 8 years ago
Sorry, I think I didn't save the attachment before sending. Use this new 
attachment.

Original comment by benb...@gmail.com on 20 Nov 2012 at 6:53

GoogleCodeExporter commented 8 years ago
When is the meeting? Merging large bisulfite datasets and calculating the 
metrics can take days of CPU time.

Original comment by zack...@gmail.com on 20 Nov 2012 at 7:14

GoogleCodeExporter commented 8 years ago
I will have to compile the data over this weekend (I leave Monday for the 
meeting). Once the merged BAMs are done, I have my own scripts that I can run 
to output coverage levels. I assume the merged BAMs can be output pretty 
quickly, and the QC metrics and Bis-SNP will lag.

Original comment by benb...@gmail.com on 20 Nov 2012 at 7:35

GoogleCodeExporter commented 8 years ago
OK, that should be doable.

Original comment by zack...@gmail.com on 20 Nov 2012 at 8:05

GoogleCodeExporter commented 8 years ago
Currently running on the cluster.

Original comment by zack...@gmail.com on 20 Nov 2012 at 10:33

GoogleCodeExporter commented 8 years ago
Some of the smaller BAMs are done.
I've noticed that our BAM-merging bisulfite workflow is now overwriting the read groups with a single new read group. This is bad, since we lose track of the original lanes. I'm writing a fix now that will preserve the old read groups.

Once that is complete (estimated a few hours), I will cancel all runs and restart.

Original comment by zack...@gmail.com on 21 Nov 2012 at 5:52

GoogleCodeExporter commented 8 years ago
It looks like 4 of the 13 crashed during Bis-SNP, or at least didn't complete? 
I can't see the error logs because I don't have permissions. Here's an example:

[bberman@hpc-uec:~/production-gs1/ga/analysis/Bisulfite_merge_2012-11-20] $ ls /export/uec-gs1/laird/shared/production/ga/analysis/Bisulfite_merge_2012-11-20/*A18*bissnp*
-rw------- 1 ramjan hsc-ar 14K Nov 22 04:46 /export/uec-gs1/laird/shared/production/ga/analysis/Bisulfite_merge_2012-11-20/uec_MERGING_MERGING_1_NIC1254A18_uscec_bissnp445043964234484988.sh.e2969495
-rw------- 1 ramjan hsc-ar 28K Nov 22 04:46 /export/uec-gs1/laird/shared/production/ga/analysis/Bisulfite_merge_2012-11-20/uec_MERGING_MERGING_1_NIC1254A18_uscec_bissnp445043964234484988.sh.o2969495

Original comment by benb...@gmail.com on 24 Nov 2012 at 6:18

GoogleCodeExporter commented 8 years ago
All files are now group-readable.
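
For future runs, a minimal sketch of how the log files could be made group-readable without touching any other permission bits. This is not the command that was actually run here; the function name `make_group_readable` and the `pattern` argument are hypothetical.

```python
import os
import stat
from pathlib import Path


def make_group_readable(directory, pattern="*"):
    """Add the group-read bit to regular files in `directory` matching `pattern`.

    Returns the names of the files whose mode was changed.
    """
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        mode = stat.S_IMODE(path.stat().st_mode)
        if not mode & stat.S_IRGRP:
            # Keep the existing bits and only add group read (e.g. 0600 -> 0640).
            os.chmod(path, mode | stat.S_IRGRP)
            changed.append(path.name)
    return changed
```

Only the owner of the files (or root) can change their mode, so this would have to run as the account that submitted the jobs.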

Original comment by zack...@gmail.com on 24 Nov 2012 at 7:53

GoogleCodeExporter commented 8 years ago
Here is the error:
##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/export/uec-gs1/laird/shared/production/ga/analysis/Bisulfite_merge_2012-11-20/results/2969703.hpc-pbs.usc.edu/ResultCount_MERGING_1_NIC1254A16.hg19_rCRSchrm.fa.bam} is malformed: Read HWI-ST550_0142:6:1301:7555:110477#0 is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK.  Please use http://www.broadinstitute.org/gsa/wiki/index.php/ReplaceReadGroups to fix this problem

I looked at the BAM, and it does contain both reads with read groups and reads 
without. I assume this is because some of the older input files didn't have 
read groups? The read named in the error has no RG tag:

[bberman@hpc-uec:~/production-gs1/ga/analysis/Bisulfite_merge_2012-11-20] $ samtools view results/MERGING/MERGING_1_NIC1254A16/ResultCount_MERGING_1_NIC1254A16.hg19_rCRSchrm.fa.bam | grep 'HWI-ST550_0142:6:1301:7555:110477#0'
HWI-ST550_0142:6:1301:7555:110477#0     163     chr1    98333   255     50M     =       98580   297     CTCACTCACTTTTCTCCTTCTACTATTACTGCTCATTCATTCCAATTTTT      CCCFFFFFHHHHHJJJJJJJJJJJJJJJIJJIJJIJIJIIJJJJJJJJJJ      NM:i:0  ZS:Z:--
HWI-ST550_0142:6:1301:7555:110477#0     83      chr1    98580   255     50M     =       98333   -297    ATATTCACTTCAACTCTACTAACATTTAATAAATATTATTAACTAACTAA      IIJJGHHIHHIHHIHGIHJGIHJJJJJJJHJJJJJJJHHHHHFFFDFCCC      NM:i:0  ZS:Z:-+
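
To see how widespread this is in a merged BAM, rather than grepping for one read, something like the following could tally records by read-group status. It is a rough sketch, not part of the pipeline: it assumes tab-separated SAM text with the header included (as produced by `samtools view -h`), and the function name `tally_read_groups` is made up for illustration.

```python
from collections import Counter


def tally_read_groups(sam_lines):
    """Count alignment records by read-group status.

    Categories:
      "defined"   - record has an RG:Z: tag whose ID appears in an @RG header line
      "undefined" - record has an RG:Z: tag that is not declared in the header
      "missing"   - record has no RG:Z: tag at all
    """
    header_ids = set()
    counts = Counter()
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            # Header section: collect the ID of every @RG line.
            if fields[0] == "@RG":
                for field in fields[1:]:
                    if field.startswith("ID:"):
                        header_ids.add(field[3:])
            continue
        # Optional tags start at the 12th column of an alignment record.
        rg = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), None)
        if rg is None:
            counts["missing"] += 1
        elif rg in header_ids:
            counts["defined"] += 1
        else:
            counts["undefined"] += 1
    return counts
```

Feeding it `sys.stdin` from `samtools view -h merged.bam` would give the three counts; the two records above would both land in "missing". The GATK error covers both the "missing" and the "undefined" case.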

Original comment by benb...@gmail.com on 24 Nov 2012 at 8:02

GoogleCodeExporter commented 8 years ago
Since read groups are missing for certain old lanes, and we can't add them to 
the merged BAM because we don't know which read came from which lane, the only 
options are:

- rerun those old lanes through the latest pipeline and remerge
- add read groups manually and rerun merging
- update my merging pipeline to detect mixed cases like this (it currently 
detects only the all-or-none cases) and remerge
- squish the merged BAMs into one read group and run Bis-SNP
- if Bis-SNP has an "-ignore-readgroups" flag, just rerun that step

All of these fixes except the last require a decent chunk of time to implement 
and test.

I don't know what's necessary for the meeting, but I'm headed out and won't be 
at a terminal until tomorrow at the earliest.

Original comment by zack...@gmail.com on 24 Nov 2012 at 8:16

GoogleCodeExporter commented 8 years ago
Issue 363, which I've fixed and which is being tested on this dataset, will 
resolve the problems mentioned above.

Original comment by zack...@gmail.com on 28 Nov 2012 at 10:05

GoogleCodeExporter commented 8 years ago
Bis-SNP no longer has an "-ignore-readgroups" flag; only the old version based 
on the GATK 1.0 framework could do this.

Original comment by lyping1...@gmail.com on 28 Nov 2012 at 10:28

GoogleCodeExporter commented 8 years ago
@12
I guess it doesn't matter now, since I've redone the merging code to insert an 
RG when an input without RGs is merged with one that has them.

If the inputs all lack RGs, we stick a single RG on the merged result, as 
happens when a FASTQ is split into pieces in the pipeline.
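
Roughly, the decision described above can be sketched as follows. This is an illustration of the rule, not the actual merging code: the function name `plan_read_groups`, the default ID `"merged"`, and the choice to derive a new RG ID from the input name are all assumptions made for the example.

```python
def plan_read_groups(inputs, merged_id="merged"):
    """Decide which read group each input's reads should carry after merging.

    `inputs` maps an input BAM name to the set of RG IDs found in its header
    (an empty set means the input has no read groups).

    Returns a dict mapping each input name to either None (keep the reads'
    existing RG tags) or the RG ID to stamp on all of that input's reads.
    """
    has_rg = {name for name, ids in inputs.items() if ids}
    if not has_rg:
        # No input has read groups: put one shared RG on the merged result.
        return {name: merged_id for name in inputs}

    used = set().union(*inputs.values())
    plan = {}
    for name in sorted(inputs):
        if name in has_rg:
            # Preserve the original lane-level read groups.
            plan[name] = None
            continue
        # Mixed case: give the RG-less input its own new read group,
        # avoiding a clash with any ID that is already in use.
        new_id, n = name, 1
        while new_id in used:
            new_id = f"{name}.{n}"
            n += 1
        used.add(new_id)
        plan[name] = new_id
    return plan
```

With one old lane lacking RGs and one new lane that has them, only the old lane gets a new ID, so every read in the merged BAM ends up with an RG that is declared in the header.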

Original comment by zack...@gmail.com on 28 Nov 2012 at 10:39

GoogleCodeExporter commented 8 years ago
Fixing issue 363 will complete this task.

This dataset is the test case for #363.

Original comment by zack...@gmail.com on 29 Nov 2012 at 11:26

GoogleCodeExporter commented 8 years ago
We need to re-run, because some directories still don't have Bis-SNP output:

/Volumes/storage/hpcc/uec-gs1/laird/shared/production/ga/analysis/Bisulfite_merge_2012-11-27/results/MERGING/MERGING_1_NIC1254A15
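
To list every affected directory rather than checking one at a time, a small sketch like this could be run against the `results/MERGING` directory. It assumes Bis-SNP output files can be recognized by a glob such as `*bissnp*`, which is a guess based on the job script names earlier in this thread and not a confirmed naming convention; `dirs_missing_output` is a hypothetical helper.

```python
from pathlib import Path


def dirs_missing_output(merging_root, pattern="*bissnp*"):
    """Return names of sample directories under `merging_root` that contain
    no file matching `pattern` (the pattern is an assumed naming convention)."""
    missing = []
    for sample_dir in sorted(Path(merging_root).iterdir()):
        if sample_dir.is_dir() and not any(sample_dir.glob(pattern)):
            missing.append(sample_dir.name)
    return missing
```

The pattern would need adjusting to whatever the Bis-SNP step actually writes, and a file that exists but is truncated would still count as present.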

Original comment by benb...@gmail.com on 17 Jan 2013 at 11:12

GoogleCodeExporter commented 8 years ago

Original comment by zack...@gmail.com on 21 Feb 2013 at 8:00