uec / Issue.Tracker

Automatically exported from code.google.com/p/usc-epigenome-center
Get ecdp to recognize H12H9BGXX and 140828_NS500449_0002_AH12H9BGXX #816

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
This run is similar to how things could look from now on, so let's try to make 
ECDP recognize as much of it as possible.

Original issue reported on code.google.com by cmnico...@gmail.com on 24 Sep 2014 at 11:45

GoogleCodeExporter commented 8 years ago
The new output format is very confusing.

A few problems with the flowcell H12H9BGXX:

Problem #1. 
How do I figure out which fastq corresponds to which sample?

In OLD runs (HiSeq runs, recent format) the sample name was included in the 
fastq file name. For example: XRT_2-24-14_ATTCCT_L006_R1_001.fastq.gz. The name 
XRT_2-24-14 is part of the fastq name.

In this flowcell run it is NOT included. For example: 
K50_S2_L000_R1_001.fastq.gz and K50_S2_L000_R2_001.fastq.gz are the fastqs for 
sample RNAcomp1, but RNAcomp1 is NOT part of the fastq name.

So, to make changes in the code I need to know how you are associating fastq 
names with sample names. Otherwise it's not clear how to match fastqs to 
samples.
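
To make the difference between the two naming schemes concrete, here is a minimal Python sketch that splits both example names into fields. The field names (sample, barcode, name, index, lane, read, chunk) and the regular expressions are assumptions inferred only from the two examples above, not from the ECDP code or any naming specification.

```python
import re

# Assumed old-style layout: <sample>_<barcode>_L<lane>_R<read>_<chunk>.fastq.gz
OLD_STYLE = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<chunk>\d{3})\.fastq\.gz$"
)
# Assumed new-style layout: <name>_S<index>_L<lane>_R<read>_<chunk>.fastq.gz
NEW_STYLE = re.compile(
    r"^(?P<name>.+)_S(?P<index>\d+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<chunk>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the parsed fields plus a 'style' key, or None if nothing matches."""
    for style, pattern in (("old", OLD_STYLE), ("new", NEW_STYLE)):
        match = pattern.match(filename)
        if match:
            fields = match.groupdict()
            fields["style"] = style
            return fields
    return None


old = parse_fastq_name("XRT_2-24-14_ATTCCT_L006_R1_001.fastq.gz")
new = parse_fastq_name("K50_S2_L000_R1_001.fastq.gz")
print(old["sample"], old["barcode"])  # sample name and barcode are both in the name
print(new["name"], new["index"])      # no barcode, and no "RNAcomp1" anywhere
```

The new-style name carries neither the barcode nor the analysis sample name, so the association has to come from somewhere outside the file name.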

Problem #2. 
Sample names entered in Genologics do not match the sample names under 
flowcells/results/sample analysis dirs. 
For example:
Sample names from Genologics:
           RNAseq_TruSeq_1000ng_SSIII
           RNAseq_TruSeq_1000ng_SSII    
           RNAseq_KAPA_50ng 
           RNAseq_TruSeq_50ng_SSIII 
           RNAseq_KAPA_1000ng   
           RNAseq_TruSeq_50ng_SSII

Sample names in analysis directory:
           RNAcomp1
           RNAcomp2
           RNAcomp3
           RNAcomp4
           RNAcomp5
           RNAcomp6

How do I associate Genologics sample names with sample analysis dir names?

Problem #3 
The sample analysis dir format "flowcell_lane_sampleName" 
(e.g. H12H9BGXX_0_RNAcomp1) is different from our usual format 
"flowcell_lane_limsID" (e.g. C4WTGACXX_1_CAR1734A10).
Was that change absolutely necessary? If not, I would prefer to keep the usual 
format if that's possible.
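
As a small illustration of why the change matters, here is a Python sketch that splits an analysis dir name on its first two underscores. The function name parse_analysis_dir is hypothetical. Both formats split cleanly, but the split alone cannot tell whether the third field is a sample name or a LIMS ID.

```python
def parse_analysis_dir(dirname):
    """Split '<flowcell>_<lane>_<identifier>' into its three parts.

    Only the first two underscores are treated as separators, so an
    identifier that itself contains underscores is kept whole.
    """
    flowcell, lane, identifier = dirname.split("_", 2)
    return flowcell, int(lane), identifier


# New format: third field is a sample name.
print(parse_analysis_dir("H12H9BGXX_0_RNAcomp1"))
# Usual format: third field is a LIMS ID.
print(parse_analysis_dir("C4WTGACXX_1_CAR1734A10"))
```

Any code that looks up the third field as a LIMS ID would need some other way to detect the new format.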

Original comment by natalia....@gmail.com on 30 Sep 2014 at 10:53

GoogleCodeExporter commented 8 years ago
Geneus IDs are key here. Sample names are irrelevant since they are not distinct.

Original comment by zack...@gmail.com on 30 Sep 2014 at 11:13

GoogleCodeExporter commented 8 years ago
Which problem are you referring to?

Original comment by natalia....@gmail.com on 30 Sep 2014 at 11:48

GoogleCodeExporter commented 8 years ago
For #1:
Let's say I want to make fastqs show up for samples without analysis. I 
have the geneusID parsed from Genologics and fastqs under the runs dir named 
"Undetermined_S0_L000_R1_001.fastq.gz". 
How would you associate the geneusID with that fastq?

Original comment by natalia....@gmail.com on 30 Sep 2014 at 11:56

GoogleCodeExporter commented 8 years ago
With CASAVA, the barcode used to be in the filename, so we could use that 
to determine the sample. Now it looks like this is no longer the case. 

The only way I see is to use the Stats/ dir and parse the XML:

..
<Sample name="ADL_GB">
        <Barcode name="GTCCGC">
..

The barcode will be in geneus, and lanes can be ignored since it is basically a 
single lane. 

pseudocode outline:
for all fastqs
$d = readlink to find abs path
$xml = dirname($d)/Stats/DemultiplexingStats.xml
find matching barcode from geneus in XML
if match, return matching fastq. 
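
The outline above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the XML is assumed to nest Barcode elements inside Sample elements as in the excerpt, entries named "all" or "unknown" are assumed to be aggregates rather than real barcodes, and each fastq name is assumed to start with the XML sample name followed by "_S".

```python
import os
import xml.etree.ElementTree as ET


def barcode_to_sample(xml_path):
    """Map each barcode in DemultiplexingStats.xml to its sample name."""
    mapping = {}
    root = ET.parse(xml_path).getroot()
    for sample in root.iter("Sample"):
        for barcode in sample.iter("Barcode"):
            name = barcode.get("name")
            # Assumption: "all" / "unknown" are aggregate entries, not barcodes.
            if name and name.lower() not in ("all", "unknown"):
                mapping[name] = sample.get("name")
    return mapping


def fastqs_for_barcode(fastq_paths, geneus_barcode):
    """Return the fastqs whose run dir maps geneus_barcode to their sample."""
    matches = []
    cache = {}  # parse each DemultiplexingStats.xml only once
    for path in fastq_paths:
        real = os.path.realpath(path)  # the "readlink" step of the outline
        xml_path = os.path.join(
            os.path.dirname(real), "Stats", "DemultiplexingStats.xml"
        )
        if xml_path not in cache:
            found = os.path.isfile(xml_path)
            cache[xml_path] = barcode_to_sample(xml_path) if found else {}
        sample = cache[xml_path].get(geneus_barcode)
        # Assumption: the fastq name starts with "<sample name>_S".
        if sample and os.path.basename(real).startswith(sample + "_S"):
            matches.append(path)
    return matches
```

A call such as fastqs_for_barcode(paths, "GTCCGC") would return only the fastqs for the sample that carries that barcode. Files named Undetermined_* would never match under these assumptions.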

Original comment by zack...@gmail.com on 1 Oct 2014 at 2:36

GoogleCodeExporter commented 8 years ago
Testing

Original comment by natalia....@gmail.com on 9 Oct 2014 at 10:06

GoogleCodeExporter commented 8 years ago
Hi Zack,
for H0VUPAGXX (NextSeq flowcell) that has analysis, getLegacyQCForLib is not 
returning a qcreport because the laneNum value is set to 0 in 
H0VUPAGXX_qcmetrics.csv files for samples (see below for CAP1727A56):

[natalia@epifire2 H0VUPAGXX_1_CAP1727A56]$ cat H0VUPAGXX_qcmetrics.csv
FlowCelln,laneNum,nocontamSeqs,contamSeqs,contamPolyaSeqs,contamAdaptersSeqs,contamAdapterTrimSeqs,GAATGGAATG,TATTTTATTT,CATTCCATTC
/export/uec-gs1/laird/shared/production/ga/flowcells/H0VUPAGXX/results/H0VUPAGXX/H0VUPAGXX_1_CAP1727A56,0,64325180,1368,815,58,495,192856,59089,187431

I think this is because of the way the pipeline parses the workflow parameter 
file, but I am not sure. Could you please tell us why the lane number is set to 
zero in this case?

Original comment by natalia....@gmail.com on 10 Oct 2014 at 10:27

GoogleCodeExporter commented 8 years ago
Because there are no lane numbers with NextSeq, the default is zero.

-zack

Original comment by zack...@gmail.com on 10 Oct 2014 at 10:29

GoogleCodeExporter commented 8 years ago
Is that default value set somewhere in the pipeline programs?

Original comment by natalia....@gmail.com on 10 Oct 2014 at 10:31

GoogleCodeExporter commented 8 years ago
I believe it's from the fastq file or LIMS or workflowparams. Could be a default 
elsewhere though.

-zack

Original comment by zack...@gmail.com on 10 Oct 2014 at 10:33

GoogleCodeExporter commented 8 years ago
Workflowparams and LIMS have it set to 1, and this is why the qcreport is not 
returned.
What would you recommend as a possible solution? I ended up replacing 0 with 
1 inside the qc file, but maybe there is a better, cleaner way?

Original comment by natalia....@gmail.com on 10 Oct 2014 at 10:37

GoogleCodeExporter commented 8 years ago
I would probably fix the code to allow for a 0 as a valid lane. 
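
One way this could look, sketched in Python: keep the lane value from LIMS as it is, and only relax the comparison so that a laneNum of 0 in the qcmetrics file counts as "no lane number" and matches any expected lane. The names NO_LANE, lanes_match and qc_rows_for_lane are hypothetical and are not taken from getLegacyQCForLib or any other ECDP code.

```python
import csv
import io

# Assumption: 0 is what the pipeline writes when the run has no lane numbers.
NO_LANE = 0


def lanes_match(expected_lane, qc_lane):
    """True if the QC lane equals the expected lane or carries no lane number."""
    qc_lane = int(qc_lane)
    return qc_lane == NO_LANE or qc_lane == int(expected_lane)


def qc_rows_for_lane(csv_text, expected_lane):
    """Return the qcmetrics rows that are valid for expected_lane."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if lanes_match(expected_lane, row["laneNum"])]
```

With this comparison the lane value of 1 coming from LIMS or workflowparams can stay untouched, and the qcmetrics file does not need to be edited by hand.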

Original comment by zack...@gmail.com on 10 Oct 2014 at 10:38

GoogleCodeExporter commented 8 years ago
After getting flowcells from Genologics, the script uses the lane 
information from there, and it is "1", so I can't do anything to fix that part 
and allow 0 as a valid lane.

Original comment by natalia....@gmail.com on 10 Oct 2014 at 10:43