Open GoogleCodeExporter opened 8 years ago
The new output format is very confusing
A few problems with the flowcell H12H9BGXX:
Problem #1.
How do I figure out which fastq corresponds to which sample?
In OLD runs (HiSeq runs, recent format) the sample name was included in the
fastq file name. For example: XRT_2-24-14_ATTCCT_L006_R1_001.fastq.gz, where
XRT_2-24-14 is the sample name.
In this flowcell run it is NOT included. For example:
K50_S2_L000_R1_001.fastq.gz and K50_S2_L000_R2_001.fastq.gz are the fastqs for
sample RNAcomp1, but RNAcomp1 is NOT part of the fastq name.
So, to make changes in the code I need to know how you are associating fastq
names with sample names. Otherwise it's not clear how to match fastqs to
samples.
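To make the difference between the two naming layouts concrete, here is a minimal Python sketch that splits both example names into fields. The field names (`prefix`, `barcode`, `index`, `lane`, `read`) and the function `parse_fastq_name` are hypothetical labels chosen for illustration, and the patterns are inferred only from the example names above, so other layouts may exist.

```python
import re

# Older layout from the HiSeq example: <prefix>_<barcode>_L<lane>_R<read>_<chunk>.fastq.gz
OLD_NAME = re.compile(
    r"^(?P<prefix>.+)_(?P<barcode>[ACGTN]+)_L(?P<lane>\d{3})_R(?P<read>\d)_\d{3}\.fastq\.gz$"
)
# Layout seen on this flowcell: <prefix>_S<n>_L<lane>_R<read>_<chunk>.fastq.gz
NEW_NAME = re.compile(
    r"^(?P<prefix>.+)_S(?P<index>\d+)_L(?P<lane>\d{3})_R(?P<read>\d)_\d{3}\.fastq\.gz$"
)

def parse_fastq_name(filename):
    """Return (layout, fields) for a fastq file name, or None if neither layout fits."""
    for layout, pattern in (("old", OLD_NAME), ("new", NEW_NAME)):
        match = pattern.match(filename)
        if match:
            return layout, match.groupdict()
    return None

print(parse_fastq_name("XRT_2-24-14_ATTCCT_L006_R1_001.fastq.gz"))
print(parse_fastq_name("K50_S2_L000_R1_001.fastq.gz"))
```

In the new layout the barcode is gone and the prefix (K50) is not the analysis dir name (RNAcomp1), so the file name alone cannot identify the sample.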
Problem #2.
Sample names entered in Genologics do not match sample names under
flowcells/results/sample analysis dirs.
For example:
Sample names from Genologics:
RNAseq_TruSeq_1000ng_SSIII
RNAseq_TruSeq_1000ng_SSII
RNAseq_KAPA_50ng
RNAseq_TruSeq_50ng_SSIII
RNAseq_KAPA_1000ng
RNAseq_TruSeq_50ng_SSII
Sample names in analysis directory:
RNAcomp1
RNAcomp2
RNAcomp3
RNAcomp4
RNAcomp5
RNAcomp6
How do I associate Genologics sample names with sample analysis dir names?
Problem #3
The sample analysis dir format "flowcell_lane_sampleName"
(e.g. H12H9BGXX_0_RNAcomp1) is different from our usual format
"flowcell_lane_limsID" (e.g. C4WTGACXX_1_CAR1734A10).
Was that change absolutely necessary? If not, I would prefer to keep the usual
format if that's possible.
Original comment by natalia....@gmail.com
on 30 Sep 2014 at 10:53
geneus IDs are key here. sample names are irrelevant since they are not distinct
Original comment by zack...@gmail.com
on 30 Sep 2014 at 11:13
Which problem are you referring to?
Original comment by natalia....@gmail.com
on 30 Sep 2014 at 11:48
for #1 :
Let's say I want fastqs to show up for samples without analysis. I
have the geneusID parsed from Genologics and fastqs under the runs dir named
"Undetermined_S0_L000_R1_001.fastq.gz"
How would you associate the geneusID with that fastq?
Original comment by natalia....@gmail.com
on 30 Sep 2014 at 11:56
with casava, it used to have the barcode in the filename, so we could use that
to determine the sample. Now, it looks like this is no longer the case.
The only way I see is to use the Stats/ dir and parse the xml:
..
<Sample name="ADL_GB">
<Barcode name="GTCCGC">
..
The barcode will be in geneus, and lanes can be ignored since it is basically a
single lane.
pseudocode outline:
for all fastqs
    $d = readlink to find abs path
    $xml = dirname($d)/Stats/DemultiplexingStats.xml
    find matching barcode from geneus in XML
    if match, return matching fastq.
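The outline above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the XML is assumed to nest `<Barcode>` elements inside `<Sample>` elements as in the fragment shown, the fastq name is assumed to start with the sample name from the XML, and the function names are hypothetical rather than existing pipeline code.

```python
import os
import re
import xml.etree.ElementTree as ET

def stats_xml_path(fastq_path):
    """Resolve symlinks, then point at Stats/DemultiplexingStats.xml beside the fastq."""
    real = os.path.realpath(fastq_path)
    return os.path.join(os.path.dirname(real), "Stats", "DemultiplexingStats.xml")

def barcode_to_sample(xml_text):
    """Map each barcode name to the name of the <Sample> element enclosing it."""
    mapping = {}
    for sample in ET.fromstring(xml_text).iter("Sample"):
        for barcode in sample.iter("Barcode"):
            name = barcode.get("name")
            # Assumption: an entry named "all" is an aggregate, not a real barcode.
            if name and name != "all":
                mapping[name] = sample.get("name")
    return mapping

def fastqs_for_barcode(fastq_paths, barcode, xml_text):
    """Return the fastqs whose file name starts with the sample that owns `barcode`."""
    sample = barcode_to_sample(xml_text).get(barcode)
    if sample is None:
        return []
    # Assumption: names follow <sample>_S<n>_L<lane>_R<read>_<chunk>.fastq.gz
    pattern = re.compile(re.escape(sample) + r"_S\d+_L\d+_R\d+_\d+\.fastq\.gz$")
    return [p for p in fastq_paths if pattern.match(os.path.basename(p))]
```

A caller would take the barcode from geneus, read the file at `stats_xml_path(fastq)`, and pass its text to `fastqs_for_barcode`. Reads in an "Undetermined" fastq have no barcode entry to match, so they would come back unassigned.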
Original comment by zack...@gmail.com
on 1 Oct 2014 at 2:36
Testing
Original comment by natalia....@gmail.com
on 9 Oct 2014 at 10:06
Hi Zack,
for H0VUPAGXX (NextSeq flowcell) that has analysis, getLegacyQCForLib is not
returning a qcreport because the laneNum value is set to 0 in
H0VUPAGXX_qcmetrics.csv files for samples (see below for CAP1727A56):
[natalia@epifire2 H0VUPAGXX_1_CAP1727A56]$ cat H0VUPAGXX_qcmetrics.csv
FlowCelln,laneNum,nocontamSeqs,contamSeqs,contamPolyaSeqs,contamAdaptersSeqs,contamAdapterTrimSeqs,GAATGGAATG,TATTTTATTT,CATTCCATTC
/export/uec-gs1/laird/shared/production/ga/flowcells/H0VUPAGXX/results/H0VUPAGXX/H0VUPAGXX_1_CAP1727A56,0,64325180,1368,815,58,495,192856,59089,187431
I think this is because of the way the pipeline parses the workflow parameter
file, but I am not sure. Could you please tell us why the lane number is set to
zero in this case?
Original comment by natalia....@gmail.com
on 10 Oct 2014 at 10:27
because there are no lane numbers with nextSeq, so the default is zero.
-zack
Original comment by zack...@gmail.com
on 10 Oct 2014 at 10:29
Is that default value set up somewhere in pipeline programs?
Original comment by natalia....@gmail.com
on 10 Oct 2014 at 10:31
I believe it's from the fastq file or LIMS or workflowparams. Could be a default
elsewhere though.
-zack
Original comment by zack...@gmail.com
on 10 Oct 2014 at 10:33
workflowparams and LIMS have it set to 1, and this is why the qcreport is not
returned.
What would you recommend as a possible solution? I ended up replacing 0 with
1 inside the qc file, but maybe there are better and cleaner ways?
Original comment by natalia....@gmail.com
on 10 Oct 2014 at 10:37
I would probably fix the code to allow for a 0 as a valid lane.
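One way to do that, sketched in Python under assumptions: treat a laneNum of 0 in the qcmetrics file as "no lane" and accept it for whatever lane LIMS reports. The names `NO_LANE`, `lane_matches` and `qc_rows_for_lane` are hypothetical, and this is not the actual getLegacyQCForLib code.

```python
import csv
import io

NO_LANE = 0  # lane value written for NextSeq runs, which have no lane numbers

def lane_matches(qc_lane, lims_lane):
    """True when a QC row should be accepted for the lane that LIMS reports."""
    return qc_lane == NO_LANE or qc_lane == lims_lane

def qc_rows_for_lane(csv_text, lims_lane):
    """Return qcmetrics rows whose laneNum is compatible with the LIMS lane."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if lane_matches(int(row["laneNum"]), lims_lane)]
```

With this comparison the qc file would not need to be edited by hand, and the lane value coming from LIMS could stay at 1.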
Original comment by zack...@gmail.com
on 10 Oct 2014 at 10:38
After getting flowcells from Genologics, the script uses the lane
information from there, and it is "1", so I can't do anything to fix that part
and allow 0 as a valid lane
Original comment by natalia....@gmail.com
on 10 Oct 2014 at 10:43
Original issue reported on code.google.com by
cmnico...@gmail.com
on 24 Sep 2014 at 11:45