Open alkaZeltser opened 3 years ago
I'm not an expert on FASTQ headers, but it seems that read_group_identifier and platform_unit don't have to be the same. Here is an example of the input CSV file created from a CPTAC BAM. It seems their read_group_identifier is just the first 4 letters of the platform_unit, with some extra characters (I don't know where they come from) to resolve conflicts. The platform_unit doesn't have to be unique, and I guess GATK uses it internally. @tyamaguchi-ucla can probably give a more informed comment.
Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?
I had the same question in the uclahs-cds/pipeline-germline-somatic pipeline. We just need to make sure that the correct sample name is used in call-gSNP, but it might be good to use our internal ID? @tyamaguchi-ucla
Hi guys, @zhuchcn @alkaZeltser
As discussed in the last NF WG (https://confluence.mednet.ucla.edu/display/BOUTROSLAB/2021-10-20+Nextflow+Working+Group+Meeting+Notes), I suggest that we use something like:
- read_group_identifier -> library_identifier + '.' + lane # (it can be retrieved from FASTQ) (must be unique and required for BQSR)
- sequencing_center -> This one is almost impossible to automate. I guess the samples were sequenced at UNGC?
- library_identifier -> Usually in file name (required for Markduplicates) - hard to automate.
- platform_unit -> flowcell_id (see FASTQ read IDs) + '.' + lane # (it can be retrieved from FASTQ)
So, here we probably want to update the read_group_identifier, although the current CSV would work perfectly fine.
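A minimal Python sketch of the convention suggested above, in case it helps when building the CSV. The function name and the example values are assumptions for illustration only; this is not code from the pipeline.

```python
def build_read_group_fields(library_identifier: str, flowcell_id: str, lane: str) -> dict:
    """Assemble the two lane-dependent fields following the suggested convention.

    Hypothetical helper: the name and the returned dict layout are illustrative.
    """
    return {
        # read_group_identifier -> library_identifier + '.' + lane #
        "read_group_identifier": f"{library_identifier}.{lane}",
        # platform_unit -> flowcell_id + '.' + lane #
        "platform_unit": f"{flowcell_id}.{lane}",
    }


# Example values only; substitute the real library, flowcell and lane.
fields = build_read_group_fields("BE-1-Blood", "HMNVTDRXY", "1")
print(fields["read_group_identifier"])
print(fields["platform_unit"])
```

sequencing_center and library_identifier are left out on purpose, since both are described above as hard to automate.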
Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?
I had the same question in the uclahs-cds/pipeline-germline-somatic pipeline. We just need to make sure that the correct sample name is used in call-gSNP, but it might be good to use our internal ID? @tyamaguchi-ucla
Yeah, I was thinking about this as well. I think it would be nice to have both the external and internal IDs in the BAM header, so I'm thinking of using Sample -> Internal ID, and we could include the External ID in library_identifier, which will be passed to the RG identifier.
For lanes, I think we may want to standardize the field and use L00 + lane number for readability instead of using an integer.
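A small Python sketch of that lane formatting. It assumes the intent is a three-digit, zero-padded lane like the L001 in ILLUMINA FASTQ file names; that padding rule and the function name are assumptions, not something agreed in this thread.

```python
def format_lane(lane) -> str:
    """Format a lane number as 'L' plus a zero-padded three-digit number.

    Assumption: 'L00' + lane number means three-digit padding, so lane 1
    becomes 'L001'. Lanes 1-9 give the same result as literally prefixing 'L00'.
    """
    return f"L{int(lane):03d}"


print(format_lane(1))
print(format_lane("4"))
```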
https://samtools.github.io/hts-specs/SAMv1.pdf
https://en.wikipedia.org/wiki/FASTQ_format
https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
Sentieon® recommendations
@tyamaguchi-ucla Hi, I have a problem with the read_group_identifier convention. It says: read_group_identifier -> library_identifier + '.' + lane # (it can be retrieved from FASTQ) (must be unique and required for BQSR)
but there is a case where I don't have unique read_group_identifiers if I use this convention.
Example:
less /hot/data/unregistered/Movember-Hiyari-Bone-GAP62/2021-9126/BE-1-Blood_L001_ds.aee418f085464ff89488437ce340b52a/BE-1-Blood_S3_L001_R1_001.fastq.gz | head -n1
@A00817:341:HMNVTDRXY:1:2101:1271:1000 1:N:0:ACAGGTAT+ATGGTGGC
less /hot/data/unregistered/Movember-Hiyari-Bone-GAP62/2021-9126/BE-1-Blood_L001_ds.018ae688d10d49f7bdca5bb4932df2ab/BE-1-Blood_S3_L001_R1_001.fastq.gz | head -n1
@A00817:337:H5VJWDSX3:1:1101:3134:1000 1:N:0:ACAGGTAT+ATGGTGGC
Using this library_identifier + '.' + lane # convention, I will end up with two non-unique read_group_identifiers: BE-1-Blood.1 (in both cases the library name is BE-1-Blood and the lane is lane 1). Could you please advise? Thank you!
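To make the collision concrete, here is a rough Python sketch that pulls the flowcell ID and lane out of the two headers above and builds both fields with the convention. The helper name is hypothetical, and it assumes the colon-separated ILLUMINA header layout (instrument, run number, flowcell ID, lane, ...).

```python
from typing import Tuple


def flowcell_and_lane(header: str) -> Tuple[str, str]:
    """Return (flowcell ID, lane) from an ILLUMINA FASTQ header line.

    Assumes the third and fourth colon-separated fields before the first
    space are the flowcell ID and the lane.
    """
    fields = header.lstrip("@").split(" ")[0].split(":")
    return fields[2], fields[3]


headers = [
    "@A00817:341:HMNVTDRXY:1:2101:1271:1000 1:N:0:ACAGGTAT+ATGGTGGC",
    "@A00817:337:H5VJWDSX3:1:1101:3134:1000 1:N:0:ACAGGTAT+ATGGTGGC",
]
library_identifier = "BE-1-Blood"

rg_ids = []
platform_units = []
for header in headers:
    flowcell_id, lane = flowcell_and_lane(header)
    rg_ids.append(f"{library_identifier}.{lane}")   # library_identifier + '.' + lane
    platform_units.append(f"{flowcell_id}.{lane}")  # flowcell_id + '.' + lane

print(rg_ids)
print(platform_units)
```

With these two headers the read_group_identifier values are identical, while the platform_unit values differ because the flowcell IDs differ.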
@graceooh isn't it the same case we saw in the PRESTO dataset (the same library sequenced twice using the same lane)? Maybe you can check with Sarah and see how the samples were processed?
Yes that's right! OK I'll check and add -01 like we did for PRESTO then. Thanks!
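For anyone hitting the same case, one possible way to script the suffixing is sketched below in Python. The exact PRESTO convention is not spelled out in this thread, so the numbering scheme here (append -01, -02, ... only to IDs that collide) and the function name are assumptions.

```python
from collections import Counter
from typing import List


def disambiguate(read_group_ids: List[str]) -> List[str]:
    """Append a two-digit suffix to read group IDs that occur more than once.

    Assumed scheme: unique IDs are left untouched; repeated IDs get
    '-01', '-02', ... in input order.
    """
    totals = Counter(read_group_ids)
    seen = Counter()
    result = []
    for rg_id in read_group_ids:
        if totals[rg_id] == 1:
            result.append(rg_id)
        else:
            seen[rg_id] += 1
            result.append(f"{rg_id}-{seen[rg_id]:02d}")
    return result


unique_ids = disambiguate(["BE-1-Blood.1", "BE-1-Blood.1", "BE-1-Blood.2"])
print(unique_ids)
```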
I'm constructing a csv input file for a sample (from an ILLUMINA sequencer) and am a little confused about some of the fields.
Referencing gatk, it seems that there is a lot of redundancy in the input fields. For example, both the read_group_identifier (ID) and platform_unit (PU) fields are constructed using the flowcell ID and lane number (for ILLUMINA reads). Then the lane number is provided separately as another field, to be concatenated with the ID field. Therefore the ID field should really just be the flowcell in my case?
Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?
For example, an input csv for a sample with the following fastq filename:
FD00123067_S14_L001_R1_001.fastq.gz
And the following fastq header:
@A00817:312:HKTWMDRXY:1:1101:3106:1016 1:N:0:GATAGGCCGA+GCCATGTGCG
parsed using the following ILLUMINA header format:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
Would look like this: