Input csv fields: redundancy ok?

alkaZeltser commented 3 years ago

I'm constructing a csv input file for a sample (from an ILLUMINA sequencer) and am a little confused about some of the fields.

Going by the GATK documentation, there seems to be a lot of redundancy in the input fields. For example, both the read_group_identifier (ID) and platform_unit (PU) fields are constructed from the flowcell ID and lane number (for ILLUMINA reads). The lane number is then provided separately as another field, to be concatenated with the ID field. So should the ID field really just be the flowcell ID in my case?

Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?

For example, an input csv for a sample with the following fastq filename: FD00123067_S14_L001_R1_001.fastq.gz

And the following fastq header : @A00817:312:HKTWMDRXY:1:1101:3106:1016 1:N:0:GATAGGCCGA+GCCATGTGCG

parsed using the following ILLUMINA header format: @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

Would look like this:

index	read_group_identifier	sequencing_center	library_identifier	platform_technology	platform_unit	sample	lane	read1_fastq	read2_fastq
1	HKTWMDRXY	UNGC	FD00123067	ILLUMINA	HKTWMDRXY.1	FD00123067(original) or BZPRGPT1000001-N001-B01-F(internal ID)	1	/path/to/fastq/pair/r1.fastq.gz	/path/to/fastq/pair/r2.fastq.gz
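For reference, the flowcell ID and lane used in the row above can be pulled out of the first FASTQ header programmatically. The following is only a minimal sketch, assuming the ILLUMINA header layout quoted in this comment; `IlluminaHeader`, `parse_illumina_header` and `first_header` are hypothetical helper names, not part of the pipeline.

```python
import gzip
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    """Leading fields of an ILLUMINA FASTQ read ID (hypothetical helper type)."""
    instrument: str
    run_number: str
    flowcell_id: str
    lane: int


def parse_illumina_header(header: str) -> IlluminaHeader:
    """Parse '@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x>:<y> ...'."""
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header line: {header!r}")
    # Drop the '@' and the space-separated second half (read, filter flag, ...).
    read_id = header[1:].split(" ", 1)[0]
    fields = read_id.split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected ILLUMINA header layout: {header!r}")
    instrument, run_number, flowcell_id, lane = fields[:4]
    return IlluminaHeader(instrument, run_number, flowcell_id, int(lane))


def first_header(fastq_path: str) -> str:
    """Return the first line of a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        return handle.readline().rstrip("\n")


# The header quoted in this issue.
header = parse_illumina_header(
    "@A00817:312:HKTWMDRXY:1:1101:3106:1016 1:N:0:GATAGGCCGA+GCCATGTGCG"
)
platform_unit = f"{header.flowcell_id}.{header.lane}"
print(platform_unit)  # HKTWMDRXY.1
```

This assumes every read in a FASTQ file shares the same flowcell and lane, so only the first header is inspected.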

zhuchcn commented 3 years ago

I'm not an expert on FASTQ headers, but it seems like the read_group_identifier and platform_unit don't have to be the same. Here is an example of the input CSV file created from a CPTAC BAM. It looks like their read_group_identifier is just the first 4 letters of the platform_unit, with some extra characters (I don't know where they come from) to resolve conflicts. The platform_unit doesn't have to be unique, and I guess GATK uses it internally. @tyamaguchi-ucla can probably give a more informed answer.

Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?

I had the same question in the uclahs-cds/pipeline-germline-somatic pipeline. We just need to make sure that the correct sample name is used in call-gSNP, but it might be good to use our internal ID? @tyamaguchi-ucla

tyamaguchi-ucla commented 3 years ago

Hi guys, @zhuchcn @alkaZeltser

As discussed in the last Nextflow Working Group (NF WG) meeting (https://confluence.mednet.ucla.edu/display/BOUTROSLAB/2021-10-20+Nextflow+Working+Group+Meeting+Notes), I suggest that we use something like:

read_group_identifier -> library_identifier + '.' + lane # (it can be retrieved from FASTQ) (must be unique and required for BQSR)
sequencing_center -> This one is almost impossible to automate. I guess the samples were sequenced at UNGC?
library_identifier -> Usually in the file name (required for MarkDuplicates) - hard to automate.
platform_unit -> flowcell_id (see FASTQ read IDs) + '.' + lane # (it can be retrieved from FASTQ)

So, here we probably want to update the read_group_identifier although the current CSV would work perfectly fine.
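To make the proposed convention concrete, here is a rough sketch of how one CSV row could be assembled from it. The column names are taken from the example row earlier in this issue; `build_read_group_row` is a hypothetical helper, and sequencing_center and library_identifier are passed in by hand because, as noted above, they are hard to automate.

```python
def build_read_group_row(
    library_identifier: str,
    flowcell_id: str,
    lane: int,
    sample: str,
    sequencing_center: str,
    platform_technology: str = "ILLUMINA",
) -> dict:
    """Assemble one input CSV row following the convention proposed above."""
    return {
        # library_identifier + '.' + lane; must be unique (required for BQSR).
        "read_group_identifier": f"{library_identifier}.{lane}",
        "sequencing_center": sequencing_center,
        # Required for MarkDuplicates; usually taken from the file name.
        "library_identifier": library_identifier,
        "platform_technology": platform_technology,
        # flowcell_id + '.' + lane, both available from the FASTQ read IDs.
        "platform_unit": f"{flowcell_id}.{lane}",
        "sample": sample,
        "lane": lane,
    }


# Values from the example earlier in this issue.
row = build_read_group_row(
    library_identifier="FD00123067",
    flowcell_id="HKTWMDRXY",
    lane=1,
    sample="BZPRGPT1000001-N001-B01-F",
    sequencing_center="UNGC",
)
print(row["read_group_identifier"])  # FD00123067.1
print(row["platform_unit"])          # HKTWMDRXY.1
```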

Also, for the sample field, would I use the internal sample ID or the original (external) sample ID?

I had the same question in the uclahs-cds/pipeline-germline-somatic pipeline. We just need to make sure that the correct sample name is used in call-gSNP, but it might be good to use our internal ID? @tyamaguchi-ucla

Yeah, I was thinking about this as well. I think it would be nice to have both the external and internal IDs in the BAM header, so I'm thinking of using

Sample -> Internal ID, and we could include the External ID in library_identifier, which will be passed to the RG identifier.

For lanes, I think we may want to standardize the field and use L00 + lane number (e.g. L001) for readability instead of a plain integer.
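One possible way to standardize the lane field is sketched below. It assumes the lane is zero-padded to three digits, as in ILLUMINA FASTQ file names such as FD00123067_S14_L001_R1_001.fastq.gz; that padding rule is an assumption (taking "L00 + lane number" literally would give four digits for lanes 10 and up), and `format_lane` is a hypothetical helper.

```python
def format_lane(lane: int) -> str:
    """Format an integer lane as 'L' plus a three-digit, zero-padded number.

    Assumption: three-digit padding, matching the 'L001' part of ILLUMINA
    FASTQ file names.
    """
    if lane < 1:
        raise ValueError(f"lane must be a positive integer, got {lane}")
    return f"L{lane:03d}"


print(format_lane(1))  # L001
print(format_lane(4))  # L004
```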

Some references

https://samtools.github.io/hts-specs/SAMv1.pdf
https://en.wikipedia.org/wiki/FASTQ_format
https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups

Sentieon® recommendations:

https://support.sentieon.com/appnotes/read_groups/

graceooh commented 2 years ago

@tyamaguchi-ucla Hi, I have a problem with the read_group_identifier convention. It says: read_group_identifier -> library_identifier + '.' + lane # (it can be retrieved from FASTQ) (must be unique and required for BQSR)

but there is a case where I don't have unique read_group_identifiers if I use this convention.

Example:

less /hot/data/unregistered/Movember-Hiyari-Bone-GAP62/2021-9126/BE-1-Blood_L001_ds.aee418f085464ff89488437ce340b52a/BE-1-Blood_S3_L001_R1_001.fastq.gz | head -n1
@A00817:341:HMNVTDRXY:1:2101:1271:1000 1:N:0:ACAGGTAT+ATGGTGGC

less /hot/data/unregistered/Movember-Hiyari-Bone-GAP62/2021-9126/BE-1-Blood_L001_ds.018ae688d10d49f7bdca5bb4932df2ab/BE-1-Blood_S3_L001_R1_001.fastq.gz | head -n1
@A00817:337:H5VJWDSX3:1:1101:3134:1000 1:N:0:ACAGGTAT+ATGGTGGC

Using this library_identifier + '.' + lane # convention, I will end up with two identical read_group_identifiers: BE-1-Blood.1 (in both cases the library name is BE-1-Blood and the lane is lane 1). Could you please advise? Thank you!

tyamaguchi-ucla commented 2 years ago

@graceooh isn't it the same case we saw in the PRESTO dataset (the same library sequenced twice using the same lane)? Maybe you can check with Sarah and see how the samples were processed?

graceooh commented 2 years ago

Yes that's right! OK I'll check and add -01 like we did for PRESTO then. Thanks!
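For anyone hitting the same collision, here is a rough sketch of how duplicated read_group_identifiers could be detected and given a numeric suffix. The exact scheme used for PRESTO is not spelled out in this thread, so the numbering below (-01, -02, ... in input order, applied to every member of a colliding group) is an assumption, and `disambiguate_read_group_ids` is a hypothetical helper.

```python
from collections import Counter


def disambiguate_read_group_ids(ids: list) -> list:
    """Append -01, -02, ... to any read group ID that occurs more than once.

    Assumption: the suffix format and ordering are illustrative only; IDs that
    are already unique are returned unchanged.
    """
    totals = Counter(ids)
    seen = Counter()
    result = []
    for rg_id in ids:
        if totals[rg_id] == 1:
            result.append(rg_id)
        else:
            seen[rg_id] += 1
            result.append(f"{rg_id}-{seen[rg_id]:02d}")
    return result


# The same library sequenced twice on lane 1, plus one unaffected read group.
ids = ["BE-1-Blood.1", "BE-1-Blood.1", "BE-1-Blood.2"]
fixed = disambiguate_read_group_ids(ids)
print(fixed)  # ['BE-1-Blood.1-01', 'BE-1-Blood.1-02', 'BE-1-Blood.2']
```

The sketch does not check whether a suffixed ID happens to collide with another ID already in the list, so the result should still be checked for uniqueness before use.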

uclahs-cds / pipeline-align-DNA

Input csv fields: redundancy ok? #143