theodorc / Atlas-CNV

Method to detect exonic CNVs in NGS Gene Targeted Panels

Requesting information on how to create sample and panel files. #9

Open vlakhujani opened 4 years ago

vlakhujani commented 4 years ago

Where can I find more information on how to create the panel and sample files?

I went through the paper and it says

The first file is not mentioned in the GitHub README.

I am really confused. Please help.

theodorc commented 4 years ago

(1) We use version 3 of the GATK software from the Broad Institute to compute the depth of coverage on a given BAM file.
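As a rough illustration of that step, the sketch below only assembles a GATK 3 DepthOfCoverage command line; it does not run it. The jar name, file names, and output prefix are placeholders, not values from this repository, and the exact options Atlas-CNV expects are not stated in this thread. When an interval list is passed with `-L`, GATK 3 writes a per-interval summary named `<prefix>.sample_interval_summary` next to the other outputs.

```python
def build_doc_command(bam, reference, intervals, out_prefix,
                      gatk_jar="GenomeAnalysisTK.jar"):
    """Return the argv list for a GATK 3 DepthOfCoverage run (nothing is executed)."""
    return [
        "java", "-jar", gatk_jar,   # gatk_jar is a placeholder path to the GATK 3 jar
        "-T", "DepthOfCoverage",    # GATK 3 walker that computes coverage
        "-R", reference,            # reference FASTA
        "-I", bam,                  # input BAM file
        "-L", intervals,            # target intervals (e.g. the panel's exon targets)
        "-o", out_prefix,           # output prefix; summaries are written as <prefix>.*
    ]


# Hypothetical file names, for illustration only.
cmd = build_doc_command("sample1.bam", "ref.fasta",
                        "panel_targets.interval_list", "GATK_DoC/sample1")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.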

vlakhujani commented 4 years ago

@theodorc

I am looking at the usage doc. Where is the GATK v3 file used as input?

Additionally, how do I create the panel file?

Exon_Target          Gene_Exon      Call_CNV  RefSeq
1:1220087-1220186    SNP_1          N         rs2144440
1:3083663-3083762    SNP_2          N         rs2651899
1:3611843-3611942    SNP_3          N         rs3765731
1:6279321-6279420    RNF207-001_18  N         rs846111
1:8487274-8487373    SNP_4          N         rs301797
1:11850737-11850955  MTHFR-001_11   Y         NM_005957_cds_0
1:11851264-11851383  MTHFR-001_10   Y         NM_005957_cds_1
1:11852335-11852436  MTHFR-001_9    Y         NM_005957_cds_2
1:11853964-11854146  MTHFR-001_8    Y         NM_005957_cds_3

What does the Gene_Exon column contain: SNP IDs or gene/exon IDs? Also, does the "RefSeq" column contain dbSNP rs IDs? Is that correct? I also see NM IDs (transcript IDs).

And finally, the Call_CNV column contains yes/no (Y/N) values. How do I make that decision?

theodorc commented 4 years ago

Sorry for the late response. Hope the comments below help.

  1. For GATK, see the config file. It has a variable that specifies the directory (and file name format) of the GATK Depth of Coverage file: GATKDIR=GATK_DoC/[SAMPLE_FCLBC].DATA.sample_interval_summary

  2. You create the panel file yourself in your favorite editor. It is usually based on the capture design you used for sequencing. For example, a cancer panel will contain cancer genes and their exon target coordinates, etc.

  3. The Gene_Exon column is the name of the target exon. In the example, I used the gene MTHFR, -001 for the transcript ID, and _11 for the exon. The same idea applies to the RefSeq column.

  4. Finally, Call_CNV designates whether you want to include the given target in the analysis. Usually you set N if you know the target is not reliable when the data is produced (i.e., the target is too small or the data is known to be noisy).
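For anyone who would rather script the panel file than type it by hand, here is a minimal sketch that follows the column layout and naming pattern from the example above. Several things are assumptions rather than anything Atlas-CNV documents: the tab delimiter, the `MIN_TARGET_BP` size cutoff, and the helper names `panel_row` and `write_panel` are all made up for illustration. Check the delimiter against the sample panel file shipped with the repository.

```python
import csv
import io

HEADER = ["Exon_Target", "Gene_Exon", "Call_CNV", "RefSeq"]
MIN_TARGET_BP = 50  # hypothetical cutoff for "target is too small"; pick your own


def panel_row(chrom, start, end, gene, transcript, exon, refseq, reliable=True):
    """Build one panel row; Call_CNV is N for small or known-noisy targets."""
    size = end - start + 1
    call = "Y" if reliable and size >= MIN_TARGET_BP else "N"
    return [f"{chrom}:{start}-{end}", f"{gene}-{transcript}_{exon}", call, refseq]


def write_panel(rows, handle):
    """Write the header and rows as tab-separated text (delimiter is an assumption)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Two targets taken from the example table in this thread.
rows = [
    panel_row("1", 11850737, 11850955, "MTHFR", "001", 11, "NM_005957_cds_0"),
    panel_row("1", 11851264, 11851383, "MTHFR", "001", 10, "NM_005957_cds_1"),
]
buffer = io.StringIO()
write_panel(rows, buffer)
print(buffer.getvalue())
```

To write an actual file, pass an open file handle (for example `open("my_panel.txt", "w", newline="")`) instead of the `StringIO` buffer. Targets you already know to be unreliable can be marked with `reliable=False`.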