hollygene / TE_MA

S. paradoxus TE MA experiment

Cross-Contamination Analysis #4

Open hollygene opened 4 years ago

hollygene commented 4 years ago

Looked at cross-contaminants in D0 samples

Summary:

Line 1 Line 2 # Shared SNPs
21 27 12
19 20 5
10 11 5
12 44 4
21 5 4
12 21 4
43 48 4
3 21 3
21 44 3
28 29 3
27 28 2
20 27 2
12 31 2
3 20 2
35 36 2
21 18 2
45 46 2
21 34 2
20 21 2
28 9 1
21 29 1
44 5 1
17 37 1
20 5 1
15 27 1
29 27 1
21 28 1
44 9 1
18 5 1
44 45 1
21 22 1
28 5 1
27 44 1
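
A quick way to see which lines recur across pairs is to tally, for each line, how many distinct partner lines it has and how many shared SNPs it is involved in. The sketch below hard-codes the rows of the summary table above; `PAIRS` and `tally_by_line` are names made up for illustration and are not part of the original analysis.

```python
from collections import defaultdict

# Rows copied from the summary table: (line_a, line_b, shared_snps)
PAIRS = [
    (21, 27, 12), (19, 20, 5), (10, 11, 5), (12, 44, 4), (21, 5, 4),
    (12, 21, 4), (43, 48, 4), (3, 21, 3), (21, 44, 3), (28, 29, 3),
    (27, 28, 2), (20, 27, 2), (12, 31, 2), (3, 20, 2), (35, 36, 2),
    (21, 18, 2), (45, 46, 2), (21, 34, 2), (20, 21, 2), (28, 9, 1),
    (21, 29, 1), (44, 5, 1), (17, 37, 1), (20, 5, 1), (15, 27, 1),
    (29, 27, 1), (21, 28, 1), (44, 9, 1), (18, 5, 1), (44, 45, 1),
    (21, 22, 1), (28, 5, 1), (27, 44, 1),
]


def tally_by_line(pairs):
    """Return {line: (number of distinct partner lines, total shared SNPs)}."""
    partners = defaultdict(set)
    shared = defaultdict(int)
    for a, b, n in pairs:
        partners[a].add(b)
        partners[b].add(a)
        shared[a] += n
        shared[b] += n
    return {line: (len(partners[line]), shared[line]) for line in partners}


tally = tally_by_line(PAIRS)
# Rank by partner count first, then by total shared SNPs.
ranked = sorted(tally.items(), key=lambda kv: kv[1], reverse=True)
for line, (n_partners, n_shared) in ranked[:5]:
    print(f"line {line}: {n_partners} partners, {n_shared} shared SNPs")
```

With the table as given, line 21 ranks first, with 11 partner lines and 35 shared SNPs in total.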

Line 21 shares SNPs with many other lines, which may point to a de-multiplexing problem. If I remove line 21, these are the results:

Without Line 21:

Line 1 Line 2 # Shared SNPs
28 9 1
28 27 1
44 5 1
17 37 1
20 5 1
44 9 1
18 5 1
45 44 1
28 5 1
27 44 1
20 27 2
12 31 2
20 3 2
29 27 2
35 36 2
45 46 2
28 29 3
12 44 4
43 48 4
20 19 5
10 11 5

Looking at lines with 1 shared SNP, found that GQ scores were low:

Without Line 21 (AD DP GQ GT are listed first for Line 1, then for Line 2):
Line 1 Line 2 # Shared SNPs CHROM POS REF ALT QUAL AD DP GQ GT AD DP GQ GT
28 9 1 Spar_II_RaGOO 553279 CA C 211.49 71,14 94 37 CA/C 67,15 95 88 CA/C
28 27 1 Spar_III_RaGOO 11841 G GT 60.16 102,15 119 40 G/GT 33,6 39 42 G/GT
44 5 1 Spar_V_RaGOO 29734 C CA 83.27 9,3 12 27 C/CA 9,3 12 27 C/CA
17 37 1 Spar_VI_RaGOO 21351 CA C 153.15 33,5 38 35 CA/C 115,18 133 94 CA/C
20 5 1 Spar_VI_RaGOO 188146 CT C 70.21 56,9 68 33 CT/C 24,5 29 44 CT/C
44 9 1 Spar_XI_RaGOO 345809 GA G 41.44 12,3 15 40 GA/G 90,13 103 58 GA/G
18 5 1 Spar_XII_RaGOO 197446 AT A 117.67 190,29 234 60 AT/A 41,6 51 2 AT/A
45 44 1 Spar_XIII_RaGOO 307069 A AT 132.01 132,22 161 44 A/AT 8,4 12 62 A/AT
28 5 1 Spar_XIII_RaGOO 906587 G GA 129.16 100,17 126 53 G/GA 36,7 44 61 G/GA
27 44 1 Spar_XIV_RaGOO 446172 C CT 80.2 17,4 24 34 C/CT 19,3 22 34 C/CT
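
To make the "GQ scores were low" observation concrete, the rows above can be screened programmatically. The following is a minimal sketch that hard-codes the AD and GQ values from the table, computes the fraction of reads supporting the alternate allele, and flags sites where either line falls below a GQ cutoff. The cutoff of 50 is purely an assumption for illustration (no cutoff had been chosen at this point in the thread), and `SITES`, `alt_fraction`, and `GQ_CUTOFF` are hypothetical names.

```python
GQ_CUTOFF = 50  # assumed value for illustration only; not a cutoff from the thread


def alt_fraction(ad):
    """Fraction of reads supporting the ALT allele, given a 'ref,alt' AD string."""
    ref, alt = (int(x) for x in ad.split(","))
    total = ref + alt
    return alt / total if total else 0.0


# (line_1, line_2, site, AD line 1, GQ line 1, AD line 2, GQ line 2), copied from the table
SITES = [
    (28, 9, "Spar_II_RaGOO:553279", "71,14", 37, "67,15", 88),
    (28, 27, "Spar_III_RaGOO:11841", "102,15", 40, "33,6", 42),
    (44, 5, "Spar_V_RaGOO:29734", "9,3", 27, "9,3", 27),
    (17, 37, "Spar_VI_RaGOO:21351", "33,5", 35, "115,18", 94),
    (20, 5, "Spar_VI_RaGOO:188146", "56,9", 33, "24,5", 44),
    (44, 9, "Spar_XI_RaGOO:345809", "12,3", 40, "90,13", 58),
    (18, 5, "Spar_XII_RaGOO:197446", "190,29", 60, "41,6", 2),
    (45, 44, "Spar_XIII_RaGOO:307069", "132,22", 44, "8,4", 62),
    (28, 5, "Spar_XIII_RaGOO:906587", "100,17", 53, "36,7", 61),
    (27, 44, "Spar_XIV_RaGOO:446172", "17,4", 34, "19,3", 34),
]

flagged = []
for line_1, line_2, site, ad_1, gq_1, ad_2, gq_2 in SITES:
    low = min(gq_1, gq_2) < GQ_CUTOFF
    if low:
        flagged.append(site)
    print(
        f"{site}: lines {line_1}/{line_2}, "
        f"alt fraction {alt_fraction(ad_1):.2f}/{alt_fraction(ad_2):.2f}, "
        f"GQ {gq_1}/{gq_2}{' (below cutoff)' if low else ''}"
    )
```

Changing `GQ_CUTOFF` shows how sensitive the flagged set is to the threshold, which is the decision still to be made here.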
Looking at lines that shared 2 SNPs, found that a particular region on chromosome IV is common:

Without Line 21 (AD DP GQ GT are listed first for Line 1, then for Line 2):
Line 1 Line 2 # Shared SNPs CHROM POS REF ALT QUAL AD DP GQ GT AD DP GQ GT
20 27 2 Spar_IV_RaGOO 526131 C CT 38 52,9 63 49 C/CT 27,4 35 3 C/CT
      Spar_IV_RaGOO 1446960 AT A 57.25 75,12 91 8 AT/A 26,7 33 87 AT/A
12 31 2 Spar_IV_RaGOO 1460056 G A 170.73 35,4 39 99 G/A 138,15 153 99 G/A
      Spar_IV_RaGOO 1460068 A G 78.13 37,3 40 15 A/G 127,12 139 99 A/G
20 3 2 Spar_IV_RaGOO 1473021 G C 310.74 80,9 89 99 G/C 100,13 113 99 G/C
      Spar_IV_RaGOO 1473039 GT G 283.69 85,9 94 99 GT/G 106,13 119 99 GT/G
29 27 2 Spar_VII_RaGOO 65449 TAC T 69.82 64,8 72 70 TAC/T 34,5 39 62 TAC/T
      Spar_XV_RaGOO 88274 CAT C 55.76 54,7 61 87 CAT/C 37,4 41 31 CAT/C
35 36 2 Spar_VII_RaGOO 72225 T C 26841.28 1,341 342 99 C/C 2,307 309 99 C/C
      Spar_XII_RaGOO 930564 G A 21982.28 1,290 291 99 A/A 0,260 260 99 A/A
45 46 2 Spar_VIII_RaGOO 282813 G T 34727.28 0,182 182 99 T/T 4,696 700 99 T/T
      Spar_XV_RaGOO 912454 C T 24484.28 0,152 152 99 T/T 0,460 460 99 T/T

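Looking at the ancestor calls at these sites: is the GQ low? Does the ancestor look heterozygous at any site where it was called homozygous?

Line 1 Line 2 # Shared SNPs CHROM POS REF ALT QUAL AD DP GQ GT (AD DP GQ GT are for the ancestor)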
20 27 2 Spar_IV_RaGOO 526131 C CT 38 145,0 145 0 C/C
      Spar_IV_RaGOO 1446960 AT A 57.25 131,13 144 99 AT/AT
12 31 2 Spar_IV_RaGOO 1460056 G A 170.73 109,0 109 99 G/G
      Spar_IV_RaGOO 1460068 A G 78.13 109,0 109 99 A/A
20 3 2 Spar_IV_RaGOO 1473021 G C 310.74 109,0 109 99 G/G
      Spar_IV_RaGOO 1473039 GT G 283.69 143,0 143 96 GT/GT
29 27 2 Spar_VII_RaGOO 65449 TAC T 69.82 102,0 102 99 TAC/TAC
      Spar_XV_RaGOO 88274 CAT C 55.76 86,0 86 99 CAT/CAT
35 36 2 Spar_VII_RaGOO 72225 T C 26841.28 106,0 106 99 T/T
      Spar_XII_RaGOO 930564 G A 21982.28 106,0 106 99 G/G
45 46 2 Spar_VIII_RaGOO 282813 G T 34727.28 109,0 109 99 G/G
      Spar_XV_RaGOO 912454 C T 24484.28 103,0 103 99 C/C

Only one site (Spar_IV_RaGOO 526131, GQ 0) has a very low GQ score in the ancestor; the rest all look like confident calls.
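
The two questions asked of the ancestor calls (is GQ low, and are there ALT-supporting reads behind a homozygous call) can be checked separately. This is a rough sketch over the ancestor values copied from the table above; the GQ cutoff of 20 is an assumed placeholder, and `ANCESTOR`, `alt_reads`, `low_gq`, and `with_alt_reads` are hypothetical names.

```python
GQ_CUTOFF = 20  # assumed placeholder; the actual cutoff had not been decided


def alt_reads(ad):
    """Number of ALT-supporting reads from a 'ref,alt' AD string."""
    return int(ad.split(",")[1])


# (site, ancestor AD, ancestor GQ, ancestor GT), copied from the ancestor table
ANCESTOR = [
    ("Spar_IV_RaGOO:526131", "145,0", 0, "C/C"),
    ("Spar_IV_RaGOO:1446960", "131,13", 99, "AT/AT"),
    ("Spar_IV_RaGOO:1460056", "109,0", 99, "G/G"),
    ("Spar_IV_RaGOO:1460068", "109,0", 99, "A/A"),
    ("Spar_IV_RaGOO:1473021", "109,0", 99, "G/G"),
    ("Spar_IV_RaGOO:1473039", "143,0", 96, "GT/GT"),
    ("Spar_VII_RaGOO:65449", "102,0", 99, "TAC/TAC"),
    ("Spar_XV_RaGOO:88274", "86,0", 99, "CAT/CAT"),
    ("Spar_VII_RaGOO:72225", "106,0", 99, "T/T"),
    ("Spar_XII_RaGOO:930564", "106,0", 99, "G/G"),
    ("Spar_VIII_RaGOO:282813", "109,0", 99, "G/G"),
    ("Spar_XV_RaGOO:912454", "103,0", 99, "C/C"),
]

# Sites where the ancestor genotype call itself is low-confidence.
low_gq = [site for site, ad, gq, gt in ANCESTOR if gq < GQ_CUTOFF]
# Sites where the ancestor has any ALT-supporting reads despite a homozygous-reference call.
with_alt_reads = [site for site, ad, gq, gt in ANCESTOR if alt_reads(ad) > 0]

print("Low GQ in ancestor:", low_gq)
print("ALT reads in ancestor:", with_alt_reads)
```

On these values the two lists do not overlap: the low-GQ site has no ALT reads in the ancestor, while Spar_IV_RaGOO 1446960 has 13 ALT reads alongside a GQ of 99, so a GQ filter alone would not catch it.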

Going to talk to Dave and decide what to filter out based on GQ scores in the ancestor. Plotted the scores as a distribution to determine what cutoffs to use:

[Image: Screen Shot 2020-04-13 at 11 22 32 AM]

_Originally posted by @hollygene in https://github.com/hollygene/TE_MA/issues/2#issuecomment-612959410_

hollygene commented 4 years ago

H0 samples

Dave worked on the dataset further and found several samples with shared SNPs (highlighted in yellow in the attached spreadsheet): H0_vcf_all_NEWCALLSv2.xlsx

hollygene commented 4 years ago

Removed samples 15, 22, 29, 36, 37, 40, 42, 46, 48, 5

hollygene commented 4 years ago

Samples that shared SNPs:

32, 37
39, 40
14, 15
28, 29
41, 42
44, 48
45, 46
4, 5
7, 22
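
As a cross-check between this list of pairs and the removed samples listed in the earlier comment, the sketch below verifies that each pair has exactly one member removed and reports any removed sample that does not appear in a pair. It only compares the two lists as written in this thread; it does not claim to reproduce how the samples to drop were chosen, and `PAIRS`, `REMOVED`, and `unresolved_pairs` are hypothetical names.

```python
# Pairs of samples that shared SNPs, as listed above.
PAIRS = [
    (32, 37), (39, 40), (14, 15), (28, 29), (41, 42),
    (44, 48), (45, 46), (4, 5), (7, 22),
]
# Samples reported as removed in the earlier comment.
REMOVED = {15, 22, 29, 36, 37, 40, 42, 46, 48, 5}


def unresolved_pairs(pairs, removed):
    """Pairs where neither sample or both samples were removed."""
    return [(a, b) for a, b in pairs if (a in removed) == (b in removed)]


unresolved = unresolved_pairs(PAIRS, REMOVED)
paired_samples = {sample for pair in PAIRS for sample in pair}
extra = REMOVED - paired_samples  # removed samples not found in any listed pair

print("Pairs without exactly one removed sample:", unresolved)
print("Removed samples not in any pair:", sorted(extra))
```

On the lists as written, every pair has exactly one removed member, and sample 36 is removed without appearing in any listed pair.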