hollygene / TE_MA

S. paradoxus TE MA experiment
0 stars 0 forks source link

Paradoxus MA Information #2

Open hollygene opened 4 years ago

hollygene commented 4 years ago

Whole-genome sequencing of S. paradoxus mutation accumulation lines DNA extracted using Zymo YeaSTAR Genomic DNA kit Libraries prepped using homemade protocol (no kit, can find out more info if needed, probably published) Sequenced on NovaSeq S4 flow cell PE 150 at Genewiz 500GB total sequencing data

Location of fastq files on Sapelo: /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq (I also have backups of this on the /project/ folder for our lab, and on an external hard drive)

So one potential problem I have found is that my samples have varying depth of coverage - I guess because they were pooled. My main concern is that my ancestral strains have ~25x coverage, while the progenitor lines have anywhere from 150x-2000x(!).

hollygene commented 4 years ago
hollygene commented 4 years ago

Hi @cbergman and @JingxuanChen7 ,

I'm trying to make a file of hard-to-map sites that I can exclude from my analysis (centromeres, telomeres, mating type region, rDNA, etc), and I found a .gff file for the centromeres but do you happen to know which (if any) files of the new 337 reference genome would be useful for this sort of thing?

For reference I'm looking in the directory: /scratch/hcm14449/TE_MA_Paradoxus/ref_genome/paradoxus/337Ref

Thank you!!! I hope you're staying healthy! -Holly

cbergman commented 4 years ago

Hi Both

Jingxuan, how hard do you think it is for Holly to run the X and Y’ element annotation LRSDAY module vs you adding this to your pipeline?

Holly, if your goal is to annotate/exclude hard to map regions from your SNP calling, it might be more straightforward to compute a 'mappability" track that identifies regions that are hard to map reads to unambiguously to regardless of their functional annotation. There are a few tools to do this that I can point you to. This strategy has the advantage of allowing you to masking all unmappable regions (telomeres, TEs, recent duplicates, etc.).

Thoughts, Casey

On Mar 18, 2020, at 16:47, Holly Celina Mcqueary hmcqueary@uga.edu wrote:

Okay, are telomere annotations alone available? Also is it okay if I use the LRSDAY package in your directory?

Thank you!! -Holly


Holly McQueary PhD Candidate

Hall Lab Department of Genetics University of Georgia From: Jingxuan Chen JingxuanChen@uga.edu Sent: Wednesday, March 18, 2020 4:33 PM To: Holly Celina Mcqueary hmcqueary@uga.edu Cc: Casey Bergman cbergman@uga.edu Subject: Re: telomere and X/Y prime elements annotation for 337

Hi Holly,

I checked LRSDAY pipeline. X and Y’ element annotations are available but not for telomere annotation. Please feel free to let me know if you need any help.

Best, Jingxuan

On Mar 18, 2020, at 4:24 PM, Holly Celina Mcqueary hmcqueary@uga.edu wrote:

Hey Jingxuan,

I realized a bit too late that I need additional annotations for 337 reference genome - specifically telomere and X and Y' elements. I'm going to try some of your pipeline and that of the protocol to generate these, and I will do it in my own directories, but I thought I would keep you updated!!

Hope all is going well for you!

Best, Holly


Holly McQueary PhD Candidate

Hall Lab Department of Genetics University of Georgia

On Mar 17, 2020, at 13:16, Holly McQueary notifications@github.com wrote:

Hi @cbergman and @JingxuanChen7 ,

I'm trying to make a file of hard-to-map sites that I can exclude from my analysis (centromeres, telomeres, mating type region, rDNA, etc), and I found a .gff file for the centromeres but do you happen to know which (if any) files of the new 337 reference genome would be useful for this sort of thing?

For reference I'm looking in the directory: /scratch/hcm14449/TE_MA_Paradoxus/ref_genome/paradoxus/337Ref

Thank you!!! I hope you're staying healthy! -Holly

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

hollygene commented 4 years ago

Hi all,

Casey, yes my exact goal is to exclude regions that are hard to map. In previous studies we've used the SGD_features.tab file from SGD (an example here https://sgd-prod-upload.s3.amazonaws.com/S000211725/SGD_features.20120121.tab.gz ), but if there is a more sophisticated way to do this (especially with our own reference genome), I would much rather do that!!

Could you recommend a tool to do this?

Thank you!! -Holly

JingxuanChen7 commented 4 years ago
cbergman commented 4 years ago
JingxuanChen7 commented 4 years ago
JingxuanChen7 commented 4 years ago

Begin forwarded message:

From: Jingxuan Chen jc33471@uga.edu<mailto:jc33471@uga.edu> Subject: Re: Questions about running GEM-library Date: March 19, 2020 at 3:38:58 PM EDT To: Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu>

Thank you so much! I also reviewed my fasta file and found empty lines. I simply remove empty lines and it works! I will modify the assembly to a more well-formatted fasta before I continue to run following steps.

Thanks, Jingxuan

On Mar 19, 2020, at 3:35 PM, Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu> wrote:

Alright, the problem seems to be that your fasta sequences are single-line (I guess RaGOO makes them like that), and GEM needs them to be multi-line. I converted your genome to multi-line and GEM indexer and mappability worked here. I used seqtk seq -A -l 80 input.fasta > output.fasta, where -l is the line width you wish to use.

Let me know if that works for you!

best, Gui

On 19 Mar 2020, at 15:24, Jingxuan Chen JingxuanChen@uga.edu<mailto:JingxuanChen@uga.edu> wrote:

Yes! The log file says fasta input does not work. The genome is here /scratch/jc33471/pilon/337/mappability/genome.337.fasta.

Thanks, Jingxuan

On Mar 19, 2020, at 3:19 PM, Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu> wrote:

Hi Jingxuan,

I tested running GEM on my side (using the sacCer3 reference genome) and it is working ok. I think this might be some problem with the fasta formatting. I remember GEM was a bit sensitive about it.

Can you point me to the file you are trying to index/calculate mappability? I can take a look.

Gui

On 19 Mar 2020, at 14:17, Jingxuan Chen JingxuanChen@uga.edu<mailto:JingxuanChen@uga.edu> wrote:

Thank you so much for your help!

Jingxuan

On Mar 19, 2020, at 2:14 PM, Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu> wrote:

Hi Jingxuan,

I don’t remember running into this issue. I’ll do some testing and get back to you!

best, Gui

On 19 Mar 2020, at 14:05, Jingxuan Chen JingxuanChen@uga.edu<mailto:JingxuanChen@uga.edu> wrote:

Hi Gui,

I am now trying to run GEMtools to compute mappability for 337 assembly using the code from your repository https://github.com/bergmanlab/guilherme/blob/dbc4fadad30170d1ea70b213574136753014084b/src/yeast_ploidy.sh#L79 I encountered a problem in the indexing step as below:

jc33471@n561 mappability$ gem-indexer -T $PBS_NP -c dna -i genome.337.fasta -o ${prefix}_index
Welcome to GEM-indexer build 1.423 (beta) - (2013/04/01 01:02:13 GMT)
 (c) 2008-2013 Paolo Ribeca <paolo.ribeca@gmail.com<mailto:paolo.ribeca@gmail.com>>
 (c) 2010-2013 Santiago Marco Sola <santiagomsola@gmail.com<mailto:santiagomsola@gmail.com>>
 (c) 2010-2013 Leonor Frias Moya <leonor.frias@gmail.com<mailto:leonor.frias@gmail.com>>
For the terms of use, run the program with the option --show-license.
************************************************************************
* WARNING: this is a beta version, provided for testing purposes only; *
*          check for updates at <http://gemlibrary.sourceforge.net<http://gemlibrary.sourceforge.net/>>.   *
************************************************************************
Creating sequence and location files...Fatal error: exception Failure("Command 'echo "-------gem-indexer_fasta2meta+cont" > 337_index.log && gem-indexer_fasta2meta+cont -i genome.337.fasta -c dna --filter-function iupac-dna --strip-unknown-bases-threshold 50 --complement-size-threshold 2000000000 --complement emulat

I did a quick search. It seems that this is a common issue among several people https://www.biostars.org/p/311569/ . Do you have any idea about this error? I was running with 10 threads and 60 GB memory.

Hope that everything goes well with you.

Thank you so much, Jingxuan

JingxuanChen7 commented 4 years ago
cbergman commented 4 years ago
hollygene commented 4 years ago

This is fantastic, thank you so much @JingxuanChen7 !

The read length should be 150 bp - it is paired-end so I guess after combining the R1 and R2 fastqs it should be 300? (I'm not confident that's how that works)

I am using raw data for my pipeline.

cbergman commented 4 years ago
hollygene commented 4 years ago

Thank you!

@cbergman Do you have a recommendation for the step in my pipeline that I remove these regions? (i.e. early like at bam stage or later at vcfs?)

JingxuanChen7 commented 4 years ago
cbergman commented 4 years ago
JingxuanChen7 commented 4 years ago
hollygene commented 4 years ago

Updates from 3/19 to present: (Had this in a Word doc, but will update on here from now on)

3/19/20

3/26/20

3/27/20

File Length D0_FullCohort.vcf 1595 Raw variants D0_reducedGEM.vcf 1042 Removed low-mappabiltiy sites D0_reducedGEM_DpGr10_Fil.vcf 827 Removed sites with depth <10 D0_reducedGEM_DpGr10_Fil_AncCalls.vcf 811 Removed sites with ancestor no-calls (./.) D0_reducedGEM_DpGr10_Fil_AncCalls_NoHets.vcf 664 Removed sites with ancestor genotype as Heterozygous

Alternative method: D0_FullCohort.vcf 1595 Raw variants D0_FullCohort_DpGr10_Fil.vcf 1128 Remove sites with depth < 10 D0_FullCohort_DpGr10_Fil_AncCalls.vcf 1094 Remove sites with ancestor no-calls (./.) D0_FullCohort_DpGr10_Fil_AncCalls_NoHets.vcf 854 Remove sites with ancestor genotype as heterozygous D0_FullCohort_DpGr10_Fil_AncCalls_NoHets_GEM.vcf 664 Removed low-mappability sites

Checked to make sure they both gave the same thing – they did (exact same results)

Using the First method, because it seems to make more sense (remove low-mapping sites first instead of at the end)

4/1/20

Chromosome I depth distribution (x axis is location on chr I, y axis is depth) Screen Shot 2020-04-01 at 1 45 29 PM

Chromosome II: Screen Shot 2020-04-01 at 1 51 51 PM

Chromosome III: Screen Shot 2020-04-01 at 1 53 51 PM

Chromosome IV: Screen Shot 2020-04-01 at 1 57 26 PM

Chromosome V: Screen Shot 2020-04-01 at 2 03 58 PM

Chromosome VI: Screen Shot 2020-04-01 at 3 09 01 PM

Chromosome VII: Screen Shot 2020-04-01 at 3 13 00 PM

Chromosome VIII: Screen Shot 2020-04-01 at 3 14 14 PM

Chromosome IX: Screen Shot 2020-04-01 at 3 15 29 PM

Chromosome X: Screen Shot 2020-04-01 at 3 19 18 PM

Chromosome XI: Screen Shot 2020-04-01 at 3 21 56 PM

Chromosome XII: Screen Shot 2020-04-01 at 3 23 41 PM

Chromosome XIII: Screen Shot 2020-04-01 at 3 28 20 PM

Chromosome XIV: Screen Shot 2020-04-01 at 3 29 32 PM

Chromosome XV: Screen Shot 2020-04-01 at 3 30 47 PM

Chromosome XVI: Screen Shot 2020-04-01 at 3 32 13 PM

Filtering of Variants D0 Low depth: 82 High depth: 206

File Length D0_FullCohort.vcf 1595 D0_noLow.vcf 962 D0_noLow_noHigh.vcf 772 D0_noLow_noHigh_redGEM.vcf 580 D0_noLow_noHigh_redGEM_AncCalls.vcf 573 D0_noLow_noHigh_redGEM_AncCalls_NoHets.vcf 471

D1 Low depth: 70 High depth: 188

File Length D1_FullCohort.vcf 2695 D1_noLow.vcf 1586 D1_noLow_noHigh.vcf 1346 D1_noLow_noHigh_redGEM.vcf 971 D1_noLow_noHigh_redGEM_AncCalls.vcf 958 D1_noLow_noHigh_redGEM_AncCalls_NoHets.vcf 812

D20 Low depth: 82 High depth: 206

File Length D20_FullCohort.vcf 1835 D20_noLow.vcf 1251 D20_noLow_noHigh.vcf 963 D20_noLow_noHigh_redGEM.vcf 741 D20_noLow_noHigh_redGEM_AncCalls.vcf 721 D20_noLow_noHigh_redGEM_AncCalls_NoHets.vcf 526

Found variants in the ancestors:

D20 ancestor variants

Chr Pos GT Type VII 61782 1/1 ALT – indel XVI 186933 1/1 ALT – indel

D1 ancestor variants

Chr Pos GT Type IV 641630 2/2 ALT – indel (het in D20 & majority of progeny) VII 61782 1/1 ALT – indel VII 986079 1/1 ALT – SNP (not in D20) XV 299062 1/1 ALT – SNP (not in D20) XVI 186933 2/2 ALT – SNP

D0 ancestor variants

Chr Pos GT Type IV 641630 2/2 ALT – indel VII 61782 1/1 ALT – indel XIII 111478 ½ ALT – indel XV 314234 ½ ALT – indel

4/3/20 Heterozygous sites shared between the 3 diploid ancestors: 40 total File: /Users/hollymcqueary/Dropbox/McQueary/Paradoxus_MA/VCFs/AncHets/D0_D1_D20_common.txt

4/7/20 H0 median depth = 153

Screen Shot 2020-04-07 at 5 51 19 PM

Filtering based on depth of 150 though, so removing anything less than 90 and greater than 218

hollygene commented 4 years ago

H0 Final Cohort Variant Info: hcm14449@n204 H0$ wc -l H0_FullCohort.vcf 1729 H0_FullCohort.vcf hcm14449@n204 H0$ pwd /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/Out/H0

1729 raw variants

Remove sites with depth less than 90: hcm14449@n204 H0$ wc -l H0_noLow.vcf 1080 H0_noLow.vcf

Remove sites with depth greater than 218 hcm14449@n204 H0$ wc -l H0_noLow_noHigh.vcf 889 H0_noLow_noHigh.vcf

Remove hard-to-map regions called by GEM: hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem.vcf 668 H0_noLow_noHigh_redGem.vcf

Removed sites with no calls in the ancestor: hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls.vcf 654 H0_noLow_noHigh_redGem_AncCalls.vcf

Removed sites with hets in the ancestor hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls_NoHets.vcf 545 H0_noLow_noHigh_redGem_AncCalls_NoHets.vcf

SNPs:

hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls_NoHets_SNPs.vcf
344 H0_noLow_noHigh_redGem_AncCalls_NoHets_SNPs.vcf

Indels: hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls_NoHets_Indels.vcf 223 H0_noLow_noHigh_redGem_AncCalls_NoHets_Indels.vcf

hollygene commented 4 years ago

Looked at cross-contaminants in D0 samples

Summary:

Line 1 Line 2 # Shared SNPs
21 27 12
19 20 5
10 11 5
12 44 4
21 5 4
12 21 4
43 48 4
3 21 3
21 44 3
28 29 3
27 28 2
20 27 2
12 31 2
3 20 2
35 36 2
21 18 2
45 46 2
21 34 2
20 21 2
28 9 1
21 29 1
44 5 1
17 37 1
20 5 1
15 27 1
29 27 1
21 28 1
44 9 1
18 5 1
44 45 1
21 22 1
28 5 1
27 44 1

Line 21 shares SNPs with a lot of other lines - problem with de-multiplexing maybe? If I remove line 21, this is the results:

Without Line 21 Line 1 Line 2 # Shared SNPs
28 9 1
28 27 1
44 5 1
17 37 1
20 5 1
44 9 1
18 5 1
45 44 1
28 5 1
27 44 1
     
20 27 2
     
12 31 2
     
20 3 2
     
29 27 2
     
35 36 2
     
45 46 2
     
28 29 3
     
     
12 44 4
43 48 4
20 19 5
10 11 5
     

Looking at lines with 1 shared SNP, found that GQ scores were low:

Without Line 21             AD DP GQ GT AD DP GQ GT
Line 1 Line 2 # Shared SNPs         Line 1       Line 2      
28 9 1 Spar_II_RaGOO 553279 CA C 211.49 71,14 94 37 CA/C 67,15 95 88 CA/C
28 27 1 Spar_III_RaGOO 11841 G GT 60.16 102,15 119 40 G/GT 33,6 39 42 G/GT
44 5 1 Spar_V_RaGOO 29734 C CA 83.27 9,3 12 27 C/CA 9,3 12 27 C/CA
17 37 1 Spar_VI_RaGOO 21351 CA C 153.15 33,5 38 35 CA/C 115,18 133 94 CA/C
20 5 1 Spar_VI_RaGOO 188146 CT C 70.21 56,9 68 33 CT/C 24,5 29 44 CT/C
44 9 1 Spar_XI_RaGOO 345809 GA G 41.44 12,3 15 40 GA/G 90,13 103 58 GA/G
18 5 1 Spar_XII_RaGOO 197446 AT A 117.67 190,29 234 60 AT/A 41,6 51 2 AT/A
45 44 1 Spar_XIII_RaGOO 307069 A AT 132.01 132,22 161 44 A/AT 8,4 12 62 A/AT
28 5 1 Spar_XIII_RaGOO 906587 G GA 129.16 100,17 126 53 G/GA 36,7 44 61 G/GA
27 44 1 Spar_XIV_RaGOO 446172 C CT 80.2 17,4 24 34 C/CT 19,3 22 34 C/CT
Looking at lines that shared 2 SNPs, found that a particular region on chromosome IV is common: Without Line 21 AD DP GQ GT AD DP GQ GT
Line 1 Line 2 # Shared SNPs Line 1 Line 2
20 27 2 Spar_IV_RaGOO 526131 C CT 38 52,9 63 49 C/CT 27,4 35 3 C/CT
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
      Spar_IV_RaGOO 1446960 AT A 57.25 75,12 91 8 AT/A 26,7 33 87 AT/A
12 31 2 Spar_IV_RaGOO 1460056 G A 170.73 35,4 39 99 G/A 138,15 153 99 G/A
      Spar_IV_RaGOO 1460068 A G 78.13 37,3 40 15 A/G 127,12 139 99 A/G
20 3 2 Spar_IV_RaGOO 1473021 G C 310.74 80,9 89 99 G/C 100,13 113 99 G/C
      Spar_IV_RaGOO 1473039 GT G 283.69 85,9 94 99 GT/G 106,13 119 99 GT/G
29 27 2 Spar_VII_RaGOO 65449 TAC T 69.82 64,8 72 70 TAC/T 34,5 39 62 TAC/T
      Spar_XV_RaGOO 88274 CAT C 55.76 54,7 61 87 CAT/C 37,4 41 31 CAT/C
35 36 2 Spar_VII_RaGOO 72225 T C 26841.28 1,341 342 99 C/C 2,307 309 99 C/C
      Spar_XII_RaGOO 930564 G A 21982.28 1,290 291 99 A/A 0,260 260 99 A/A
45 46 2 Spar_VIII_RaGOO 282813 G T 34727.28 0,182 182 99 T/T 4,696 700 99 T/T
      Spar_XV_RaGOO 912454 C T 24484.28 0,152 152 99 T/T 0,460 460 99 T/T

Looking at ancestor calls in these sites, is GQ low? Does it look heterozygous but was called homozygous?

                Ancestor      
20 27 2 Spar_IV_RaGOO 526131 C CT 38 145,0 145 0 C/C
      Spar_IV_RaGOO 1446960 AT A 57.25 131,13 144 99 AT/AT
12 31 2 Spar_IV_RaGOO 1460056 G A 170.73 109,0 109 99 G/G
      Spar_IV_RaGOO 1460068 A G 78.13 109,0 109 99 A/A
20 3 2 Spar_IV_RaGOO 1473021 G C 310.74 109,0 109 99 G/G
      Spar_IV_RaGOO 1473039 GT G 283.69 143,0 143 96 GT/GT
29 27 2 Spar_VII_RaGOO 65449 TAC T 69.82 102,0 102 99 TAC/TAC
      Spar_XV_RaGOO 88274 CAT C 55.76 86,0 86 99 CAT/CAT
35 36 2 Spar_VII_RaGOO 72225 T C 26841.28 106,0 106 99 T/T
      Spar_XII_RaGOO 930564 G A 21982.28 106,0 106 99 G/G
45 46 2 Spar_VIII_RaGOO 282813 G T 34727.28 109,0 109 99 G/G
      Spar_XV_RaGOO 912454 C T 24484.28 103,0 103 99 C/C

Only one site has a very low GQ score in the ancestor, the rest all look like confident calls.

Going to talk to Dave and decide what to filter out based on GQ scores in the ancestor: Plotted the scores in a distribution to determine what cutoffs to use:

Screen Shot 2020-04-13 at 11 22 32 AM
hollygene commented 4 years ago

Issue with a lot of the sites that appear in the cross-contaminants is that they are called heterozygous but are very clearly not (i.e. the allele depth is 110,10 and it calls it het)

Also, there are several sites where there are two alternate alleles, but this is not possible so we are removing those

Screen Shot 2020-04-14 at 1 09 07 PM
hollygene commented 4 years ago

I can go back and update this later, because I think the labeling of the columns in this file is off, but basically it looks like in H0, D0, and D1 there are less heterozygous sites than would be expected (i.e. basically none). This makes sense in haploids as a new mutation would definitely be homozygous, but the opposite should be true in diploids - where a new mutation should almost always be heterozygous. That's what is shown in D20 so we're pretty confident D20 is correct.

I am going back and re-running one of my scripts to see if I had a weird error somewhere to have caused this, and am thinking of looking at the MAT locus to differentiate between haploid and diploid and/or looking at any aneuploidy in the samples.

@cbergman and @JingxuanChen7 , if you guys have any advice on how to go about this or what could cause it (besides me being dumb at the beginning of my experiment), please let me know!

Also, once we get back into lab we can test the strains in the freezer and see if they mate (if they do = def haploid, if they don't = def diploid), we will definitely do this before we publish but would like to know computationally asap so we can move onto actual analyses.

hollygene commented 4 years ago

HML.R_depth.txt

Analyzed HMR and HML loci for coverage. Method: masked the MAT locus in genome 337 re-aligned reads only to HMR and HML loci https://github.com/hollygene/TE_MA/blob/TE_MA/maskFASTA.sh

Put resulting data into Excel, and found ratios of the coverage between the two loci (file attached). All H0 samples are haploid alpha All D0 samples are haploid alpha All D1 samples are haploid alpha D20 samples are diploid except for: D20-3, 29, 30: Haploid alpha D20-4: Haploid a (likely 1938)

My guess is that there were heterogeneous cell populations at the beginning of the experiment. The ancestor spike-in data, looks to be D20- haploid alpha D1- diploid D0-diploid H0-diploid

So, clearly there are some issues here.

For now, I am going to focus my efforts on finding locations of TEs in the samples I am confident in, and plotting them layered with coverage plots to see if there is any correlation between TE location and chromosome ploidy/breakpoints (there are several partial duplications/deletions).

There are plans to do a mating experiment with the ancestor samples in the freezer, but I am not going to worry about that yet and focus on the dataset I currently have, which is now 2 haploid Ty-less strains, 1 haploid strain with 1 Ty1, and one diploid strain with ~15 Ty1s.

Dave is working on fixing some of the issues caused by the variant caller relating to allele frequencies and weird calls (i.e. some of them are 80/20 called homozygous, some are 90/10 called heterozygous?)

JingxuanChen7 commented 4 years ago

Re-run whole dataset with McClintock

LTR