Paradoxus MA Information

hollygene commented 5 years ago

Whole-genome sequencing of S. paradoxus mutation accumulation lines DNA extracted using Zymo YeaSTAR Genomic DNA kit Libraries prepped using homemade protocol (no kit, can find out more info if needed, probably published) Sequenced on NovaSeq S4 flow cell PE 150 at Genewiz 500GB total sequencing data

Location of fastq files on Sapelo: /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq (I also have backups of this on the /project/ folder for our lab, and on an external hard drive)

So one potential problem I have found is that my samples have varying depth of coverage - I guess because they were pooled. My main concern is that my ancestral strains have ~25x coverage, while the progenitor lines have anywhere from 150x-2000x(!).

hollygene commented 4 years ago

I'll talk to Dave today or tomorrow about this, and see what he thinks about re-sequencing the ancestors, but I think it might be a good idea (maybe even re-extracting the DNA?).
I will think about the masking mat locus experiment and come up with some ideas, and I will let you know
I was wondering about the Ty2 copy numbers, so it's probably a good idea to take those out of there since it was kind of confusing
I will start SNP calling on @JingxuanChen7 's bam files and will post updates when they are done!

hollygene commented 4 years ago

Hi @cbergman and @JingxuanChen7 ,

I'm trying to make a file of hard-to-map sites that I can exclude from my analysis (centromeres, telomeres, mating type region, rDNA, etc), and I found a .gff file for the centromeres but do you happen to know which (if any) files of the new 337 reference genome would be useful for this sort of thing?

For reference I'm looking in the directory: /scratch/hcm14449/TE_MA_Paradoxus/ref_genome/paradoxus/337Ref

Thank you!!! I hope you're staying healthy! -Holly

cbergman commented 4 years ago

Hi Both

Jingxuan, how hard do you think it is for Holly to run the X and Y’ element annotation LRSDAY module vs you adding this to your pipeline?

Holly, if your goal is to annotate/exclude hard to map regions from your SNP calling, it might be more straightforward to compute a 'mappability" track that identifies regions that are hard to map reads to unambiguously to regardless of their functional annotation. There are a few tools to do this that I can point you to. This strategy has the advantage of allowing you to masking all unmappable regions (telomeres, TEs, recent duplicates, etc.).

Thoughts, Casey

On Mar 18, 2020, at 16:47, Holly Celina Mcqueary hmcqueary@uga.edu wrote:

Okay, are telomere annotations alone available? Also is it okay if I use the LRSDAY package in your directory?

Thank you!! -Holly

Holly McQueary PhD Candidate

Hall Lab Department of Genetics University of Georgia From: Jingxuan Chen JingxuanChen@uga.edu Sent: Wednesday, March 18, 2020 4:33 PM To: Holly Celina Mcqueary hmcqueary@uga.edu Cc: Casey Bergman cbergman@uga.edu Subject: Re: telomere and X/Y prime elements annotation for 337

Hi Holly,

I checked LRSDAY pipeline. X and Y’ element annotations are available but not for telomere annotation. Please feel free to let me know if you need any help.

Best, Jingxuan

On Mar 18, 2020, at 4:24 PM, Holly Celina Mcqueary hmcqueary@uga.edu wrote:

Hey Jingxuan,

I realized a bit too late that I need additional annotations for 337 reference genome - specifically telomere and X and Y' elements. I'm going to try some of your pipeline and that of the protocol to generate these, and I will do it in my own directories, but I thought I would keep you updated!!

Hope all is going well for you!

Best, Holly

Holly McQueary PhD Candidate

Hall Lab Department of Genetics University of Georgia

On Mar 17, 2020, at 13:16, Holly McQueary notifications@github.com wrote:

Hi @cbergman and @JingxuanChen7 ,

I'm trying to make a file of hard-to-map sites that I can exclude from my analysis (centromeres, telomeres, mating type region, rDNA, etc), and I found a .gff file for the centromeres but do you happen to know which (if any) files of the new 337 reference genome would be useful for this sort of thing?

For reference I'm looking in the directory: /scratch/hcm14449/TE_MA_Paradoxus/ref_genome/paradoxus/337Ref

Thank you!!! I hope you're staying healthy! -Holly

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

hollygene commented 4 years ago

Hi all,

Casey, yes my exact goal is to exclude regions that are hard to map. In previous studies we've used the SGD_features.tab file from SGD (an example here https://sgd-prod-upload.s3.amazonaws.com/S000211725/SGD_features.20120121.tab.gz ), but if there is a more sophisticated way to do this (especially with our own reference genome), I would much rather do that!!

Could you recommend a tool to do this?

Thank you!! -Holly

JingxuanChen7 commented 4 years ago

I think the pipeline is pretty straight forward. The only small issue might be software installation. Both works for me.

cbergman commented 4 years ago

Let's try computing a mappability track first. It is computable from the reference genome only and allows a relatively unbiased masking of the genome that doesn't require us to annotate X and Y’ element, telomers, etc.
Gui computes mappability tracks using the GEM toolbox. Here is an example: https://github.com/bergmanlab/guilherme/blob/dbc4fadad30170d1ea70b213574136753014084b/src/yeast_ploidy.sh#L79
Could you work together to see if this makes sense and get in touch with Gui if you have any questions?
Thanks!

JingxuanChen7 commented 4 years ago

I tried to GEM toolbox and encountered bug at the indexing step:


jc33471@n561 mappability$ gem-indexer -T $PBS_NP -c dna -i genome.337.fasta -o ${prefix}_index
Welcome to GEM-indexer build 1.423 (beta) - (2013/04/01 01:02:13 GMT)
(c) 2008-2013 Paolo Ribeca <paolo.ribeca@gmail.com>
(c) 2010-2013 Santiago Marco Sola <santiagomsola@gmail.com>
(c) 2010-2013 Leonor Frias Moya <leonor.frias@gmail.com>
For the terms of use, run the program with the option --show-license.
************************************************************************

WARNING: this is a beta version, provided for testing purposes only; *
check for updates at http://gemlibrary.sourceforge.net. *

Creating sequence and location files...Fatal error: exception Failure("Command 'echo "-------gem-indexer_fasta2meta+cont" > 337_index.log && gem-indexer_fasta2meta+cont -i genome.337.fasta -c dna --filter-function iupac-dna --strip-unknown-bases-threshold 50 --complement-size-threshold 2000000000 --complement emulat
I have emailed Gui about this question.

JingxuanChen7 commented 4 years ago

Problem solved. It seems that GEM is sensitive with fasta format. The genome assembly contains empty lines. I remove empty lines and GEM works fine.

Begin forwarded message:

From: Jingxuan Chen jc33471@uga.edu<mailto:jc33471@uga.edu> Subject: Re: Questions about running GEM-library Date: March 19, 2020 at 3:38:58 PM EDT To: Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu>

Thank you so much! I also reviewed my fasta file and found empty lines. I simply remove empty lines and it works! I will modify the assembly to a more well-formatted fasta before I continue to run following steps.

Thanks, Jingxuan

On Mar 19, 2020, at 3:35 PM, Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu> wrote:

Alright, the problem seems to be that your fasta sequences are single-line (I guess RaGOO makes them like that), and GEM needs them to be multi-line. I converted your genome to multi-line and GEM indexer and mappability worked here. I used seqtk seq -A -l 80 input.fasta > output.fasta, where -l is the line width you wish to use.

Let me know if that works for you!

best, Gui

On 19 Mar 2020, at 15:24, Jingxuan Chen JingxuanChen@uga.edu<mailto:JingxuanChen@uga.edu> wrote:

Yes! The log file says fasta input does not work. The genome is here /scratch/jc33471/pilon/337/mappability/genome.337.fasta.

Thanks, Jingxuan

On Mar 19, 2020, at 3:19 PM, Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu> wrote:

Hi Jingxuan,

I tested running GEM on my side (using the sacCer3 reference genome) and it is working ok. I think this might be some problem with the fasta formatting. I remember GEM was a bit sensitive about it.

Can you point me to the file you are trying to index/calculate mappability? I can take a look.

Gui

On 19 Mar 2020, at 14:17, Jingxuan Chen JingxuanChen@uga.edu<mailto:JingxuanChen@uga.edu> wrote:

Thank you so much for your help!

Jingxuan

On Mar 19, 2020, at 2:14 PM, Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu> wrote:

Hi Jingxuan,

I don’t remember running into this issue. I’ll do some testing and get back to you!

best, Gui

On 19 Mar 2020, at 14:05, Jingxuan Chen JingxuanChen@uga.edu<mailto:JingxuanChen@uga.edu> wrote:

Hi Gui,

I am now trying to run GEMtools to compute mappability for 337 assembly using the code from your repository https://github.com/bergmanlab/guilherme/blob/dbc4fadad30170d1ea70b213574136753014084b/src/yeast_ploidy.sh#L79 I encountered a problem in the indexing step as below:

jc33471@n561 mappability$ gem-indexer -T $PBS_NP -c dna -i genome.337.fasta -o ${prefix}_index
Welcome to GEM-indexer build 1.423 (beta) - (2013/04/01 01:02:13 GMT)
 (c) 2008-2013 Paolo Ribeca <paolo.ribeca@gmail.com<mailto:paolo.ribeca@gmail.com>>
 (c) 2010-2013 Santiago Marco Sola <santiagomsola@gmail.com<mailto:santiagomsola@gmail.com>>
 (c) 2010-2013 Leonor Frias Moya <leonor.frias@gmail.com<mailto:leonor.frias@gmail.com>>
For the terms of use, run the program with the option --show-license.
************************************************************************
* WARNING: this is a beta version, provided for testing purposes only; *
*          check for updates at <http://gemlibrary.sourceforge.net<http://gemlibrary.sourceforge.net/>>.   *
************************************************************************
Creating sequence and location files...Fatal error: exception Failure("Command 'echo "-------gem-indexer_fasta2meta+cont" > 337_index.log && gem-indexer_fasta2meta+cont -i genome.337.fasta -c dna --filter-function iupac-dna --strip-unknown-bases-threshold 50 --complement-size-threshold 2000000000 --complement emulat

I did a quick search. It seems that this is a common issue among several people https://www.biostars.org/p/311569/ . Do you have any idea about this error? I was running with 10 threads and 60 GB memory.

Hope that everything goes well with you.

Thank you so much, Jingxuan

JingxuanChen7 commented 4 years ago

Computing mappability pipeline works. All the outputs (read length 300) here /scratch/jc33471/pilon/337/mappability
Scripts are the same as Gui's. https://github.com/bergmanlab/jingxuan/commit/2cda50d99e55d6c76222e71af4c8a28903bfabec
e.g. First 2 tracks show the mappability. 3rd track highlights the sites with <0.9 mappability.
@hollygene Do you have any idea about the average read length for your old and new sequencing run? We might need to change the read length parameter. (Test run is 300)

cbergman commented 4 years ago

Awesome, great work @JingxuanChen7.
You should be able to compute the average read length from the raw data if necessary.
Are you both using raw or trimmed data as input for your pipelines? If it is raw, read lengths should be the same for all samples. If it is trimmed, average read length should vary from sample to sample. Also, we may need to set a lower trimmed read length limit that we then use for computing mappability tracks.

hollygene commented 4 years ago

This is fantastic, thank you so much @JingxuanChen7 !

The read length should be 150 bp - it is paired-end so I guess after combining the R1 and R2 fastqs it should be 300? (I'm not confident that's how that works)

I am using raw data for my pipeline.

cbergman commented 4 years ago

The read length for GEM should be the single ended read length (since mapping takes places on a single read basis; paired end information is used secondarily as metadata in the BAM file)
@JingxuanChen7, I would make a new mappability track with 150 bp as the setting for @hollygene to use. Could you also provide Holly with a set of intervals in BED format that have low mappability and should be excluded from her MA analysis.

hollygene commented 4 years ago

Thank you!

@cbergman Do you have a recommendation for the step in my pipeline that I remove these regions? (i.e. early like at bam stage or later at vcfs?)

JingxuanChen7 commented 4 years ago

Change read length to 150 and rerun the pipeline. Results here /scratch/jc33471/pilon/337/mappability
337_lowmappability.bed is the regions with mappability < 0.9.
script to compute mappability: https://github.com/bergmanlab/jingxuan/blob/c314cfe85c7f85b1cff05b28332ef40a072a18b3/src/shell/337_annotation.sh#L349

cbergman commented 4 years ago

Thanks @JingxuanChen7. Can you post a link to the scripts used for this analysis?
@hollygene I would remove these regions at the VCF stage. Let the reads get mapped genome wide, call variants genome wide, then restrict your analysis to variant only in high-mappability regions.

JingxuanChen7 commented 4 years ago

Part of script here: https://github.com/bergmanlab/jingxuan/blob/c314cfe85c7f85b1cff05b28332ef40a072a18b3/src/shell/337_annotation.sh#L349
The previous post also edited.

hollygene commented 4 years ago

Updates from 3/19 to present: (Had this in a Word doc, but will update on here from now on)

3/19/20

Using vcfR to look at quality control and parse VCFs
D0: read depth by sample violinD0.pdf heatmapD0.pdf

3/26/20

Looking at various filtering parameters, to determine what filtering criteria I should include in my analysis
Guidance from https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants o GATK suggests using the following guidelines (with guidance from your own dataset based on the various plots you generate from the annotations in the vcf file
My annotation distributions, from D0 dataset:
Guidelines in Otto paper were:
Sites were excluded from further analysis if >10% of samples had no data or if the average depth across samples was >2 x the average median depth for that chromosome. Variant calls for a sample were also excluded if sample depth was <4 in haploids, <8 in diploids, if sample depth was >2 x the median depth for that chromosome and sample, if map quality was <50, if strand bias was high (P < 0.01), or if a putative variant was found in >1 reads in other samples.
VCF files that I have don’t have a P-value for strand bias, but GATK recommends removing sites with SOR greater than 60
Need to ask Dave about what cutoff to use for sample depth

3/27/20

Filtered D0 cohort fully using two methods to make sure they gave me the same result: o Remove low-mappability sites first, then filter based on my criteria o Filter based on my criteria first, then remove low-mappability sites o Both gave me the exact same results

File Length D0_FullCohort.vcf 1595 Raw variants D0_reducedGEM.vcf 1042 Removed low-mappabiltiy sites D0_reducedGEM_DpGr10_Fil.vcf 827 Removed sites with depth <10 D0_reducedGEM_DpGr10_Fil_AncCalls.vcf 811 Removed sites with ancestor no-calls (./.) D0_reducedGEM_DpGr10_Fil_AncCalls_NoHets.vcf 664 Removed sites with ancestor genotype as Heterozygous

Alternative method: D0_FullCohort.vcf 1595 Raw variants D0_FullCohort_DpGr10_Fil.vcf 1128 Remove sites with depth < 10 D0_FullCohort_DpGr10_Fil_AncCalls.vcf 1094 Remove sites with ancestor no-calls (./.) D0_FullCohort_DpGr10_Fil_AncCalls_NoHets.vcf 854 Remove sites with ancestor genotype as heterozygous D0_FullCohort_DpGr10_Fil_AncCalls_NoHets_GEM.vcf 664 Removed low-mappability sites

Checked to make sure they both gave the same thing – they did (exact same results)

Using the First method, because it seems to make more sense (remove low-mapping sites first instead of at the end)

4/1/20

Plotting depth distribution of D0 ancestor
Median depth looks ~140
Mean depth is 179, but this causes the depth cutoffs to be very stringent and so ~1000 variants were removed based on low depth alone
Going to use median depth to calculate depth cutoffs instead of mean as there are several high depth sites that are skewing the mean

Chromosome I depth distribution (x axis is location on chr I, y axis is depth) Screen Shot 2020-04-01 at 1 45 29 PM

Chromosome II: Screen Shot 2020-04-01 at 1 51 51 PM

Chromosome III: Screen Shot 2020-04-01 at 1 53 51 PM

Chromosome IV: Screen Shot 2020-04-01 at 1 57 26 PM

Chromosome V: Screen Shot 2020-04-01 at 2 03 58 PM

Chromosome VI: Screen Shot 2020-04-01 at 3 09 01 PM

Chromosome VII: Screen Shot 2020-04-01 at 3 13 00 PM

Chromosome VIII: Screen Shot 2020-04-01 at 3 14 14 PM

Chromosome IX: Screen Shot 2020-04-01 at 3 15 29 PM

Chromosome X: Screen Shot 2020-04-01 at 3 19 18 PM

Chromosome XI: Screen Shot 2020-04-01 at 3 21 56 PM

Chromosome XII: Screen Shot 2020-04-01 at 3 23 41 PM

Chromosome XIII: Screen Shot 2020-04-01 at 3 28 20 PM

Chromosome XIV: Screen Shot 2020-04-01 at 3 29 32 PM

Chromosome XV: Screen Shot 2020-04-01 at 3 30 47 PM

Chromosome XVI: Screen Shot 2020-04-01 at 3 32 13 PM

Filtering of Variants D0 Low depth: 82 High depth: 206

File Length D0_FullCohort.vcf 1595 D0_noLow.vcf 962 D0_noLow_noHigh.vcf 772 D0_noLow_noHigh_redGEM.vcf 580 D0_noLow_noHigh_redGEM_AncCalls.vcf 573 D0_noLow_noHigh_redGEM_AncCalls_NoHets.vcf 471

D1 Low depth: 70 High depth: 188

File Length D1_FullCohort.vcf 2695 D1_noLow.vcf 1586 D1_noLow_noHigh.vcf 1346 D1_noLow_noHigh_redGEM.vcf 971 D1_noLow_noHigh_redGEM_AncCalls.vcf 958 D1_noLow_noHigh_redGEM_AncCalls_NoHets.vcf 812

D20 Low depth: 82 High depth: 206

File Length D20_FullCohort.vcf 1835 D20_noLow.vcf 1251 D20_noLow_noHigh.vcf 963 D20_noLow_noHigh_redGEM.vcf 741 D20_noLow_noHigh_redGEM_AncCalls.vcf 721 D20_noLow_noHigh_redGEM_AncCalls_NoHets.vcf 526

Found variants in the ancestors:

D20 ancestor variants

Chr Pos GT Type VII 61782 1/1 ALT – indel XVI 186933 1/1 ALT – indel

D1 ancestor variants

Chr Pos GT Type IV 641630 2/2 ALT – indel (het in D20 & majority of progeny) VII 61782 1/1 ALT – indel VII 986079 1/1 ALT – SNP (not in D20) XV 299062 1/1 ALT – SNP (not in D20) XVI 186933 2/2 ALT – SNP

D0 ancestor variants

Chr Pos GT Type IV 641630 2/2 ALT – indel VII 61782 1/1 ALT – indel XIII 111478 ½ ALT – indel XV 314234 ½ ALT – indel

4/3/20 Heterozygous sites shared between the 3 diploid ancestors: 40 total File: /Users/hollymcqueary/Dropbox/McQueary/Paradoxus_MA/VCFs/AncHets/D0_D1_D20_common.txt

4/7/20 H0 median depth = 153

Filtering based on depth of 150 though, so removing anything less than 90 and greater than 218

hollygene commented 4 years ago

H0 Final Cohort Variant Info: hcm14449@n204 H0$ wc -l H0_FullCohort.vcf 1729 H0_FullCohort.vcf hcm14449@n204 H0$ pwd /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/Out/H0

1729 raw variants

Remove sites with depth less than 90: hcm14449@n204 H0$ wc -l H0_noLow.vcf 1080 H0_noLow.vcf

Remove sites with depth greater than 218 hcm14449@n204 H0$ wc -l H0_noLow_noHigh.vcf 889 H0_noLow_noHigh.vcf

Remove hard-to-map regions called by GEM: hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem.vcf 668 H0_noLow_noHigh_redGem.vcf

Removed sites with no calls in the ancestor: hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls.vcf 654 H0_noLow_noHigh_redGem_AncCalls.vcf

Removed sites with hets in the ancestor hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls_NoHets.vcf 545 H0_noLow_noHigh_redGem_AncCalls_NoHets.vcf

SNPs:

hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls_NoHets_SNPs.vcf
344 H0_noLow_noHigh_redGem_AncCalls_NoHets_SNPs.vcf

Indels: hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls_NoHets_Indels.vcf 223 H0_noLow_noHigh_redGem_AncCalls_NoHets_Indels.vcf

hollygene commented 4 years ago

Looked at cross-contaminants in D0 samples

Summary:

Line 1	Line 2	# Shared SNPs
21	27	12
19	20	5
10	11	5
12	44	4
21	5	4
12	21	4
43	48	4
3	21	3
21	44	3
28	29	3
27	28	2
20	27	2
12	31	2
3	20	2
35	36	2
21	18	2
45	46	2
21	34	2
20	21	2
28	9	1
21	29	1
44	5	1
17	37	1
20	5	1
15	27	1
29	27	1
21	28	1
44	9	1
18	5	1
44	45	1
21	22	1
28	5	1
27	44	1

Line 21 shares SNPs with a lot of other lines - problem with de-multiplexing maybe? If I remove line 21, this is the results:

Without Line 21 Line 1	Line 2	# Shared SNPs
28	9	1
28	27	1
44	5	1
17	37	1
20	5	1
44	9	1
18	5	1
45	44	1
28	5	1
27	44	1

20	27	2

12	31	2

20	3	2

29	27	2

35	36	2

45	46	2

28	29	3


12	44	4
43	48	4
20	19	5
10	11	5

Looking at lines with 1 shared SNP, found that GQ scores were low:

Without Line 21							AD	DP	GQ	GT	AD	DP	GQ	GT
Line 1	Line 2	# Shared SNPs					Line 1				Line 2
28	9	1	Spar_II_RaGOO	553279	CA	C	211.49	71,14	94	37	CA/C	67,15	95	88	CA/C
28	27	1	Spar_III_RaGOO	11841	G	GT	60.16	102,15	119	40	G/GT	33,6	39	42	G/GT
44	5	1	Spar_V_RaGOO	29734	C	CA	83.27	9,3	12	27	C/CA	9,3	12	27	C/CA
17	37	1	Spar_VI_RaGOO	21351	CA	C	153.15	33,5	38	35	CA/C	115,18	133	94	CA/C
20	5	1	Spar_VI_RaGOO	188146	CT	C	70.21	56,9	68	33	CT/C	24,5	29	44	CT/C
44	9	1	Spar_XI_RaGOO	345809	GA	G	41.44	12,3	15	40	GA/G	90,13	103	58	GA/G
18	5	1	Spar_XII_RaGOO	197446	AT	A	117.67	190,29	234	60	AT/A	41,6	51	2	AT/A
45	44	1	Spar_XIII_RaGOO	307069	A	AT	132.01	132,22	161	44	A/AT	8,4	12	62	A/AT
28	5	1	Spar_XIII_RaGOO	906587	G	GA	129.16	100,17	126	53	G/GA	36,7	44	61	G/GA
27	44	1	Spar_XIV_RaGOO	446172	C	CT	80.2	17,4	24	34	C/CT	19,3	22	34	C/CT

Looking at lines that shared 2 SNPs, found that a particular region on chromosome IV is common: Without Line 21							AD	DP	GQ	GT	AD	DP	GQ	GT
Line 1	Line 2	# Shared SNPs					Line 1				Line 2
20	27	2	Spar_IV_RaGOO	526131	C	CT	38	52,9	63	49	C/CT	27,4	35	3	C/CT
--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--
			Spar_IV_RaGOO	1446960	AT	A	57.25	75,12	91	8	AT/A	26,7	33	87	AT/A
12	31	2	Spar_IV_RaGOO	1460056	G	A	170.73	35,4	39	99	G/A	138,15	153	99	G/A
			Spar_IV_RaGOO	1460068	A	G	78.13	37,3	40	15	A/G	127,12	139	99	A/G
20	3	2	Spar_IV_RaGOO	1473021	G	C	310.74	80,9	89	99	G/C	100,13	113	99	G/C
			Spar_IV_RaGOO	1473039	GT	G	283.69	85,9	94	99	GT/G	106,13	119	99	GT/G
29	27	2	Spar_VII_RaGOO	65449	TAC	T	69.82	64,8	72	70	TAC/T	34,5	39	62	TAC/T
			Spar_XV_RaGOO	88274	CAT	C	55.76	54,7	61	87	CAT/C	37,4	41	31	CAT/C
35	36	2	Spar_VII_RaGOO	72225	T	C	26841.28	1,341	342	99	C/C	2,307	309	99	C/C
			Spar_XII_RaGOO	930564	G	A	21982.28	1,290	291	99	A/A	0,260	260	99	A/A
45	46	2	Spar_VIII_RaGOO	282813	G	T	34727.28	0,182	182	99	T/T	4,696	700	99	T/T
			Spar_XV_RaGOO	912454	C	T	24484.28	0,152	152	99	T/T	0,460	460	99	T/T

Looking at ancestor calls in these sites, is GQ low? Does it look heterozygous but was called homozygous?

								Ancestor
20	27	2	Spar_IV_RaGOO	526131	C	CT	38	145,0	145	0	C/C
			Spar_IV_RaGOO	1446960	AT	A	57.25	131,13	144	99	AT/AT
12	31	2	Spar_IV_RaGOO	1460056	G	A	170.73	109,0	109	99	G/G
			Spar_IV_RaGOO	1460068	A	G	78.13	109,0	109	99	A/A
20	3	2	Spar_IV_RaGOO	1473021	G	C	310.74	109,0	109	99	G/G
			Spar_IV_RaGOO	1473039	GT	G	283.69	143,0	143	96	GT/GT
29	27	2	Spar_VII_RaGOO	65449	TAC	T	69.82	102,0	102	99	TAC/TAC
			Spar_XV_RaGOO	88274	CAT	C	55.76	86,0	86	99	CAT/CAT
35	36	2	Spar_VII_RaGOO	72225	T	C	26841.28	106,0	106	99	T/T
			Spar_XII_RaGOO	930564	G	A	21982.28	106,0	106	99	G/G
45	46	2	Spar_VIII_RaGOO	282813	G	T	34727.28	109,0	109	99	G/G
			Spar_XV_RaGOO	912454	C	T	24484.28	103,0	103	99	C/C

Only one site has a very low GQ score in the ancestor, the rest all look like confident calls.

Going to talk to Dave and decide what to filter out based on GQ scores in the ancestor: Plotted the scores in a distribution to determine what cutoffs to use:

hollygene commented 4 years ago

Issue with a lot of the sites that appear in the cross-contaminants is that they are called heterozygous but are very clearly not (i.e. the allele depth is 110,10 and it calls it het)

Dave is going to work on this - used a script for gene conversion project that solved this issue

Also, there are several sites where there are two alternate alleles, but this is not possible so we are removing those

Used --restrict-alleles-to BIALLELIC in GATK's SelectVariants
For D0: decreased total variant sites from 471 to 443 in VCF file

hollygene commented 4 years ago

After some analysis and distributions of allele frequencies, Dave noticed an issue with the samples in which 3 out of the 4 of them look to be haploid Based on the following table:

I can go back and update this later, because I think the labeling of the columns in this file is off, but basically it looks like in H0, D0, and D1 there are less heterozygous sites than would be expected (i.e. basically none). This makes sense in haploids as a new mutation would definitely be homozygous, but the opposite should be true in diploids - where a new mutation should almost always be heterozygous. That's what is shown in D20 so we're pretty confident D20 is correct.

I am going back and re-running one of my scripts to see if I had a weird error somewhere to have caused this, and am thinking of looking at the MAT locus to differentiate between haploid and diploid and/or looking at any aneuploidy in the samples.

@cbergman and @JingxuanChen7 , if you guys have any advice on how to go about this or what could cause it (besides me being dumb at the beginning of my experiment), please let me know!

Also, once we get back into lab we can test the strains in the freezer and see if they mate (if they do = def haploid, if they don't = def diploid), we will definitely do this before we publish but would like to know computationally asap so we can move onto actual analyses.

hollygene commented 4 years ago

HML.R_depth.txt

Analyzed HMR and HML loci for coverage. Method: masked the MAT locus in genome 337 re-aligned reads only to HMR and HML loci https://github.com/hollygene/TE_MA/blob/TE_MA/maskFASTA.sh

Put resulting data into Excel, and found ratios of the coverage between the two loci (file attached). All H0 samples are haploid alpha All D0 samples are haploid alpha All D1 samples are haploid alpha D20 samples are diploid except for: D20-3, 29, 30: Haploid alpha D20-4: Haploid a (likely 1938)

My guess is that there were heterogeneous cell populations at the beginning of the experiment. The ancestor spike-in data, looks to be D20- haploid alpha D1- diploid D0-diploid H0-diploid

So, clearly there are some issues here.

For now, I am going to focus my efforts on finding locations of TEs in the samples I am confident in, and plotting them layered with coverage plots to see if there is any correlation between TE location and chromosome ploidy/breakpoints (there are several partial duplications/deletions).

There are plans to do a mating experiment with the ancestor samples in the freezer, but I am not going to worry about that yet and focus on the dataset I currently have, which is now 2 haploid Ty-less strains, 1 haploid strain with 1 Ty1, and one diploid strain with ~15 Ty1s.

Dave is working on fixing some of the issues caused by the variant caller relating to allele frequencies and weird calls (i.e. some of them are 80/20 called homozygous, some are 90/10 called heterozygous?)

JingxuanChen7 commented 4 years ago

Re-run whole dataset with McClintock

Finished re-run the whole renamed dataset. Outputs for McC here /scratch/jc33471/paradoxusHolly/run0727/out
I run coverage and TE-locate module.
Re-generating the bar plot of Ty copy number estimation for internal region and LTR.
@hollygene , I am not sure how you would present these results. So please let me know if you want to make any modification on these plots. Also feel free to use my R code if you want to do any modification on your own.

LTR

The copy number estimation results used for plotting here: /scratch/jc33471/paradoxusHolly/run0727/internal.tsv and /scratch/jc33471/paradoxusHolly/run0727/LTR.tsv
R script: https://github.com/bergmanlab/jingxuan/blob/master/src/rscripts/cov_337.R
shell script: https://github.com/bergmanlab/jingxuan/blob/master/src/shell/holly_sep_rep.sh
commit: https://github.com/bergmanlab/jingxuan/commit/072e5972f9cf7c230e7568d39d169d6c7f6dad00

hollygene / TE_MA

Paradoxus MA Information #2

Re-run whole dataset with McClintock

Line 1	Line 2	# Shared SNPs
21	27	12
19	20	5
10	11	5
12	44	4
21	5	4
12	21	4
43	48	4
3	21	3
21	44	3
28	29	3
27	28	2
20	27	2
12	31	2
3	20	2
35	36	2
21	18	2
45	46	2
21	34	2
20	21	2
28	9	1
21	29	1
44	5	1
17	37	1
20	5	1
15	27	1
29	27	1
21	28	1
44	9	1
18	5	1
44	45	1
21	22	1
28	5	1
27	44	1

Without Line 21 Line 1	Line 2	# Shared SNPs
28	9	1
28	27	1
44	5	1
17	37	1
20	5	1
44	9	1
18	5	1
45	44	1
28	5	1
27	44	1

20	27	2

12	31	2

20	3	2

29	27	2

35	36	2

45	46	2

28	29	3


12	44	4
43	48	4
20	19	5
10	11	5

Line 1	Line 2	# Shared SNPs
21	27	12
19	20	5
10	11	5
12	44	4
21	5	4
12	21	4
43	48	4
3	21	3
21	44	3
28	29	3
27	28	2
20	27	2
12	31	2
3	20	2
35	36	2
21	18	2
45	46	2
21	34	2
20	21	2
28	9	1
21	29	1
44	5	1
17	37	1
20	5	1
15	27	1
29	27	1
21	28	1
44	9	1
18	5	1
44	45	1
21	22	1
28	5	1
27	44	1

Without Line 21 Line 1	Line 2	# Shared SNPs
28	9	1
28	27	1
44	5	1
17	37	1
20	5	1
44	9	1
18	5	1
45	44	1
28	5	1
27	44	1

20	27	2

12	31	2

20	3	2

29	27	2

35	36	2

45	46	2

28	29	3


12	44	4
43	48	4
20	19	5
10	11	5

Line 1	Line 2	# Shared SNPs
21	27	12
19	20	5
10	11	5
12	44	4
21	5	4
12	21	4
43	48	4
3	21	3
21	44	3
28	29	3
27	28	2
20	27	2
12	31	2
3	20	2
35	36	2
21	18	2
45	46	2
21	34	2
20	21	2
28	9	1
21	29	1
44	5	1
17	37	1
20	5	1
15	27	1
29	27	1
21	28	1
44	9	1
18	5	1
44	45	1
21	22	1
28	5	1
27	44	1

Without Line 21 Line 1	Line 2	# Shared SNPs
28	9	1
28	27	1
44	5	1
17	37	1
20	5	1
44	9	1
18	5	1
45	44	1
28	5	1
27	44	1

20	27	2

12	31	2

20	3	2

29	27	2

35	36	2

45	46	2

28	29	3


12	44	4
43	48	4
20	19	5
10	11	5