Open hollygene opened 5 years ago
I'll talk to Dave today or tomorrow about this, and see what he thinks about re-sequencing the ancestors, but I think it might be a good idea (maybe even re-extracting the DNA?).
I will think about the masking mat locus experiment and come up with some ideas, and I will let you know
I was wondering about the Ty2 copy numbers, so it's probably a good idea to take those out of there since it was kind of confusing
I will start SNP calling on @JingxuanChen7 's bam files and will post updates when they are done!
Hi @cbergman and @JingxuanChen7 ,
I'm trying to make a file of hard-to-map sites that I can exclude from my analysis (centromeres, telomeres, mating type region, rDNA, etc), and I found a .gff file for the centromeres but do you happen to know which (if any) files of the new 337 reference genome would be useful for this sort of thing?
For reference I'm looking in the directory: /scratch/hcm14449/TE_MA_Paradoxus/ref_genome/paradoxus/337Ref
Thank you!!! I hope you're staying healthy! -Holly
Hi Both
Jingxuan, how hard do you think it is for Holly to run the X and Y’ element annotation LRSDAY module vs you adding this to your pipeline?
Holly, if your goal is to annotate/exclude hard to map regions from your SNP calling, it might be more straightforward to compute a 'mappability" track that identifies regions that are hard to map reads to unambiguously to regardless of their functional annotation. There are a few tools to do this that I can point you to. This strategy has the advantage of allowing you to masking all unmappable regions (telomeres, TEs, recent duplicates, etc.).
Thoughts, Casey
On Mar 18, 2020, at 16:47, Holly Celina Mcqueary hmcqueary@uga.edu wrote:
Okay, are telomere annotations alone available? Also is it okay if I use the LRSDAY package in your directory?
Thank you!! -Holly
Holly McQueary PhD Candidate
Hall Lab Department of Genetics University of Georgia From: Jingxuan Chen JingxuanChen@uga.edu Sent: Wednesday, March 18, 2020 4:33 PM To: Holly Celina Mcqueary hmcqueary@uga.edu Cc: Casey Bergman cbergman@uga.edu Subject: Re: telomere and X/Y prime elements annotation for 337
Hi Holly,
I checked LRSDAY pipeline. X and Y’ element annotations are available but not for telomere annotation. Please feel free to let me know if you need any help.
Best, Jingxuan
On Mar 18, 2020, at 4:24 PM, Holly Celina Mcqueary hmcqueary@uga.edu wrote:
Hey Jingxuan,
I realized a bit too late that I need additional annotations for 337 reference genome - specifically telomere and X and Y' elements. I'm going to try some of your pipeline and that of the protocol to generate these, and I will do it in my own directories, but I thought I would keep you updated!!
Hope all is going well for you!
Best, Holly
Holly McQueary PhD Candidate
Hall Lab Department of Genetics University of Georgia
On Mar 17, 2020, at 13:16, Holly McQueary notifications@github.com wrote:
Hi @cbergman and @JingxuanChen7 ,
I'm trying to make a file of hard-to-map sites that I can exclude from my analysis (centromeres, telomeres, mating type region, rDNA, etc), and I found a .gff file for the centromeres but do you happen to know which (if any) files of the new 337 reference genome would be useful for this sort of thing?
For reference I'm looking in the directory: /scratch/hcm14449/TE_MA_Paradoxus/ref_genome/paradoxus/337Ref
Thank you!!! I hope you're staying healthy! -Holly
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
Hi all,
Casey, yes my exact goal is to exclude regions that are hard to map. In previous studies we've used the SGD_features.tab file from SGD (an example here https://sgd-prod-upload.s3.amazonaws.com/S000211725/SGD_features.20120121.tab.gz ), but if there is a more sophisticated way to do this (especially with our own reference genome), I would much rather do that!!
Could you recommend a tool to do this?
Thank you!! -Holly
jc33471@n561 mappability$ gem-indexer -T $PBS_NP -c dna -i genome.337.fasta -o ${prefix}_index
Welcome to GEM-indexer build 1.423 (beta) - (2013/04/01 01:02:13 GMT)
(c) 2008-2013 Paolo Ribeca <paolo.ribeca@gmail.com>
(c) 2010-2013 Santiago Marco Sola <santiagomsola@gmail.com>
(c) 2010-2013 Leonor Frias Moya <leonor.frias@gmail.com>
For the terms of use, run the program with the option --show-license.
************************************************************************
Creating sequence and location files...Fatal error: exception Failure("Command 'echo "-------gem-indexer_fasta2meta+cont" > 337_index.log && gem-indexer_fasta2meta+cont -i genome.337.fasta -c dna --filter-function iupac-dna --strip-unknown-bases-threshold 50 --complement-size-threshold 2000000000 --complement emulat
Begin forwarded message:
From: Jingxuan Chen jc33471@uga.edu<mailto:jc33471@uga.edu> Subject: Re: Questions about running GEM-library Date: March 19, 2020 at 3:38:58 PM EDT To: Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu>
Thank you so much! I also reviewed my fasta file and found empty lines. I simply remove empty lines and it works! I will modify the assembly to a more well-formatted fasta before I continue to run following steps.
Thanks, Jingxuan
On Mar 19, 2020, at 3:35 PM, Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu> wrote:
Alright, the problem seems to be that your fasta sequences are single-line (I guess RaGOO makes them like that), and GEM needs them to be multi-line.
I converted your genome to multi-line and GEM indexer and mappability worked here. I used seqtk seq -A -l 80 input.fasta > output.fasta
, where -l
Let me know if that works for you!
best, Gui
On 19 Mar 2020, at 15:24, Jingxuan Chen JingxuanChen@uga.edu<mailto:JingxuanChen@uga.edu> wrote:
Yes! The log file says fasta input does not work. The genome is here /scratch/jc33471/pilon/337/mappability/genome.337.fasta
.
Thanks, Jingxuan
On Mar 19, 2020, at 3:19 PM, Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu> wrote:
Hi Jingxuan,
I tested running GEM on my side (using the sacCer3 reference genome) and it is working ok. I think this might be some problem with the fasta formatting. I remember GEM was a bit sensitive about it.
Can you point me to the file you are trying to index/calculate mappability? I can take a look.
Gui
On 19 Mar 2020, at 14:17, Jingxuan Chen JingxuanChen@uga.edu<mailto:JingxuanChen@uga.edu> wrote:
Thank you so much for your help!
Jingxuan
On Mar 19, 2020, at 2:14 PM, Guilherme Dias Guilherme.Dias@uga.edu<mailto:Guilherme.Dias@uga.edu> wrote:
Hi Jingxuan,
I don’t remember running into this issue. I’ll do some testing and get back to you!
best, Gui
On 19 Mar 2020, at 14:05, Jingxuan Chen JingxuanChen@uga.edu<mailto:JingxuanChen@uga.edu> wrote:
Hi Gui,
I am now trying to run GEMtools to compute mappability for 337 assembly using the code from your repository https://github.com/bergmanlab/guilherme/blob/dbc4fadad30170d1ea70b213574136753014084b/src/yeast_ploidy.sh#L79 I encountered a problem in the indexing step as below:
jc33471@n561 mappability$ gem-indexer -T $PBS_NP -c dna -i genome.337.fasta -o ${prefix}_index
Welcome to GEM-indexer build 1.423 (beta) - (2013/04/01 01:02:13 GMT)
(c) 2008-2013 Paolo Ribeca <paolo.ribeca@gmail.com<mailto:paolo.ribeca@gmail.com>>
(c) 2010-2013 Santiago Marco Sola <santiagomsola@gmail.com<mailto:santiagomsola@gmail.com>>
(c) 2010-2013 Leonor Frias Moya <leonor.frias@gmail.com<mailto:leonor.frias@gmail.com>>
For the terms of use, run the program with the option --show-license.
************************************************************************
* WARNING: this is a beta version, provided for testing purposes only; *
* check for updates at <http://gemlibrary.sourceforge.net<http://gemlibrary.sourceforge.net/>>. *
************************************************************************
Creating sequence and location files...Fatal error: exception Failure("Command 'echo "-------gem-indexer_fasta2meta+cont" > 337_index.log && gem-indexer_fasta2meta+cont -i genome.337.fasta -c dna --filter-function iupac-dna --strip-unknown-bases-threshold 50 --complement-size-threshold 2000000000 --complement emulat
I did a quick search. It seems that this is a common issue among several people https://www.biostars.org/p/311569/ . Do you have any idea about this error? I was running with 10 threads and 60 GB memory.
Hope that everything goes well with you.
Thank you so much, Jingxuan
/scratch/jc33471/pilon/337/mappability
This is fantastic, thank you so much @JingxuanChen7 !
The read length should be 150 bp - it is paired-end so I guess after combining the R1 and R2 fastqs it should be 300? (I'm not confident that's how that works)
I am using raw data for my pipeline.
Thank you!
@cbergman Do you have a recommendation for the step in my pipeline that I remove these regions? (i.e. early like at bam stage or later at vcfs?)
/scratch/jc33471/pilon/337/mappability
337_lowmappability.bed
is the regions with mappability < 0.9.Updates from 3/19 to present: (Had this in a Word doc, but will update on here from now on)
3/19/20
3/26/20
3/27/20
File Length D0_FullCohort.vcf 1595 Raw variants D0_reducedGEM.vcf 1042 Removed low-mappabiltiy sites D0_reducedGEM_DpGr10_Fil.vcf 827 Removed sites with depth <10 D0_reducedGEM_DpGr10_Fil_AncCalls.vcf 811 Removed sites with ancestor no-calls (./.) D0_reducedGEM_DpGr10_Fil_AncCalls_NoHets.vcf 664 Removed sites with ancestor genotype as Heterozygous
Alternative method: D0_FullCohort.vcf 1595 Raw variants D0_FullCohort_DpGr10_Fil.vcf 1128 Remove sites with depth < 10 D0_FullCohort_DpGr10_Fil_AncCalls.vcf 1094 Remove sites with ancestor no-calls (./.) D0_FullCohort_DpGr10_Fil_AncCalls_NoHets.vcf 854 Remove sites with ancestor genotype as heterozygous D0_FullCohort_DpGr10_Fil_AncCalls_NoHets_GEM.vcf 664 Removed low-mappability sites
Checked to make sure they both gave the same thing – they did (exact same results)
Using the First method, because it seems to make more sense (remove low-mapping sites first instead of at the end)
4/1/20
Chromosome I depth distribution (x axis is location on chr I, y axis is depth)
Chromosome II:
Chromosome III:
Chromosome IV:
Chromosome V:
Chromosome VI:
Chromosome VII:
Chromosome VIII:
Chromosome IX:
Chromosome X:
Chromosome XI:
Chromosome XII:
Chromosome XIII:
Chromosome XIV:
Chromosome XV:
Chromosome XVI:
Filtering of Variants D0 Low depth: 82 High depth: 206
File Length D0_FullCohort.vcf 1595 D0_noLow.vcf 962 D0_noLow_noHigh.vcf 772 D0_noLow_noHigh_redGEM.vcf 580 D0_noLow_noHigh_redGEM_AncCalls.vcf 573 D0_noLow_noHigh_redGEM_AncCalls_NoHets.vcf 471
D1 Low depth: 70 High depth: 188
File Length D1_FullCohort.vcf 2695 D1_noLow.vcf 1586 D1_noLow_noHigh.vcf 1346 D1_noLow_noHigh_redGEM.vcf 971 D1_noLow_noHigh_redGEM_AncCalls.vcf 958 D1_noLow_noHigh_redGEM_AncCalls_NoHets.vcf 812
D20 Low depth: 82 High depth: 206
File Length D20_FullCohort.vcf 1835 D20_noLow.vcf 1251 D20_noLow_noHigh.vcf 963 D20_noLow_noHigh_redGEM.vcf 741 D20_noLow_noHigh_redGEM_AncCalls.vcf 721 D20_noLow_noHigh_redGEM_AncCalls_NoHets.vcf 526
Found variants in the ancestors:
D20 ancestor variants
Chr Pos GT Type VII 61782 1/1 ALT – indel XVI 186933 1/1 ALT – indel
D1 ancestor variants
Chr Pos GT Type IV 641630 2/2 ALT – indel (het in D20 & majority of progeny) VII 61782 1/1 ALT – indel VII 986079 1/1 ALT – SNP (not in D20) XV 299062 1/1 ALT – SNP (not in D20) XVI 186933 2/2 ALT – SNP
D0 ancestor variants
Chr Pos GT Type IV 641630 2/2 ALT – indel VII 61782 1/1 ALT – indel XIII 111478 ½ ALT – indel XV 314234 ½ ALT – indel
4/3/20 Heterozygous sites shared between the 3 diploid ancestors: 40 total File: /Users/hollymcqueary/Dropbox/McQueary/Paradoxus_MA/VCFs/AncHets/D0_D1_D20_common.txt
4/7/20 H0 median depth = 153
Filtering based on depth of 150 though, so removing anything less than 90 and greater than 218
H0 Final Cohort Variant Info:
hcm14449@n204 H0$ wc -l H0_FullCohort.vcf 1729 H0_FullCohort.vcf hcm14449@n204 H0$ pwd /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/Out/H0
1729 raw variants
Remove sites with depth less than 90:
hcm14449@n204 H0$ wc -l H0_noLow.vcf 1080 H0_noLow.vcf
Remove sites with depth greater than 218
hcm14449@n204 H0$ wc -l H0_noLow_noHigh.vcf 889 H0_noLow_noHigh.vcf
Remove hard-to-map regions called by GEM:
hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem.vcf 668 H0_noLow_noHigh_redGem.vcf
Removed sites with no calls in the ancestor:
hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls.vcf 654 H0_noLow_noHigh_redGem_AncCalls.vcf
Removed sites with hets in the ancestor
hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls_NoHets.vcf 545 H0_noLow_noHigh_redGem_AncCalls_NoHets.vcf
SNPs:
hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls_NoHets_SNPs.vcf
344 H0_noLow_noHigh_redGem_AncCalls_NoHets_SNPs.vcf
Indels:
hcm14449@n204 H0$ wc -l H0_noLow_noHigh_redGem_AncCalls_NoHets_Indels.vcf 223 H0_noLow_noHigh_redGem_AncCalls_NoHets_Indels.vcf
Looked at cross-contaminants in D0 samples
Summary:
Line 1 | Line 2 | # Shared SNPs |
---|---|---|
21 | 27 | 12 |
19 | 20 | 5 |
10 | 11 | 5 |
12 | 44 | 4 |
21 | 5 | 4 |
12 | 21 | 4 |
43 | 48 | 4 |
3 | 21 | 3 |
21 | 44 | 3 |
28 | 29 | 3 |
27 | 28 | 2 |
20 | 27 | 2 |
12 | 31 | 2 |
3 | 20 | 2 |
35 | 36 | 2 |
21 | 18 | 2 |
45 | 46 | 2 |
21 | 34 | 2 |
20 | 21 | 2 |
28 | 9 | 1 |
21 | 29 | 1 |
44 | 5 | 1 |
17 | 37 | 1 |
20 | 5 | 1 |
15 | 27 | 1 |
29 | 27 | 1 |
21 | 28 | 1 |
44 | 9 | 1 |
18 | 5 | 1 |
44 | 45 | 1 |
21 | 22 | 1 |
28 | 5 | 1 |
27 | 44 | 1 |
Line 21 shares SNPs with a lot of other lines - problem with de-multiplexing maybe? If I remove line 21, this is the results:
Without Line 21 Line 1 | Line 2 | # Shared SNPs |
---|---|---|
28 | 9 | 1 |
28 | 27 | 1 |
44 | 5 | 1 |
17 | 37 | 1 |
20 | 5 | 1 |
44 | 9 | 1 |
18 | 5 | 1 |
45 | 44 | 1 |
28 | 5 | 1 |
27 | 44 | 1 |
20 | 27 | 2 |
12 | 31 | 2 |
20 | 3 | 2 |
29 | 27 | 2 |
35 | 36 | 2 |
45 | 46 | 2 |
28 | 29 | 3 |
12 | 44 | 4 |
43 | 48 | 4 |
20 | 19 | 5 |
10 | 11 | 5 |
Looking at lines with 1 shared SNP, found that GQ scores were low:
Without Line 21 | AD | DP | GQ | GT | AD | DP | GQ | GT | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Line 1 | Line 2 | # Shared SNPs | Line 1 | Line 2 | |||||||||||
28 | 9 | 1 | Spar_II_RaGOO | 553279 | CA | C | 211.49 | 71,14 | 94 | 37 | CA/C | 67,15 | 95 | 88 | CA/C |
28 | 27 | 1 | Spar_III_RaGOO | 11841 | G | GT | 60.16 | 102,15 | 119 | 40 | G/GT | 33,6 | 39 | 42 | G/GT |
44 | 5 | 1 | Spar_V_RaGOO | 29734 | C | CA | 83.27 | 9,3 | 12 | 27 | C/CA | 9,3 | 12 | 27 | C/CA |
17 | 37 | 1 | Spar_VI_RaGOO | 21351 | CA | C | 153.15 | 33,5 | 38 | 35 | CA/C | 115,18 | 133 | 94 | CA/C |
20 | 5 | 1 | Spar_VI_RaGOO | 188146 | CT | C | 70.21 | 56,9 | 68 | 33 | CT/C | 24,5 | 29 | 44 | CT/C |
44 | 9 | 1 | Spar_XI_RaGOO | 345809 | GA | G | 41.44 | 12,3 | 15 | 40 | GA/G | 90,13 | 103 | 58 | GA/G |
18 | 5 | 1 | Spar_XII_RaGOO | 197446 | AT | A | 117.67 | 190,29 | 234 | 60 | AT/A | 41,6 | 51 | 2 | AT/A |
45 | 44 | 1 | Spar_XIII_RaGOO | 307069 | A | AT | 132.01 | 132,22 | 161 | 44 | A/AT | 8,4 | 12 | 62 | A/AT |
28 | 5 | 1 | Spar_XIII_RaGOO | 906587 | G | GA | 129.16 | 100,17 | 126 | 53 | G/GA | 36,7 | 44 | 61 | G/GA |
27 | 44 | 1 | Spar_XIV_RaGOO | 446172 | C | CT | 80.2 | 17,4 | 24 | 34 | C/CT | 19,3 | 22 | 34 | C/CT |
Looking at lines that shared 2 SNPs, found that a particular region on chromosome IV is common: Without Line 21 | AD | DP | GQ | GT | AD | DP | GQ | GT | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Line 1 | Line 2 | # Shared SNPs | Line 1 | Line 2 | |||||||||||
20 | 27 | 2 | Spar_IV_RaGOO | 526131 | C | CT | 38 | 52,9 | 63 | 49 | C/CT | 27,4 | 35 | 3 | C/CT |
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
Spar_IV_RaGOO | 1446960 | AT | A | 57.25 | 75,12 | 91 | 8 | AT/A | 26,7 | 33 | 87 | AT/A | |||
12 | 31 | 2 | Spar_IV_RaGOO | 1460056 | G | A | 170.73 | 35,4 | 39 | 99 | G/A | 138,15 | 153 | 99 | G/A |
Spar_IV_RaGOO | 1460068 | A | G | 78.13 | 37,3 | 40 | 15 | A/G | 127,12 | 139 | 99 | A/G | |||
20 | 3 | 2 | Spar_IV_RaGOO | 1473021 | G | C | 310.74 | 80,9 | 89 | 99 | G/C | 100,13 | 113 | 99 | G/C |
Spar_IV_RaGOO | 1473039 | GT | G | 283.69 | 85,9 | 94 | 99 | GT/G | 106,13 | 119 | 99 | GT/G | |||
29 | 27 | 2 | Spar_VII_RaGOO | 65449 | TAC | T | 69.82 | 64,8 | 72 | 70 | TAC/T | 34,5 | 39 | 62 | TAC/T |
Spar_XV_RaGOO | 88274 | CAT | C | 55.76 | 54,7 | 61 | 87 | CAT/C | 37,4 | 41 | 31 | CAT/C | |||
35 | 36 | 2 | Spar_VII_RaGOO | 72225 | T | C | 26841.28 | 1,341 | 342 | 99 | C/C | 2,307 | 309 | 99 | C/C |
Spar_XII_RaGOO | 930564 | G | A | 21982.28 | 1,290 | 291 | 99 | A/A | 0,260 | 260 | 99 | A/A | |||
45 | 46 | 2 | Spar_VIII_RaGOO | 282813 | G | T | 34727.28 | 0,182 | 182 | 99 | T/T | 4,696 | 700 | 99 | T/T |
Spar_XV_RaGOO | 912454 | C | T | 24484.28 | 0,152 | 152 | 99 | T/T | 0,460 | 460 | 99 | T/T |
Looking at ancestor calls in these sites, is GQ low? Does it look heterozygous but was called homozygous?
Ancestor | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
20 | 27 | 2 | Spar_IV_RaGOO | 526131 | C | CT | 38 | 145,0 | 145 | 0 | C/C |
Spar_IV_RaGOO | 1446960 | AT | A | 57.25 | 131,13 | 144 | 99 | AT/AT | |||
12 | 31 | 2 | Spar_IV_RaGOO | 1460056 | G | A | 170.73 | 109,0 | 109 | 99 | G/G |
Spar_IV_RaGOO | 1460068 | A | G | 78.13 | 109,0 | 109 | 99 | A/A | |||
20 | 3 | 2 | Spar_IV_RaGOO | 1473021 | G | C | 310.74 | 109,0 | 109 | 99 | G/G |
Spar_IV_RaGOO | 1473039 | GT | G | 283.69 | 143,0 | 143 | 96 | GT/GT | |||
29 | 27 | 2 | Spar_VII_RaGOO | 65449 | TAC | T | 69.82 | 102,0 | 102 | 99 | TAC/TAC |
Spar_XV_RaGOO | 88274 | CAT | C | 55.76 | 86,0 | 86 | 99 | CAT/CAT | |||
35 | 36 | 2 | Spar_VII_RaGOO | 72225 | T | C | 26841.28 | 106,0 | 106 | 99 | T/T |
Spar_XII_RaGOO | 930564 | G | A | 21982.28 | 106,0 | 106 | 99 | G/G | |||
45 | 46 | 2 | Spar_VIII_RaGOO | 282813 | G | T | 34727.28 | 109,0 | 109 | 99 | G/G |
Spar_XV_RaGOO | 912454 | C | T | 24484.28 | 103,0 | 103 | 99 | C/C |
Only one site has a very low GQ score in the ancestor, the rest all look like confident calls.
Going to talk to Dave and decide what to filter out based on GQ scores in the ancestor: Plotted the scores in a distribution to determine what cutoffs to use:
Issue with a lot of the sites that appear in the cross-contaminants is that they are called heterozygous but are very clearly not (i.e. the allele depth is 110,10 and it calls it het)
Also, there are several sites where there are two alternate alleles, but this is not possible so we are removing those
Used --restrict-alleles-to BIALLELIC in GATK's SelectVariants
For D0: decreased total variant sites from 471 to 443 in VCF file
I can go back and update this later, because I think the labeling of the columns in this file is off, but basically it looks like in H0, D0, and D1 there are less heterozygous sites than would be expected (i.e. basically none). This makes sense in haploids as a new mutation would definitely be homozygous, but the opposite should be true in diploids - where a new mutation should almost always be heterozygous. That's what is shown in D20 so we're pretty confident D20 is correct.
I am going back and re-running one of my scripts to see if I had a weird error somewhere to have caused this, and am thinking of looking at the MAT locus to differentiate between haploid and diploid and/or looking at any aneuploidy in the samples.
@cbergman and @JingxuanChen7 , if you guys have any advice on how to go about this or what could cause it (besides me being dumb at the beginning of my experiment), please let me know!
Also, once we get back into lab we can test the strains in the freezer and see if they mate (if they do = def haploid, if they don't = def diploid), we will definitely do this before we publish but would like to know computationally asap so we can move onto actual analyses.
Analyzed HMR and HML loci for coverage. Method: masked the MAT locus in genome 337 re-aligned reads only to HMR and HML loci https://github.com/hollygene/TE_MA/blob/TE_MA/maskFASTA.sh
Put resulting data into Excel, and found ratios of the coverage between the two loci (file attached). All H0 samples are haploid alpha All D0 samples are haploid alpha All D1 samples are haploid alpha D20 samples are diploid except for: D20-3, 29, 30: Haploid alpha D20-4: Haploid a (likely 1938)
My guess is that there were heterogeneous cell populations at the beginning of the experiment. The ancestor spike-in data, looks to be D20- haploid alpha D1- diploid D0-diploid H0-diploid
So, clearly there are some issues here.
For now, I am going to focus my efforts on finding locations of TEs in the samples I am confident in, and plotting them layered with coverage plots to see if there is any correlation between TE location and chromosome ploidy/breakpoints (there are several partial duplications/deletions).
There are plans to do a mating experiment with the ancestor samples in the freezer, but I am not going to worry about that yet and focus on the dataset I currently have, which is now 2 haploid Ty-less strains, 1 haploid strain with 1 Ty1, and one diploid strain with ~15 Ty1s.
Dave is working on fixing some of the issues caused by the variant caller relating to allele frequencies and weird calls (i.e. some of them are 80/20 called homozygous, some are 90/10 called heterozygous?)
/scratch/jc33471/paradoxusHolly/run0727/out
/scratch/jc33471/paradoxusHolly/run0727/internal.tsv
and /scratch/jc33471/paradoxusHolly/run0727/LTR.tsv
Whole-genome sequencing of S. paradoxus mutation accumulation lines DNA extracted using Zymo YeaSTAR Genomic DNA kit Libraries prepped using homemade protocol (no kit, can find out more info if needed, probably published) Sequenced on NovaSeq S4 flow cell PE 150 at Genewiz 500GB total sequencing data
Location of fastq files on Sapelo:
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq
(I also have backups of this on the /project/ folder for our lab, and on an external hard drive)So one potential problem I have found is that my samples have varying depth of coverage - I guess because they were pooled. My main concern is that my ancestral strains have ~25x coverage, while the progenitor lines have anywhere from 150x-2000x(!).