Paradoxus MA Information

hollygene commented 5 years ago

Whole-genome sequencing of S. paradoxus mutation accumulation lines
- DNA extracted using Zymo YeaSTAR Genomic DNA kit
- Libraries prepped using homemade protocol (no kit, can find out more info if needed, probably published)
- Sequenced on NovaSeq S4 flow cell PE 150 at Genewiz
- 500GB total sequencing data

Location of fastq files on Sapelo: /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq (I also have backups of this on the /project/ folder for our lab, and on an external hard drive)

So one potential problem I have found is that my samples have varying depth of coverage - I guess because they were pooled. My main concern is that my ancestral strains have ~25x coverage, while the progenitor lines have anywhere from 150x-2000x(!).

cbergman commented 5 years ago

4 original strains:
- DG1768: a Ty1‐less strain, (MATα his3‐Δ200hisG ura3) derived from strain 337, kindly provided by J. Adams (Wilke and Adams, 1992), as previously described (Garfinkel et al., 2003).
- DG1938 (MATa his3‐Δ200G ura3 Ty1‐less Spo+) was derived by crossing DG1768 with DG1929 [MATa his3‐Δ 200hisG ura3 Ty1his3‐AI(96)] and segregating spores for loss of the Ty1 element. DG1929 is [MATa his3‐Δ 200hisG ura3 Ty1his3‐AI(96)] derived from strain 337, kindly provided by J. Adams (Wilke and Adams, 1992), as previously described (Garfinkel et al., 2003).
- DG4005 (rederived DG2454): MATα his3-Δ200hisG ura3 gal3 Ty1-4253 Ty-less Spo−. DG1768 + Ty1-4523 (inserted in ChrX between RAD7 and CDC8, adjacent to Gly-tRNA gene)
- DG2451: DG1768 + ~20-25 Ty1-H3 insertions (estimated by Southern analysis, insertion locations not determined). Papers say 20 in Genetics 2003 and 25 in PNAS 2009
Original strains used to create 4 progenitors:
- H0: Haploid 0 Ty1s (DG1768 α)
- D0: Diploid 0 Ty1s (DG1938 a x DG1768 α)
- D1: Diploid 1 Ty1 (DG1938 a x DG4005 α)
- D20: Diploid ~20 Ty1-H3 insertions (DG1938 a x DG2451 α?)
200 days of transfer, 1 transfer every other day
- 100 transfers, 2000 generations
- stored every 10 transfers
- sequenced 100th transfer only
- goal was 48 replicates per genotype + progenitors = ~200 total samples
- <5 replicates missing from each genotype
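As a quick sanity check on the design numbers above, the arithmetic can be written out in shell. The variable names are made up for illustration; the values are the ones stated in the list, and the count of four genotypes (H0, D0, D1, D20) comes from the progenitor list.

```bash
# Values taken from the experimental design above
days=200
days_per_transfer=2          # one transfer every other day
generations=2000
reps_per_genotype=48
genotypes=4                  # H0, D0, D1, D20

transfers=$(( days / days_per_transfer ))
gens_per_transfer=$(( generations / transfers ))
# replicate lines plus one progenitor per genotype
total_samples=$(( genotypes * reps_per_genotype + genotypes ))

echo "transfers: $transfers"
echo "generations per transfer: $gens_per_transfer"
echo "target samples: $total_samples"
```

This gives 100 transfers, 20 generations per transfer, and a target of 196 samples, which matches the "~200 total samples" above.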

cbergman commented 5 years ago

Goal is to get BAM files and estimates of TE content from each sample
Here is where 337 is on the Spar tree
Which reference to use for mapping (CBS432 vs 337 Pacbio)?
YPS138
- public (GCA_002079115.1)
- annotated
- polished
- from same Spar lineage (SpB)
- no Ty1 present in genome
337 Pacbio
- private (Collaboration btwn Bergman and Garfinkel labs)
- not Illumina polished
- not annotated
- very close to progenitor lineages (SpB)
- no Ty1 present in genome
- less filtering and debugging, overall faster
CBS432 corrected with 337 progenitor reads
- assumes a lot about genome structure being conserved
- if process is not perfect, may have worst of both worlds
Ideal would be to improve 337 and use for mapping
- Pilon polish with 337 Illumina progenitor + organize by chromosome - this is minimal requirement prior to running McC on whole dataset
- Transfer annotation (LRSDAY or trackhub) - this can be done independent of mapping with McC
Masked or unmasked reference genome
- Masked + augmented - will be made by McC coverage module
- unmasked - will be made by McC TEMP pipeline
Note: Spar reference but Scer Ty1. No Spar Ty1 internal in genome, but there will be Spar Ty1 LTRs in genome

cbergman commented 5 years ago

For Pilon polishing, the fastq that should be used is /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R*

cbergman commented 4 years ago

I've made reasonable progress on polishing the 337 genome, but I stalled out and have handed this over to Jingxuan. Can you please invite @JingxuanChen7 to this issue? Thanks!

hollygene commented 4 years ago

Absolutely!

I should be receiving new (higher read coverage) data for the ancestors hopefully by the end of this week. I will let you know when I get it and I will give you the path on sapelo to it!

Thanks!

cbergman commented 4 years ago

@JingxuanChen7 has made good progress on polishing and scaffolding the 337 genome, and has also gotten gene and centromere annotation working using the LRSDAY pipeline. LRSDAY can also annotate other yeast-specific features such as Y' elements and core X-elements, and do mtDNA gene annotation. (Just for the record, we will use a different TE annotation pipeline that I have developed.)
@hollygene could you have a read through the LRSDAY paper and see if these additional annotations will be useful for your MA analyses: https://www.nature.com/articles/nprot.2018.025
Once we hear back from Holly, Jingxuan can finalize the 337 genome sequence and annotation and Holly should be able to start using this reference to analyze the MA data.

cbergman commented 4 years ago

Progress on 337 polishing and annotation is logged here: https://github.com/bergmanlab/jingxuan/issues/12

hollygene commented 4 years ago

@cbergman @JingxuanChen7 Thank you so much! If you could do the mtDNA gene annotation as well, that would be great. The rest aren't particularly useful for this project.

JingxuanChen7 commented 4 years ago

The genome after scaffolding and removing short contigs is here: /scratch/jc33471/pilon/337/annotation/genome.337.fasta

cbergman commented 4 years ago

HGAP assembly script is here: https://github.com/bergmanlab/casey/blob/master/src/scripts/genomics/yeast/DG1768-smrtlink-hgap.sh

JingxuanChen7 commented 4 years ago

5th round of pilon, ragoo, and first run of nuclear gene, mtDNA, and centromere annotation scripts: https://github.com/bergmanlab/jingxuan/blob/master/src/shell/pilon_asm.sh

cbergman commented 4 years ago

first 4 rounds of pilon are here: https://github.com/bergmanlab/casey/blob/master/src/scripts/genomics/yeast/337_pilon.sh

cbergman commented 4 years ago

To be clear, the polished, scaffolded, filtered assembly to be used for gene/mtdna/centromere annotation (JC), SNP calling (HM), TE annotation (CMB), TE abundance estimate (JC) is: /scratch/jc33471/pilon/337/annotation/genome.337.fasta

cbergman commented 4 years ago

Five rounds of pilon polishing were done with HM-H0-A_R1_001.fastq and HM-H0-A_R2_001.fastq (BWA/0.7.17-foss-2016b, SAMtools/1.9-foss-2018b, pilon/1.22-Java-1.8.0_144)
ragoo was run using uncorrected long reads (which isn't the proper way to do things), short reads, and no reads (RaGOO/1.1-foss-2018a-Python-3.6.4, minimap2/2.17-foss-2018a)
JC will update annotation scripts to use proper ragoo file and run annotation code to completion.
HM and CMB will use the final ragoo assembly that was used for gene/mtdna/centromere annotation after JC updates scripts.
Once final assembly and gene/mtdna/centromere annotation are available, CMB will make back up and make directories readable to HM
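The polishing loop described above could look roughly like the following. This is a sketch, not the scripts linked in this thread: the function names, the intermediate file names (`roundN.*`), and the assumption that Pilon is callable as `pilon` (as with a conda install; on a module system it is usually launched with `java -jar`) are all hypothetical. With `DRY_RUN=1` it only prints the commands it would run.

```bash
# run CMD...: execute the command, or just print it when DRY_RUN is set
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# polish ASSEMBLY READ1 READ2 ROUNDS
# Each round maps the short reads to the current assembly and lets Pilon
# correct it; the corrected FASTA becomes the input of the next round.
polish() {
    local asm r1 r2 rounds i
    asm=$1; r1=$2; r2=$3; rounds=$4
    i=1
    while [ "$i" -le "$rounds" ]; do
        run bwa index "$asm"
        run bwa mem -o "round$i.sam" "$asm" "$r1" "$r2"
        run samtools sort -o "round$i.bam" "round$i.sam"
        run samtools index "round$i.bam"
        # Pilon writes its corrected assembly to <output prefix>.fasta
        run pilon --genome "$asm" --frags "round$i.bam" --output "round$i"
        asm="round$i.fasta"
        i=$((i + 1))
    done
}
```

For example, `DRY_RUN=1 polish assembly.fasta HM-H0-A_R1_001.fastq HM-H0-A_R2_001.fastq 5` prints the 25 commands for five rounds without touching any files (`assembly.fasta` is a placeholder name).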

cbergman commented 4 years ago

Holly can you make sure that the following files stay in the following locations:

READ1="/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R1_001.fastq"
READ2="/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R2_001.fastq"

also can you post the locations and an explanation of the directory structure for the complete data set after you get your raw data reorganized?

hollygene commented 4 years ago

READ1="/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R1_001.fastq"
READ2="/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R2_001.fastq"
These files are still in the same directory.

The rest of my raw data is also currently in this directory, but I am going to change that tomorrow and will post the new locations and directory structure!

JingxuanChen7 commented 4 years ago

Sorry for the messy code yesterday.
I re-ran the whole annotation pipeline (nuclear and mitochondrial gene annotation, centromere annotation) and cleaned up the code.
The final genome assembly is here: /scratch/jc33471/pilon/337/annotation/genome.337.fasta
The final gff files are in this folder: /scratch/jc33471/pilon/337/annotation/final_output
- 337.centromere.gff3 for centromere annotation, 337.mitochondrial_gene.updated.gff3 chrMT annotation, 337.nuclear_gene.updated.gff3 for nuclear gene annotation.
Scripts:
- Script for pilon after round 5, ragoo here (search pilon and ragoo part): https://github.com/bergmanlab/jingxuan/blob/master/src/shell/pilon_asm.sh
- Script for annotation pipeline: https://github.com/bergmanlab/jingxuan/blob/master/src/shell/337_annotation.sh
- Script for conda environment: https://github.com/bergmanlab/jingxuan/blob/master/src/shell/337_annot_env.sh

hollygene commented 4 years ago

This was so fast, thank you Jingxuan! @cbergman Does this mean these files are ready, or should I wait to align?

Thanks!

cbergman commented 4 years ago

I made a back up of JC's 337 /scratch directory on /work so that it will be in a more stable location
Let's use the version of the data in /work for variant calling, TE annotation and TE abundance estimation
relevant files are
- reference genome: /work/cmblab/cbergman/337/annotation/genome.337.fasta
- centromere annotation: /work/cmblab/cbergman/337/annotation/final_output/337.centromere.gff3
- nuclear gene annotation: /work/cmblab/cbergman/337/annotation/final_output/337.nuclear_gene.updated.gff3
- mtDNA gene annotation: /work/cmblab/cbergman/337/annotation/final_output/337.mitochondrial_gene.updated.gff3
Holly can you confirm these files are readable?

hollygene commented 4 years ago

Awesome!

I tried viewing them with less but it says permission denied

cbergman commented 4 years ago

Ok, it looks like a permissions issue on /work. I don't see an easy fix to this and will definitely need to work with GACRC to find a solution (something along the lines of you joining our unix group). If you want to get started ASAP, use the files in @JingxuanChen7's directories. They are identical:

[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/genome.337.fasta /work/cmblab/cbergman/337/annotation/genome.337.fasta
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/final_output/337.centromere.gff3 /work/cmblab/cbergman/337/annotation/final_output/337.centromere.gff3
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/final_output/337.mitochondrial_gene.updated.gff3 /work/cmblab/cbergman/337/annotation/final_output/337.mitochondrial_gene.updated.gff3
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/final_output/337.nuclear_gene.updated.gff3 /work/cmblab/cbergman/337/annotation/final_output/337.nuclear_gene.updated.gff3

In general I prefer to work from a common set of files in /work since files on scratch can get deleted. But in this case I think we should progress and just make sure the files in @JingxuanChen7's directories don't get modified/deleted.
@JingxuanChen7 can you make your /scratch/jc33471/pilon/337/ dir read-only (chmod a-w -R /scratch/jc33471/pilon/337)? Thanks!
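The `diff` commands above print nothing when two files are identical. For checking several copies at once, a small helper built on `cmp -s` gives an explicit per-file report. This is only a sketch; the function names are hypothetical.

```bash
# same_file A B: succeed only if both files exist and are byte-identical
same_file() {
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# check_copies SRC_DIR DST_DIR FILE...
# Report each relative path as identical or not; return non-zero if any differ.
check_copies() {
    local src dst status f
    src=$1; dst=$2; shift 2
    status=0
    for f in "$@"; do
        if same_file "$src/$f" "$dst/$f"; then
            echo "OK        $f"
        else
            echo "DIFFERENT $f"
            status=1
        fi
    done
    return "$status"
}
```

It could be called as `check_copies /scratch/jc33471/pilon/337/annotation /work/cmblab/cbergman/337/annotation genome.337.fasta final_output/337.centromere.gff3`, with the remaining gff3 files appended as further arguments.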

JingxuanChen7 commented 4 years ago

Already made the folder read-only.

jc33471@sapelo2-sub1 ~$ ll /scratch/jc33471/pilon/337
total 108
dr-xr-xr-x.  9 jc33471 cmblab  4096 Jan 15 15:33 ./
drwxr-xr-x.  4 jc33471 cmblab  4096 Jan 24 15:35 ../
dr-xr-xr-x.  7 jc33471 cmblab  4096 Jan 24 12:56 annotation/
dr-xr-xr-x. 11 jc33471 cmblab 12288 Jan 15 16:25 annotation_test/
dr-xr-xr-x.  3 jc33471 cmblab  4096 Jan 15 15:31 data/
dr-xr-xr-x.  2 jc33471 cmblab  4096 Dec 20 14:07 mummer/
dr-xr-xr-x.  2 jc33471 cmblab 20480 Dec 20 12:08 pilon/
dr-xr-xr-x.  5 jc33471 cmblab  4096 Jan  2 14:40 ragoo/
dr-xr-xr-x. 19 jc33471 cmblab 16384 Jan 15 16:47 scripts/

hollygene commented 4 years ago

I'm not positive I'm using the right commands to test this, but I'm still getting permission denied:

hcm14449@sapelo2-sub2 ~$ ls /scratch/jc33471/pilon/337/annotation/
ls: cannot access /scratch/jc33471/pilon/337/annotation/: Permission denied

hcm14449@sapelo2-sub2 ~$ less /scratch/jc33471/pilon/337/annotation/genome.337.fasta
/scratch/jc33471/pilon/337/annotation/genome.337.fasta: Permission denied

cbergman commented 4 years ago

Jingxuan, I think you need to make a higher-level scratch directory readable/executable by others
```
chmod a+r /scratch/jc33471/
chmod a+x /scratch/jc33471/
```
After Jingxuan does this, Holly please try again
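The underlying rule is that every directory on the way to a file needs the execute (search) bit for the user trying to reach it, so a readable file under a closed parent still gives "Permission denied". A rough way to find the blocking directory is to walk up the path and inspect the mode bits. The helper below is a hypothetical sketch that assumes GNU `stat` (as on Linux) and only looks at the permissions for "other" users, not group membership or ACLs.

```bash
# blocking_dirs PATH: print every directory on the way to PATH that lacks
# the execute (search) bit for "other" users.
blocking_dirs() {
    local p mode other
    p=$1
    if [ ! -d "$p" ]; then p=$(dirname "$p"); fi
    while :; do
        mode=$(stat -c '%a' "$p") || return 1
        other=${mode#"${mode%?}"}      # last octal digit = bits for "other"
        case $other in
            1|3|5|7) ;;                # execute bit set: directory is traversable
            *) echo "$p" ;;
        esac
        case $p in
            /|.) break ;;              # reached the top of the path
        esac
        p=$(dirname "$p")
    done
}
```

For example, `blocking_dirs /scratch/jc33471/pilon/337/annotation/genome.337.fasta` would list `/scratch/jc33471` if that directory were not traversable by others. Where util-linux is installed, `namei -l` on the same path prints a similar per-component listing.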

JingxuanChen7 commented 4 years ago

I have made the folder readable and executable. Sorry for the delay.

jc33471@sapelo2-sub2 jingxuan$ chmod a+r /scratch/jc33471
jc33471@sapelo2-sub2 jingxuan$ chmod a+x /scratch/jc33471

hollygene commented 4 years ago

Awesome, this works, thank you!

cbergman commented 4 years ago

@hollygene: We are ready to start running McClintock to estimate Ty copy number across all of the samples. Could you post the location of the final dataset? If the sample names aren't in the file names, could you also post a table mapping samples to files? Thanks!

hollygene commented 4 years ago

Hi, yes sorry to take so long! ALL of the fastq files are in this directory: /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/
They are also separated into each strain:
- /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/H0
- /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/D0
- /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/D1
- /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/D20

I realize this is really redundant, but I was kind of worried about accidentally deleting some of them so I have copied them in a ton of places.. (and I have backups on external drives)

The individual strain directories contain all of the files that I will be using in my final analysis. The ../00_fastq/ directory contains all of the data I have - including samples that have been resequenced.

I hope that makes sense! I probably will need to make these readable to you guys, I'll try to do that now but if it doesn't work let me know!

The sample names are in the file names. Thank you so much!!

hollygene commented 4 years ago

In case giving y'all access to the /project/ folder isn't possible, these files are also in my /scratch/ directory:
- /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0
- /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/D0
- /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/D1
- /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/D20

JingxuanChen7 commented 4 years ago

I am able to access data in /scratch folders. Thanks!

cbergman commented 4 years ago

Thanks Holly. I think we should sit down in the next day or so and try to remove the redundancy from the data set on /scratch, then make it a read-only archive and document it so we are all on the same page and use the same exact files. Otherwise, we'll get into difficulty down the road. It's fine to leave the copy in /project as you have it now. Thursday 1-2 PM and Friday morning 10:30-12 work for me to get together. Could you come down to the lab either of those times?

hollygene commented 4 years ago

Okay, that sounds great! I can come down today around 1 if that still works for you!

hollygene commented 4 years ago

@cbergman @JingxuanChen7

It looks like the gzipping is all done! I'm not currently in my lab, but when I get back to my notes I will start on the next step and getting them transferred to /scratch/.

cbergman commented 4 years ago

great. Before transferring to scratch make the reorganized folders on /project read-only, i.e.
```
chmod a-w -R /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/
```

to rsync from /project to /scratch do something like

rsync -av /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/ /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq

be sure to delete everything in /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/ first (i.e. rm -rf /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/) & make sure the trailing slash is on the sending path but not on the destination path. You might need to make the destination folder first (i.e. mkdir -p /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq) before doing the rsync
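The copy steps above (create the destination, rsync with a trailing slash on the source only, then lock the copy) can be bundled into one helper. This is a sketch under assumptions: the function name is hypothetical, the `diff -rq` verification pass is an addition that was not part of the instructions, and the `rm -rf` clean-up of the old destination is deliberately left out so it stays a manual step.

```bash
# sync_archive SRC DST: copy the contents of SRC into DST, verify, make DST read-only
sync_archive() {
    local src dst
    src=${1%/}                 # normalise: strip one trailing slash if present
    dst=${2%/}
    mkdir -p "$dst" || return 1
    # trailing slash on the source copies the directory's contents,
    # not the directory itself, into the destination
    rsync -a "$src/" "$dst" || return 1
    # fail if any file differs between the two trees
    diff -rq "$src" "$dst" || return 1
    chmod -R a-w "$dst"
}
```

Without the trailing slash on the source, rsync would create an extra `00_fastq/` level inside the destination, which is why the slash placement matters.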

hollygene commented 4 years ago

Successfully got everything read-only:

total 1024
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 12:19 D20
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:21 H0
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:25 D0
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:30 D1
dr-xr-xr-x. 3 hcm14449 dwhlab 53248 Feb  7 16:06 ..
dr-xr-xr-x. 6 hcm14449 dwhlab 53248 Feb  7 16:06 .
[hcm14449@xfer3 00_fastq]$ pwd
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq
[hcm14449@xfer3 00_fastq]$

Also successfully rsync'd to /scratch:
 [hcm14449@xfer3 00_fastq]$ ls -lrta
total 56
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb  7 12:19 D20
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb  7 13:21 H0
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb  7 13:25 D0
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb  7 13:30 D1
drwxr-xr-x. 3 hcm14449 dwhlab  4096 Feb  7 16:02 ..
dr-xr-xr-x. 6 hcm14449 dwhlab  4096 Feb  7 16:06 .
[hcm14449@xfer3 00_fastq]$ pwd
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq
[hcm14449@xfer3 00_fastq]$

e.g.

[hcm14449@xfer3 00_fastq]$ ls -lrta D20
total 125869148
-r-xr-xr-x. 1 hcm14449 dwhlab  572729373 Feb  5 11:13 D20-44_R1.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab  601673638 Feb  5 11:13 D20-44_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab  518434568 Feb  5 11:13 D20-A_R1.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab  540250860 Feb  5 11:14 D20-A_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 1997108912 Feb  5 11:27 HM-D20-10_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 2040745262 Feb  5 11:27 HM-D20-10_R2_001.fastq.gz

I think you guys should have read access to these, but let me know if something doesn't work! Thanks!!
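Since the sample names are embedded in the file names, a sample-to-file table can be derived from a directory listing. The sketch below assumes only the two naming patterns visible in the listing above (`<sample>_R1.fq.gz` and `HM-<sample>_R1_001.fastq.gz`); the function name and the `MISSING` marker are made up for illustration.

```bash
# sample_table DIR: print a tab-separated table (sample, genotype, R1, R2)
# for the paired FASTQ files found in DIR.
sample_table() {
    local dir base sample name genotype r1 r2
    dir=$1
    for r1 in "$dir"/*_R1.fq.gz "$dir"/*_R1_001.fastq.gz; do
        if [ ! -e "$r1" ]; then continue; fi      # pattern matched nothing
        base=$(basename "$r1")
        r2="$dir/$(printf '%s' "$base" | sed 's/_R1/_R2/')"
        if [ ! -e "$r2" ]; then r2="MISSING"; fi
        sample=${base%%_R1*}        # e.g. D20-44 or HM-D20-10
        name=${sample#HM-}          # drop the HM- prefix if present
        genotype=${name%%-*}        # e.g. D20
        printf '%s\t%s\t%s\t%s\n' "$sample" "$genotype" "$r1" "$r2"
    done
}
```

Running `sample_table .../00_fastq/D20 > D20_samples.tsv` would give one row per read pair. Files with and without the `HM-` prefix are kept as separate rows, so a sample sequenced under both naming patterns shows up twice.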

JingxuanChen7 commented 4 years ago

McClintock coverage module for all 337 sequencing data finished

The script used to submit McC runs and summarize the coverage results is here: https://github.com/bergmanlab/jingxuan/blob/master/src/shell/holly_sep_rep.sh
The outputs on sapelo2 are here: /scratch/jc33471/paradoxusHolly/run0210_*
The summarized .tsv files are here: https://github.com/bergmanlab/jingxuan/tree/master/data/paradoxus/337_outputs
Quick bar plots showing copy numbers in each sample.
- I suspect HM-D0-A, D1-A, D20-A are mislabeled?
Scripts for R plots https://github.com/bergmanlab/jingxuan/blob/master/src/rscripts/cov_337.R

JingxuanChen7 commented 4 years ago

Barplots for LTR copy number estimations.

hollygene commented 4 years ago

Awesome, thank you @JingxuanChen7!

It looks like D20-A is actually D1, D1-A is actually D20, and D0-A is actually D1?

Is that what it looks like to y'all?

cbergman commented 4 years ago

I don't think we should guess about this yet. I think we should analyze all of the samples from the original big sequencing run done last fall. This way we can compare samples that were run in both the original and newer run

Also, I see that there are some new files in /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0/

[cbergman@sapelo2] $ ls -lrt /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0/ | tail
-r-xr-xr-x. 1 hcm14449 dwhlab  481023003 Feb  5 11:24 H0-7_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab  626363773 Feb  5 11:25 H0-A_R1.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab  668478601 Feb  5 11:25 H0-A_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 2519083690 Feb 11 13:50 HM-H0-10_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 2582116750 Feb 11 13:50 HM-H0-10_R2_001.fastq.gz
-rw-r--r--. 1 hcm14449 dwhlab        582 Feb 13 13:27 *_BR.sh
-rw-------. 1 hcm14449 dwhlab          0 Feb 13 13:43 *_BR.o2001728
-rw-------. 1 hcm14449 dwhlab       3420 Feb 13 13:43 *_BR.e2001728
-r-xr-xr-x. 1 hcm14449 dwhlab 1056087279 Feb 13 14:31 HM-H0-11_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 1087724459 Feb 13 14:31 HM-H0-11_R2_001.fastq.gz

Are you free now (Thurs) before 4, tomorrow (Fri) 1-4, or Mon 10-4? I think we need to go over setting up a new data archive just for the original run /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/ and coordinate about which bam files should be used in the project.

hollygene commented 4 years ago

Here's an outline of the sequencing data I have:

Ancestors (Spike-In Data)

/project/dwhlab/Holly/TE_MA_Paradoxus/Paradoxus_MA/Anc_SpikeIns/Holly_gDNA
[hcm14449@xfer3 Holly_gDNA]$ ls -lrt
total 158432
-r-xr-xr-x. 1 hcm14449 dwhlab 34210410 May 23  2019 HM_D20_S15_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 28470836 May 23  2019 HMM_D0_S13_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 31904023 May 23  2019 HMM_D1_S14_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 33984257 May 23  2019 HM_H0_S16_R1_001.fastq.gz

First Run of Sample Data (Sep 2019)

/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3

Missing samples from this data:
- H0: 3, 6, 25
- D0: 8, 28, 29, 41, 45
- D1: 1, 23, 34, 43, 47, 48
- D20: 7

Resequenced Samples (Jan 2020):

/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/Holly

Duplicate Samples:
- H0: A, 3, 4, 7, 13, 14
- D0: A, 3, 30, 31, 34
- D1: A, 6, 21, 22
- D20: A, 44

Still Missing:
- H0: 6
- D0: 8, 41
- D1: 47
- D20: 7
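Lists like the ones above can be regenerated from the archive itself by checking which replicate numbers have no R1 file. The helper below is a hypothetical sketch that assumes replicates are numbered 1 to 48 and that file names follow the two patterns seen elsewhere in this thread (`H0-7_R1.fq.gz`, `HM-H0-10_R1_001.fastq.gz`).

```bash
# missing_reps DIR GENOTYPE N: print replicate numbers 1..N with no R1 file in DIR
missing_reps() {
    local dir geno n i f found
    dir=$1; geno=$2; n=$3
    i=1
    while [ "$i" -le "$n" ]; do
        found=0
        # "_R1" directly after the number keeps H0-4 from matching H0-44
        for f in "$dir/$geno-$i"_R1* "$dir/HM-$geno-$i"_R1*; do
            if [ -e "$f" ]; then found=1; fi
        done
        if [ "$found" -eq 0 ]; then echo "$i"; fi
        i=$((i + 1))
    done
}
```

For example, `missing_reps .../00_fastq/H0 H0 48` prints one missing replicate number per line.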

Oh, also - the reason I had new files in the /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0 directory was that I found I was missing two samples (H0 10 and H0 11) from that folder that were actually in the /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3 directory

I used cp to copy those two from /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3 to /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/H0 and rsync to sync that with /scratch/ (Hope that was okay to do)

I just now made /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/H0 read-only again

[hcm14449@xfer3 00_fastq]$ chmod a-w H0
[hcm14449@xfer3 00_fastq]$ ls -lrta
total 1024
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 12:19 D20
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:25 D0
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:30 D1
dr-xr-xr-x. 3 hcm14449 dwhlab 53248 Feb  7 16:06 ..
dr-xr-xr-x. 6 hcm14449 dwhlab 53248 Feb  7 16:06 .
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 13 14:31 H0

JingxuanChen7 commented 4 years ago

I visualized TE locations in the D20 replicates using the TE-locate output. Here are a few sites that may support the mislabeling (screenshots: Chr 5, Chr 7, Chr 10).
At all 3 positions, HM-D20-4 and D20-A do not have an insertion where they should.
For the D1 replicates, it is really hard to find locations like the ones above. This location on Chr 10 might be an insertion that occurred during the experiment?
I would point out D1-A here because it does not have a TE where it should, but it does have a TE at a similar location to the D20 samples (see Chr 10 in D20). Its copy number profile also indicates a much higher level than the other D1 samples.

Note: one TE-locate job failed (HM-D1-45), probably due to the extremely high genome depth (3340.05). The coverage module finished successfully for that sample.

hollygene commented 4 years ago

Thanks @JingxuanChen7!

For D1 samples, I expect some level of transposition so I wouldn't be surprised if there were more than 1 Ty1 in the progenitors - however, the ancestor should only have 1
The Ty1 in D1 is on ChrX between RAD7 and CDC8 and next to a Gly tRNA gene
I don't know the locations of D20's Ty1s, I just know that there are approximately 20 of them

From our original spike-in data of the ancestors, it looks like D1 and D20 were mislabeled. H0 and D0 looked correct, but we should probably still make sure they are haploid and diploid, respectively.

ALL of the FASTQ files are now in this directory:

/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/AllFastas

cbergman commented 4 years ago

Thanks @hollygene, the new data archive looks good. Timestamps make sense and all files are read-only, which is good. Preferably keep this folder as is, or alert us asap if you make any changes.
@JingxuanChen7, can you re-run your McC analysis and keep bam files for the new data archive. This will allow us to compare coverage and TE location results for samples run on both dates.
Also, @JingxuanChen7 can you see if RAD7 and CDC8 are the flanking genes for the Ty1 insertion on Chr10 you see in the D1 series?
Could one of you also put together a table of the number of reads/read pairs and estimated depth of coverage for the single-copy portion of the genome for all samples? We may need to down-sample excessively high-coverage samples.
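A first pass at such a table can come straight from the FASTQ files: count records, then estimate depth as read pairs x 2 x read length / genome size. The sketch below is an approximation under stated assumptions: the function names are hypothetical, the read length of 150 comes from the PE 150 run described at the top of this issue, the genome size must be supplied by the user (roughly 12 Mb is assumed in the example), and the result is a genome-wide average rather than depth over single-copy regions only.

```bash
# count_reads FILE.gz: number of records in a gzipped FASTQ (4 lines per record)
count_reads() {
    local lines
    lines=$(gzip -cd "$1" | wc -l)
    echo $(( lines / 4 ))
}

# est_depth READ_PAIRS READ_LEN GENOME_SIZE: approximate average depth of coverage
est_depth() {
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.1f\n", n * 2 * len / g }'
}

# keep_fraction DEPTH TARGET: fraction of reads to keep to reach TARGET depth (max 1)
keep_fraction() {
    awk -v d="$1" -v t="$2" 'BEGIN { f = t / d; if (f > 1) f = 1; printf "%.3f\n", f }'
}
```

For instance, 2,000,000 read pairs of 150 bp on a 12 Mb genome give `est_depth 2000000 150 12000000` = 50.0, and a sample at 3340.05x would need `keep_fraction 3340.05 100` = 0.030 of its reads to land near 100x. That fraction could then be handed to a subsampling tool such as `seqtk sample` or `samtools view -s`.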

JingxuanChen7 commented 4 years ago

New McC run submitted.
I loaded my annotation and zoomed in to the ChrX region where Ty1 exists in the D1 samples, the same region as the screenshot shown here: https://github.com/hollygene/TE_MA/issues/2#issuecomment-586494880
The gene annotation is consistent with the experiment. I have no idea why the gene name for CDC8 does not show correctly in IGV, but the gene is actually well-annotated.

(screenshot: D1Ty1with_Anno)

hollygene commented 4 years ago

Thanks!

Do we know if the mislabeled D20-A has the Ty1 in Chrom X like the D1 samples?

JingxuanChen7 commented 4 years ago

No, unfortunately. It is somewhere near this location, but for sure not between RAD7 and CDC8.

JingxuanChen7 commented 4 years ago

But I think D1-A should be D20. I checked other locations, and they are consistent.

JingxuanChen7 commented 4 years ago

New McC run finished. The barplot is extremely large, so please zoom in in order to see sample names clearly.

(barplot: all_internal)

hollygene commented 4 years ago

Thank you @JingxuanChen7!

This is what it looks like to me:

HM-D20-4 is actually D0/H0 (TBD)
HM-H0-33 is actually a D1 sample

D20-A and D1-A are swapped: D20-A is actually D1-A, and D1-A is actually D20-A.

HM-D20-A and HM-D0-A are swapped: HM-D20-A is actually HM-D0-A, and HM-D0-A is actually HM-D20-A.

Does this new run include the masking of the MAT locus?

cbergman commented 4 years ago

@hollygene I agree with your interpretation of the potential sample swaps. I am a bit concerned that parental samples have been swapped in both the new and old runs. Is this issue severe enough to warrant resequencing the parental strains again?
This run of the pipeline did not use a modified version of the reference genome that masks the mating type locus. We still need to think about this experiment more and make detailed predictions about what the expected outcomes are.
In addition to modifying the reference genome to determine if the strains are diploid or haploid, I think we should also change the TE library used for estimating copy number. The Ty1-H3 reference element has part of Ty2 in its POL region, which is leading to a false positive signal of Ty2 in the data. I think the easiest way to do this is by eliminating Ty2 from the TE library (which is valid since Spar doesn't have Ty2).
@hollygene, have you run SNP calling on the BAM files produced by @JingxuanChen7 and compared them to the SNPs obtained from your BAM files? I think this is a necessary analysis to do so we can decide if one set of BAM files can be used for the whole project (or not).
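Dropping Ty2 from the TE library, as proposed above, amounts to filtering one record out of a FASTA file. A minimal sketch with awk is below. The function name is hypothetical, and the record name to pass in (for example `TY2`) is an assumption: it has to match whatever the header is actually called in the library file.

```bash
# drop_fasta_records NAME < in.fasta > out.fasta
# Removes every record whose header's first word (after ">") equals NAME.
drop_fasta_records() {
    awk -v name="$1" '
        /^>/ {
            id = substr($1, 2)      # first word of the header, without ">"
            keep = (id != name)
        }
        keep { print }
    '
}
```

Usage would be along the lines of `drop_fasta_records TY2 < te_library.fasta > te_library.noTy2.fasta`, where both file names are placeholders.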
