hollygene / TE_MA

S. paradoxus TE MA experiment

Paradoxus MA Information #2

Open hollygene opened 4 years ago

hollygene commented 4 years ago

Whole-genome sequencing of S. paradoxus mutation accumulation lines:

DNA extracted using the Zymo YeaSTAR Genomic DNA kit
Libraries prepped using a homemade protocol (no kit; I can find out more info if needed, it is probably published)
Sequenced on a NovaSeq S4 flow cell, PE 150, at Genewiz
500 GB total sequencing data

Location of fastq files on Sapelo: /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq (I also have backups of this on the /project/ folder for our lab, and on an external hard drive)

So one potential problem I have found is that my samples have varying depth of coverage - I guess because they were pooled. My main concern is that my ancestral strains have ~25x coverage, while the progenitor lines have anywhere from 150x-2000x(!).
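
To put rough numbers on the coverage gap, mean depth can be approximated as 2 × read pairs × read length / genome size. The sketch below assumes PE 150 reads (from the run description above) and takes the genome size as an input; the function names are made up for illustration and are not part of any pipeline used in this thread.

```bash
#!/usr/bin/env bash
# Rough depth estimate for paired-end data: 2 * pairs * read_length / genome_size.
# usage: estimate_depth READ_PAIRS READ_LEN GENOME_SIZE_BP
estimate_depth() {
    awk -v n="$1" -v len="$2" -v g="$3" \
        'BEGIN { printf "%.1f\n", (2 * n * len) / g }'
}

# Fraction of reads to keep to bring CURRENT_DEPTH down to TARGET_DEPTH (capped at 1).
# usage: downsample_fraction CURRENT_DEPTH TARGET_DEPTH
downsample_fraction() {
    awk -v cur="$1" -v tgt="$2" \
        'BEGIN { f = tgt / cur; if (f > 1) f = 1; printf "%.4f\n", f }'
}

# Count read pairs in a gzipped R1 file (4 lines per FASTQ record).
# usage: count_pairs FILE_R1.fastq.gz
count_pairs() {
    gzip -dc -- "$1" | awk 'END { printf "%d\n", NR / 4 }'
}
```

For example, `downsample_fraction 2000 25` prints `0.0125`, i.e. keeping about 1.25% of the reads from a 2000x sample would bring it near the ~25x of the ancestors. Whether downsampling is appropriate here is a separate question.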

cbergman commented 4 years ago

I've made reasonable progress on polishing the 337 genome, but I stalled out and have handed this over to Jingxuan. Can you please invite @JingxuanChen7 to this issue? Thanks!

hollygene commented 4 years ago

Absolutely!

I should be receiving new (higher read coverage) data for the ancestors, hopefully by the end of this week. I will let you know when I get it and give you the path to it on Sapelo!

Thanks!

cbergman commented 4 years ago
hollygene commented 4 years ago

@cbergman @JingxuanChen7 Thank you so much! If you could do the mtDNA gene annotation as well, that would be great. The rest aren't particularly useful for this project.

JingxuanChen7 commented 4 years ago

The genome after scaffolding and removing short contigs is here: /scratch/jc33471/pilon/337/annotation/genome.337.fasta

cbergman commented 4 years ago
JingxuanChen7 commented 4 years ago
cbergman commented 4 years ago
hollygene commented 4 years ago

READ1="/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R1_001.fastq"
READ2="/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R2_001.fastq"

These files are still in the same directory.

The rest of my raw data is also currently in this directory, but I am going to change that tomorrow and will post the new locations and directory structure!

JingxuanChen7 commented 4 years ago
hollygene commented 4 years ago

This was so fast, thank you Jingxuan! @cbergman Does this mean these files are ready, or should I wait to align?

Thanks!

cbergman commented 4 years ago
hollygene commented 4 years ago

Awesome!

I tried viewing them with less but it says permission denied

cbergman commented 4 years ago

Ok, it looks like a permissions issue on /work. I don't see an easy fix for this and will definitely need to work with GACRC to find a solution (something along the lines of you joining our unix group). If you want to get started ASAP, use the files in @JingxuanChen7's directories. They are identical:

[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/genome.337.fasta /work/cmblab/cbergman/337/annotation/genome.337.fasta
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/final_output/337.centromere.gff3 /work/cmblab/cbergman/337/annotation/final_output/337.centromere.gff3
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/final_output/337.mitochondrial_gene.updated.gff3 /work/cmblab/cbergman/337/annotation/final_output/337.mitochondrial_gene.updated.gff3
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/final_output/337.nuclear_gene.updated.gff3 /work/cmblab/cbergman/337/annotation/final_output/337.nuclear_gene.updated.gff3
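
Because `diff` prints nothing when two files match, the silent output above is the confirmation that the copies are identical. A minimal sketch of a wrapper that states the result explicitly, using `cmp -s`; the function name `same_file` is hypothetical:

```bash
#!/usr/bin/env bash
# Report explicitly whether two files are byte-identical.
# cmp -s prints nothing and returns 0 only when the files match.
# usage: same_file FILE_A FILE_B
same_file() {
    if cmp -s -- "$1" "$2"; then
        echo "identical: $1 $2"
    else
        echo "DIFFERENT or unreadable: $1 $2"
        return 1
    fi
}
```

Called with the two genome.337.fasta paths above, this would print one line per pair instead of relying on empty output.
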
JingxuanChen7 commented 4 years ago
hollygene commented 4 years ago

I'm not positive I'm using the right commands to test this, but I'm still getting permission denied:

hcm14449@sapelo2-sub2 ~$ ls /scratch/jc33471/pilon/337/annotation/
ls: cannot access /scratch/jc33471/pilon/337/annotation/: Permission denied

hcm14449@sapelo2-sub2 ~$ less /scratch/jc33471/pilon/337/annotation/genome.337.fasta
/scratch/jc33471/pilon/337/annotation/genome.337.fasta: Permission denied
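
A "Permission denied" like this can come from any directory along the path, not only from the file itself. A small sketch for listing the permissions of each path component, assuming nothing beyond standard `ls` and `dirname`; the function name `path_perms` is made up for illustration:

```bash
#!/usr/bin/env bash
# Print the permissions of a path and each of its parent directories.
# A missing r or x bit on any parent is enough to cause "Permission denied".
# usage: path_perms /absolute/path
path_perms() {
    local p="$1"
    while :; do
        ls -ld -- "$p" 2>&1
        if [ "$p" = "/" ] || [ "$p" = "." ]; then
            break
        fi
        p=$(dirname -- "$p")
    done
}
```

Running it on the path above would show the first component that the current user cannot read or traverse.
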
cbergman commented 4 years ago
JingxuanChen7 commented 4 years ago
hollygene commented 4 years ago

Awesome, this works, thank you!

cbergman commented 4 years ago
hollygene commented 4 years ago

Hi, yes, sorry to take so long! ALL of the fastq files are in this directory:
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/

They are also separated by strain:
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/H0
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/D0
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/D1
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/D20

I realize this is really redundant, but I was worried about accidentally deleting some of them, so I have copied them to a lot of places (and I have backups on external drives).

The individual strain directories contain all of the files that I will be using in my final analysis. The ../00_fastq/ directory contains all of the data I have - including samples that have been resequenced.

I hope that makes sense! I probably will need to make these readable to you guys, I'll try to do that now but if it doesn't work let me know!

The sample names are in the file names. Thank you so much!!

hollygene commented 4 years ago

In case giving y'all access to the /project/ folder isn't possible, these files are also in my /scratch/ directory:
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/D0
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/D1
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/D20

JingxuanChen7 commented 4 years ago
cbergman commented 4 years ago
hollygene commented 4 years ago

Okay, that sounds great! I can come down today around 1 if that still works for you!

hollygene commented 4 years ago

@cbergman @JingxuanChen7

It looks like the gzipping is all done! I'm not currently in my lab, but when I get back to my notes I will start on the next step and get them transferred to /scratch/.

cbergman commented 4 years ago
hollygene commented 4 years ago

Successfully got everything read-only:

total 1024
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 12:19 D20
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:21 H0
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:25 D0
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:30 D1
dr-xr-xr-x. 3 hcm14449 dwhlab 53248 Feb  7 16:06 ..
dr-xr-xr-x. 6 hcm14449 dwhlab 53248 Feb  7 16:06 .
[hcm14449@xfer3 00_fastq]$ pwd
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq
[hcm14449@xfer3 00_fastq]$

Also successfully rsync'd to /scratch:

[hcm14449@xfer3 00_fastq]$ ls -lrta
total 56
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb  7 12:19 D20
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb  7 13:21 H0
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb  7 13:25 D0
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb  7 13:30 D1
drwxr-xr-x. 3 hcm14449 dwhlab  4096 Feb  7 16:02 ..
dr-xr-xr-x. 6 hcm14449 dwhlab  4096 Feb  7 16:06 .
[hcm14449@xfer3 00_fastq]$ pwd
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq
[hcm14449@xfer3 00_fastq]$ 

For example:

[hcm14449@xfer3 00_fastq]$ ls -lrta D20
total 125869148
-r-xr-xr-x. 1 hcm14449 dwhlab  572729373 Feb  5 11:13 D20-44_R1.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab  601673638 Feb  5 11:13 D20-44_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab  518434568 Feb  5 11:13 D20-A_R1.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab  540250860 Feb  5 11:14 D20-A_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 1997108912 Feb  5 11:27 HM-D20-10_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 2040745262 Feb  5 11:27 HM-D20-10_R2_001.fastq.gz

I think you guys should have read access to these, but let me know if something doesn't work! Thanks!!
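
For reference, the two steps described above (making the data read-only, then mirroring it to /scratch) can be sketched as follows. This is an illustration under assumptions, not necessarily the exact commands that were run: the function names are hypothetical, and `a+rX` leaves regular files as r--r--r-- rather than the r-xr-xr-x shown in the listings.

```bash
#!/usr/bin/env bash
# Remove write permission for everyone and grant read (plus directory
# traversal) to everyone, recursively.
# usage: make_read_only DIR
make_read_only() {
    chmod -R a-w,a+rX -- "$1"
}

# Mirror SRC into DST, preserving permissions and timestamps.
# The trailing slashes copy the contents of SRC rather than SRC itself.
# usage: mirror_dir SRC DST
mirror_dir() {
    rsync -a -- "$1"/ "$2"/
}
```

Because `rsync -a` preserves permissions, the copy on /scratch ends up read-only as well, which matches the second listing.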

JingxuanChen7 commented 4 years ago

The McClintock coverage module has finished for all of the 337 sequencing data.

JingxuanChen7 commented 4 years ago
hollygene commented 4 years ago

Awesome, thank you @JingxuanChen7!

It looks like D20-A is actually D1, D1-A is actually D20, and D0-A is actually D1?

Is that what it looks like to y'all?

cbergman commented 4 years ago
hollygene commented 4 years ago

Here's an outline of the sequencing data I have:

Ancestors (Spike-In Data)

/project/dwhlab/Holly/TE_MA_Paradoxus/Paradoxus_MA/Anc_SpikeIns/Holly_gDNA
[hcm14449@xfer3 Holly_gDNA]$ ls -lrt
total 158432
-r-xr-xr-x. 1 hcm14449 dwhlab 34210410 May 23  2019 HM_D20_S15_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 28470836 May 23  2019 HMM_D0_S13_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 31904023 May 23  2019 HMM_D1_S14_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 33984257 May 23  2019 HM_H0_S16_R1_001.fastq.gz

First Run of Sample Data (Sep 2019)

/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3

Missing samples from this data:
H0: 3, 6, 25
D0: 8, 28, 29, 41, 45
D1: 1, 23, 34, 43, 47, 48
D20: 7

Resequenced Samples (Jan 2020):

/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/Holly

Duplicate Samples:
H0: A, 3, 4, 7, 13, 14
D0: A, 3, 30, 31, 34
D1: A, 6, 21, 22
D20: A, 44

Still Missing:
H0: 6
D0: 8, 41
D1: 47
D20: 7
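
A bookkeeping list like this can be cross-checked against the files on disk. The sketch below is an assumption-laden illustration: the function name `missing_samples` is made up, the highest sample number (MAX) is not stated in this thread and must be supplied, and the pattern only assumes file names shaped like HM-D20-10_R1_001.fastq.gz or D20-44_R1.fq.gz.

```bash
#!/usr/bin/env bash
# List sample numbers in 1..MAX that have no R1 file for STRAIN in any DIR.
# Matches names such as HM-D20-10_R1_001.fastq.gz and D20-44_R1.fq.gz.
# usage: missing_samples STRAIN MAX DIR [DIR...]
missing_samples() {
    local strain="$1" max="$2" i dir found f
    shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        found=no
        for dir in "$@"; do
            # An unmatched glob stays literal, so test that the file exists.
            for f in "$dir"/*"${strain}-${i}"_R1*; do
                [ -e "$f" ] && found=yes
            done
        done
        [ "$found" = no ] && echo "$i"
        i=$((i + 1))
    done
    return 0
}
```

Passing both the first-run and the resequenced directories as DIR arguments would report only the samples absent from both.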

Oh, also - the reason I had new files in the /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0 directory was that I found I was missing two samples (H0 10 and H0 11) from that folder that were actually in the /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3 directory.

I used cp to copy those two from /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3 to /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/H0, and then rsync to sync that with /scratch/. (Hope that was okay to do.)

I just now made /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/H0 read-only again

[hcm14449@xfer3 00_fastq]$ chmod a-w H0
[hcm14449@xfer3 00_fastq]$ ls -lrta
total 1024
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 12:19 D20
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:25 D0
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb  7 13:30 D1
dr-xr-xr-x. 3 hcm14449 dwhlab 53248 Feb  7 16:06 ..
dr-xr-xr-x. 6 hcm14449 dwhlab 53248 Feb  7 16:06 .
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 13 14:31 H0

JingxuanChen7 commented 4 years ago

Note: I found that one job (HM-D1-45) failed in TE-locate, probably due to the extremely high genome depth (3340.05). The coverage module finished successfully for it, though.

hollygene commented 4 years ago

Thanks @JingxuanChen7!

From our original spike-in data of the ancestors, it looks like D1 and D20 were mislabeled. H0 and D0 looked correct, but we should probably still make sure they are haploid and diploid, respectively.

ALL of the FASTA files are now in this directory:

/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/AllFastas

cbergman commented 4 years ago
JingxuanChen7 commented 4 years ago

[Image: D1Ty1with_Anno]

hollygene commented 4 years ago

Thanks!

Do we know if the mislabeled D20-A has the Ty1 in Chrom X like the D1 samples?

JingxuanChen7 commented 4 years ago

[Image: all_internal]

hollygene commented 4 years ago

Thank you @JingxuanChen7!

This is what it looks like to me:

HM-D20-4 is actually D0/H0 (TBD)
HM-H0-33 is actually a D1 sample

D20-A and D1-A are swapped - D20-A is actually D1-A, and D1-A is actually D20-A.

HM-D20-A and HM-D0-A are swapped: HM-D20-A is actually HM-D0-A, and HM-D0-A is actually HM-D20-A.
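
To keep these proposed corrections in one place for downstream scripts, they could be written as a lookup. This is only a sketch of the tentative reading above, not a confirmed relabeling; the function name `corrected_label` is hypothetical.

```bash
#!/usr/bin/env bash
# Map a label as sequenced to the tentative corrected label proposed above.
# Labels not listed are returned unchanged.
# usage: corrected_label LABEL
corrected_label() {
    case "$1" in
        D20-A)    echo "D1-A" ;;
        D1-A)     echo "D20-A" ;;
        HM-D20-A) echo "HM-D0-A" ;;
        HM-D0-A)  echo "HM-D20-A" ;;
        HM-H0-33) echo "D1" ;;
        HM-D20-4) echo "D0/H0 (TBD)" ;;
        *)        echo "$1" ;;
    esac
}
```

Keeping the original file names untouched and applying the mapping only at analysis time leaves the raw data traceable if any of these calls change.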

Does this new run include the masking of the MAT locus?

cbergman commented 4 years ago