Open hollygene opened 5 years ago
4 original strains:
Original strains used to create the 4 progenitors:
200 days of transfer, 1 transfer every other day
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R*
I've made reasonable progress on polishing the 337 genome, but I stalled out and have handed this over to Jingxuan. Can you please invite @JingxuanChen7 to this issue? Thanks!
Absolutely!
I should be receiving new (higher read coverage) data for the ancestors, hopefully by the end of this week. I will let you know when I get it and give you the path to it on Sapelo!
Thanks!
@cbergman @JingxuanChen7 Thank you so much! If you could do the mtDNA gene annotation as well, that would be great. The rest aren't particularly useful for this project.
The genome after scaffolding and removing short contigs is here: /scratch/jc33471/pilon/337/annotation/genome.337.fasta
READ1="/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R1_001.fastq"
READ2="/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/HM-H0-A_R2_001.fastq"
These files are still in the same directory.
The rest of my raw data is also currently in this directory, but I am going to change that tomorrow and will post the new locations and directory structure!
/scratch/jc33471/pilon/337/annotation/genome.337.fasta
/scratch/jc33471/pilon/337/annotation/final_output
337.centromere.gff3 for centromere annotation,
337.mitochondrial_gene.updated.gff3 for chrMT annotation,
337.nuclear_gene.updated.gff3 for nuclear gene annotation.
This was so fast, thank you Jingxuan! @cbergman Does this mean these files are ready, or should I wait to align?
Thanks!
/work/cmblab/cbergman/337/annotation/genome.337.fasta
/work/cmblab/cbergman/337/annotation/final_output/337.centromere.gff3
/work/cmblab/cbergman/337/annotation/final_output/337.nuclear_gene.updated.gff3
/work/cmblab/cbergman/337/annotation/final_output/337.mitochondrial_gene.updated.gff3
Awesome!
I tried viewing them with less but it says permission denied
Ok, it looks like a permissions issue on /work. I don't see an easy fix for this and will definitely need to work with GACRC to find a solution (something along the lines of you joining our unix group). If you want to get started ASAP, use the files in @JingxuanChen7's directories. They are identical:
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/genome.337.fasta /work/cmblab/cbergman/337/annotation/genome.337.fasta
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/final_output/337.centromere.gff3 /work/cmblab/cbergman/337/annotation/final_output/337.centromere.gff3
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/final_output/337.mitochondrial_gene.updated.gff3 /work/cmblab/cbergman/337/annotation/final_output/337.mitochondrial_gene.updated.gff3
[cbergman@sapelo2] $ diff /scratch/jc33471/pilon/337/annotation/final_output/337.nuclear_gene.updated.gff3 /work/cmblab/cbergman/337/annotation/final_output/337.nuclear_gene.updated.gff3
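The four diff checks above could also be run as a single loop. This is only a minimal sketch, assuming a POSIX shell with `cmp` available; `compare_copies` is a hypothetical helper name, not something used in this thread:

```shell
# Hypothetical helper: compare the same relative files under two directory trees.
# Prints one line per file and returns non-zero if any pair differs or cannot be read.
compare_copies() {
    src_dir=$1
    dst_dir=$2
    shift 2
    status=0
    for rel in "$@"; do
        if cmp -s "$src_dir/$rel" "$dst_dir/$rel"; then
            echo "identical: $rel"
        else
            echo "DIFFERENT or unreadable: $rel"
            status=1
        fi
    done
    return "$status"
}

# Example with the paths from this thread:
# compare_copies /scratch/jc33471/pilon/337/annotation /work/cmblab/cbergman/337/annotation genome.337.fasta final_output/337.centromere.gff3
```

Like `diff` printing nothing, an "identical" line means the two copies match byte for byte.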
/scratch/jc33471/pilon/337/ dir read-only (chmod a-w -R /scratch/jc33471/pilon/337)? Thanks!
jc33471@sapelo2-sub1 ~$ ll /scratch/jc33471/pilon/337
total 108
dr-xr-xr-x. 9 jc33471 cmblab 4096 Jan 15 15:33 ./
drwxr-xr-x. 4 jc33471 cmblab 4096 Jan 24 15:35 ../
dr-xr-xr-x. 7 jc33471 cmblab 4096 Jan 24 12:56 annotation/
dr-xr-xr-x. 11 jc33471 cmblab 12288 Jan 15 16:25 annotation_test/
dr-xr-xr-x. 3 jc33471 cmblab 4096 Jan 15 15:31 data/
dr-xr-xr-x. 2 jc33471 cmblab 4096 Dec 20 14:07 mummer/
dr-xr-xr-x. 2 jc33471 cmblab 20480 Dec 20 12:08 pilon/
dr-xr-xr-x. 5 jc33471 cmblab 4096 Jan 2 14:40 ragoo/
dr-xr-xr-x. 19 jc33471 cmblab 16384 Jan 15 16:47 scripts/
I'm not positive I'm using the right commands to test this, but I'm still getting permission denied:
hcm14449@sapelo2-sub2 ~$ ls /scratch/jc33471/pilon/337/annotation/
ls: cannot access /scratch/jc33471/pilon/337/annotation/: Permission denied
hcm14449@sapelo2-sub2 ~$ less /scratch/jc33471/pilon/337/annotation/genome.337.fasta
/scratch/jc33471/pilon/337/annotation/genome.337.fasta: Permission denied
chmod a+r /scratch/jc33471/
chmod a+x /scratch/jc33471/
jc33471@sapelo2-sub2 jingxuan$ chmod a+r /scratch/jc33471
jc33471@sapelo2-sub2 jingxuan$ chmod a+x /scratch/jc33471
Awesome, this works, thank you!
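For anyone hitting the same error: opening a file requires execute (traverse) permission on every directory along the path, which is why making only /scratch/jc33471/pilon/337 readable was not enough until /scratch/jc33471 itself got a+x. Below is a rough sketch for spotting the blocking directory, assuming a POSIX shell and the usual `ls -ld` long format. `check_traversal` is a made-up helper name, and it has to be run by someone who can already enter the directory (e.g. the owner):

```shell
# Hypothetical helper: walk from a directory up to / and report every
# directory that lacks the execute (traverse) bit for "other" users.
check_traversal() {
    dir=$(cd "$1" && pwd) || return 2
    blocked=0
    while :; do
        # The first 10 characters of `ls -ld` are the mode string, e.g. drwxr-xr-x
        perms=$(ls -ld "$dir" | cut -c1-10)
        case "$perms" in
            ?????????[xt]) ;;                        # others can traverse this one
            *) echo "no o+x: $dir"; blocked=1 ;;     # others are stopped here
        esac
        [ "$dir" = "/" ] && break
        dir=$(dirname "$dir")
    done
    return "$blocked"
}

# Example with a path from this thread:
# check_traversal /scratch/jc33471/pilon/337/annotation
```

This only looks at the "other" permission bits, which is the relevant case when the reader is not in the owning unix group.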
Hi, yes sorry to take so long!
ALL of the fastq files are in this directory:
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/
They are also separated into each strain:
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/H0
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/D0
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/D1
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/D20
I realize this is really redundant, but I was kind of worried about accidentally deleting some of them, so I have copied them to a ton of places (and I have backups on external drives).
The individual strain directories contain all of the files that I will be using in my final analysis. The ../00_fastq/ directory contains all of the data I have - including samples that have been resequenced.
I hope that makes sense! I probably will need to make these readable to you guys, I'll try to do that now but if it doesn't work let me know!
The sample names are in the file names. Thank you so much!!
In case giving y'all access to the /project/ folder isn't possible, these files are also in my /scratch/ directory:
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/D0
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/D1
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/D20
/scratch folders. Thanks!
Okay, that sounds great! I can come down today around 1 if that still works for you!
@cbergman @JingxuanChen7
It looks like the gzipping is all done! I'm not currently in my lab, but when I get back to my notes I will start on the next step and get them transferred to /scratch/.
chmod a-w -R /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/
rsync -av /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/ /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/ first (i.e. rm -rf /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/) & make sure the trailing slash is on the sending path but not on the destination path. You might need to make the destination folder first (i.e. mkdir -p /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq) before doing the rsync.
Successfully got everything read-only:
total 1024
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 7 12:19 D20
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 7 13:21 H0
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 7 13:25 D0
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 7 13:30 D1
dr-xr-xr-x. 3 hcm14449 dwhlab 53248 Feb 7 16:06 ..
dr-xr-xr-x. 6 hcm14449 dwhlab 53248 Feb 7 16:06 .
[hcm14449@xfer3 00_fastq]$ pwd
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq
[hcm14449@xfer3 00_fastq]$
Also successfully rsync'd to /scratch:
[hcm14449@xfer3 00_fastq]$ ls -lrta
total 56
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb 7 12:19 D20
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb 7 13:21 H0
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb 7 13:25 D0
dr-xr-xr-x. 2 hcm14449 dwhlab 12288 Feb 7 13:30 D1
drwxr-xr-x. 3 hcm14449 dwhlab 4096 Feb 7 16:02 ..
dr-xr-xr-x. 6 hcm14449 dwhlab 4096 Feb 7 16:06 .
[hcm14449@xfer3 00_fastq]$ pwd
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq
[hcm14449@xfer3 00_fastq]$
i.e.
[hcm14449@xfer3 00_fastq]$ ls -lrta D20
total 125869148
-r-xr-xr-x. 1 hcm14449 dwhlab 572729373 Feb 5 11:13 D20-44_R1.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 601673638 Feb 5 11:13 D20-44_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 518434568 Feb 5 11:13 D20-A_R1.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 540250860 Feb 5 11:14 D20-A_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 1997108912 Feb 5 11:27 HM-D20-10_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 2040745262 Feb 5 11:27 HM-D20-10_R2_001.fastq.gz
I think you guys should have read access to these, but let me know if something doesn't work! Thanks!!
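A note on the trailing-slash advice for the rsync step earlier in the thread: a trailing slash on the sending path tells rsync to copy the contents of that directory, while leaving it off copies the directory itself, which would create an extra nested 00_fastq/ level in the destination. This is a small sketch of that convention, assuming rsync is installed; `sync_dir_contents` is a hypothetical wrapper, and it uses `-a` where the thread used `-av` (`-v` only adds verbose output):

```shell
# Hypothetical wrapper around the rsync call discussed in this thread.
sync_dir_contents() {
    src=${1%/}/          # force exactly one trailing slash on the sending path
    dst=${2%/}           # and none on the destination path, as advised above
    mkdir -p "$dst"      # make the destination folder first
    rsync -a "$src" "$dst"
}

# Example with the paths from this thread:
# sync_dir_contents /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq
```

With this, the H0, D0, D1 and D20 folders land directly inside the destination 00_fastq directory.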
/scratch/jc33471/paradoxusHolly/run0210_*
.tsv files here https://github.com/bergmanlab/jingxuan/tree/master/data/paradoxus/337_outputs
Awesome, thank you @JingxuanChen7!
It looks like D20-A is actually D1, D1-A is actually D20, and D0-A is actually D1?
Is that what it looks like to y'all?
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0/
[cbergman@sapelo2] $ ls -lrt /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0/ | tail
-r-xr-xr-x. 1 hcm14449 dwhlab 481023003 Feb 5 11:24 H0-7_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 626363773 Feb 5 11:25 H0-A_R1.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 668478601 Feb 5 11:25 H0-A_R2.fq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 2519083690 Feb 11 13:50 HM-H0-10_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 2582116750 Feb 11 13:50 HM-H0-10_R2_001.fastq.gz
-rw-r--r--. 1 hcm14449 dwhlab 582 Feb 13 13:27 *_BR.sh
-rw-------. 1 hcm14449 dwhlab 0 Feb 13 13:43 *_BR.o2001728
-rw-------. 1 hcm14449 dwhlab 3420 Feb 13 13:43 *_BR.e2001728
-r-xr-xr-x. 1 hcm14449 dwhlab 1056087279 Feb 13 14:31 HM-H0-11_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 1087724459 Feb 13 14:31 HM-H0-11_R2_001.fastq.gz
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/ and coordinate about which bam files should be used in the project.
Here's an outline of the sequencing data I have:
Ancestors (Spike-In Data)
/project/dwhlab/Holly/TE_MA_Paradoxus/Paradoxus_MA/Anc_SpikeIns/Holly_gDNA
[hcm14449@xfer3 Holly_gDNA]$ ls -lrt
total 158432
-r-xr-xr-x. 1 hcm14449 dwhlab 34210410 May 23 2019 HM_D20_S15_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 28470836 May 23 2019 HMM_D0_S13_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 31904023 May 23 2019 HMM_D1_S14_R1_001.fastq.gz
-r-xr-xr-x. 1 hcm14449 dwhlab 33984257 May 23 2019 HM_H0_S16_R1_001.fastq.gz
First Run of Sample Data (Sep 2019)
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3
Missing samples from this data:
H0: 3, 6, 25
D0: 8, 28, 29, 41, 45
D1: 1, 23, 34, 43, 47, 48
D20: 7
Resequenced Samples (Jan 2020):
/project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/Holly
Duplicate Samples:
H0: A, 3, 4, 7, 13, 14
D0: A, 3, 30, 31, 34
D1: A, 6, 21, 22
D20: A, 44
Still Missing:
H0: 6
D0: 8, 41
D1: 47
D20: 7
Oh, also - the reason I had new files in the /scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq/H0 directory was that I found I was missing two samples (H0 10 and H0 11) from that folder that were actually in the /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3 directory.
I used cp to copy those two from /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3 to /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/H0 and rsync to sync that with /scratch/ (hope that was okay to do).
I just now made /project/dwhlab/Holly/TE_MA_Paradoxus/Illumina_Data/GW_run3/00_fastq/H0 read-only again:
[hcm14449@xfer3 00_fastq]$ chmod a-w H0
[hcm14449@xfer3 00_fastq]$ ls -lrta
total 1024
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 7 12:19 D20
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 7 13:25 D0
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 7 13:30 D1
dr-xr-xr-x. 3 hcm14449 dwhlab 53248 Feb 7 16:06 ..
dr-xr-xr-x. 6 hcm14449 dwhlab 53248 Feb 7 16:06 .
dr-xr-xr-x. 2 hcm14449 dwhlab 16384 Feb 13 14:31 H0
I visualized TE locations in the D20 replicates using the output of TE-locate. Here are a few examples that may support the mislabeling.
Chr 5
Chr 7
Chr 10
At all 3 positions, HM-D20-4 and D20-A do not have the insertion where they should.
For the D1 replicates, it is really hard to find locations like the ones above. This location on Chr 10 might be an insertion that occurred during the experiment?
I would point out D1-A here because it does not have a TE where it should, but it has a TE at a location similar to the D20 samples (see Chr 10 in D20). Its copy number profile also indicates a much higher level than the other D1 samples.
Note: one TE-locate job failed (HM-D1-45), probably due to the extremely high genome depth (3340.05). The coverage module finished successfully, though.
Thanks @JingxuanChen7!
For D1 samples, I expect some level of transposition so I wouldn't be surprised if there were more than 1 Ty1 in the progenitors - however, the ancestor should only have 1
The Ty1 in D1 is on ChrX between RAD7 and CDC8 and next to a Gly tRNA gene
I don't know the locations of D20's Ty1s, I just know that there are approximately 20 of them
From our original spike-in data of the ancestors, it looks like D1 and D20 were mislabeled. H0 and D0 looked correct, but we should probably still make sure they are haploid and diploid, respectively.
ALL of the FASTA files are now in this directory:
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/AllFastas
Thanks!
Do we know if the mislabeled D20-A has the Ty1 in Chrom X like the D1 samples?
Thank you @JingxuanChen7!
This is what it looks like to me:
HM-D20-4 is actually D0/H0 (TBD)
HM-H0-33 is actually a D1 sample
D20-A and D1-A are swapped: D20-A is actually D1-A, and D1-A is actually D20-A.
HM-D20-A and HM-D0-A are swapped: HM-D20-A is actually HM-D0-A, and HM-D0-A is actually HM-D20-A.
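If it helps with bookkeeping downstream, the proposed reassignments could be kept in one small lookup. This is only a sketch of the tentative mapping stated in this comment, not a confirmed relabeling. `proposed_label` is a hypothetical function name, and the replicate identity of HM-H0-33 within D1 is not stated in the thread, so it maps to plain "D1":

```shell
# Tentative relabeling proposed in this thread (NOT yet confirmed).
# Prints the proposed true identity for a file label; labels not
# discussed in the thread are echoed back unchanged.
proposed_label() {
    case "$1" in
        HM-D20-4)  echo "D0/H0 (TBD)" ;;
        HM-H0-33)  echo "D1" ;;
        D20-A)     echo "D1-A" ;;
        D1-A)      echo "D20-A" ;;
        HM-D20-A)  echo "HM-D0-A" ;;
        HM-D0-A)   echo "HM-D20-A" ;;
        *)         echo "$1" ;;
    esac
}

# Example:
# proposed_label D20-A
```

Keeping the mapping in one place means any correction after further checks only has to be made once.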
Does this new run include the masking of the MAT locus?
Whole-genome sequencing of S. paradoxus mutation accumulation lines
DNA extracted using Zymo YeaSTAR Genomic DNA kit
Libraries prepped using homemade protocol (no kit, can find out more info if needed, probably published)
Sequenced on NovaSeq S4 flow cell PE 150 at Genewiz
500GB total sequencing data
Location of fastq files on Sapelo:
/scratch/hcm14449/TE_MA_Paradoxus/Illumina_Data/IL_Data/GW_run3/00_fastq
(I also have backups of this on the /project/ folder for our lab, and on an external hard drive)
So one potential problem I have found is that my samples have varying depth of coverage - I guess because they were pooled. My main concern is that my ancestral strains have ~25x coverage, while the progenitor lines have anywhere from 150x to 2000x(!).
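For a quick sanity check of these numbers, expected depth can be approximated as total sequenced bases divided by genome size. This is a back-of-the-envelope sketch, assuming `awk` is available; `estimate_depth` is a hypothetical helper, and the 12 Mb genome size in the example is a rough assumption for S. paradoxus rather than a value from this thread:

```shell
# Rough depth estimate: (read pairs x 2 reads per pair x read length) / genome size.
estimate_depth() {
    awk -v pairs="$1" -v len="$2" -v genome="$3" \
        'BEGIN { printf "%.1f\n", pairs * 2 * len / genome }'
}

# Example: 1,000,000 pairs of 150 bp reads on an assumed 12 Mb genome
# estimate_depth 1000000 150 12000000    # prints 25.0
```

This ignores duplicates, unmapped reads and uneven coverage, so it gives only an upper-bound style estimate to compare against what the coverage module reports.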