hputnam / FROGER


Info on CpG O/E ratios in reseq data? #3

Open sr320 opened 5 years ago

sr320 commented 5 years ago

Methods

Resource Development: Canonical Genes

To get a FASTA file of all genes in the oyster genome, gene sequences were extracted from the reference genome with bedtools getfasta:

bedtools getfasta \
-fi GCF_002022765.2_C_virginica-3.0_genomic.fa \
-bed ref_C_virginica-3.0_Gnomon_gene.gff3 \
> ../data/ref_C_virginica-3.0_Gnomon_gene.fa

This fasta file is available: https://d.pr/f/nfzK36 (400MB)

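This gene-level (genomic) FASTA file was then annotated with blastx against the UniProt Swiss-Prot database, keeping the hit with the lowest e-value for each gene: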

blastx  \
-query ref_C_virginica-3.0_Gnomon_gene.fa \
-db uniprot_sprot_080917 \
-evalue 1E-05 \
-outfmt 6 \
-num_threads 28 \
-out Cv_gene_sprot.05.blastout

gsort -k1,1 -k11,11g Cv_gene_sprot.05.blastout \
| gsort -u -k1,1 --merge  > filtered.Cv_gene_sprot.05.blastout
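The two-step sort above can be tried on a small table to see what it keeps. This is a minimal sketch, assuming GNU `sort` is available as `sort` (the issue uses `gsort`, the name GNU sort usually has on macOS). The function name `best_hits` and the three toy hits are hypothetical. The `g` modifier matters because e-values such as `1e-50` are written in scientific notation, which general-numeric ordering is meant to handle.

```shell
# Keep one hit per query: the one with the smallest e-value (column 11).
best_hits() {
  # First sort: by query ID, then by e-value ascending (general numeric).
  # Second sort: input is already sorted, so --merge with -u keeps the
  # first line of each run of identical query IDs.
  sort -k1,1 -k11,11g "$1" | sort -u -k1,1 --merge
}

# Hypothetical BLAST table in -outfmt 6 layout (12 tab-separated columns).
toy=$(mktemp)
printf 'geneA\tsp|ACC1|X\t50.0\t100\t50\t0\t1\t300\t1\t100\t1e-20\t90.0\n'  > "$toy"
printf 'geneA\tsp|ACC2|Y\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-50\t200\n' >> "$toy"
printf 'geneB\tsp|ACC3|Z\t60.0\t100\t40\t0\t1\t300\t1\t100\t2e-10\t70.0\n' >> "$toy"

# Show query, subject, and e-value of the retained hits.
best=$(best_hits "$toy" | cut -f1,2,11)
printf '%s\n' "$best"
rm -f "$toy"
```

In this toy input geneA has two hits, and only the one with the smaller e-value should remain.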

blastout

For our purposes, GO Slim information is needed; it was generated by joining the filtered blast output with UniProt tables.
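The issue does not show how the join was done, so the following is only a sketch of one way such a join could look. Everything about the lookup tables is an assumption: the file names, their column layouts, the function name `go_slim_join`, the accession `ACC1`, and the placeholder ID `GO:9999999`. Only the output layout (query, GO ID, GO Slim term, aspect) follows the file shown in this issue.

```shell
go_slim_join() {
  # $1 best-hit BLAST table (-outfmt 6; column 2 looks like sp|ACCESSION|ENTRY)
  # $2 assumed table: accession <TAB> GO IDs separated by "; "
  # $3 assumed table: GO ID <TAB> GO Slim term <TAB> aspect (P, F, or C)
  awk -F'\t' -v OFS='\t' '
    FNR == 1  { file++ }                        # count which input file we are in
    file == 1 { slim[$1] = $2 OFS $3; next }    # GO ID -> slim term and aspect
    file == 2 { gos[$1] = $2; next }            # accession -> list of GO IDs
    {
      split($2, part, "|")                      # part[2] is the accession
      n = split(gos[part[2]], id, /; */)
      for (i = 1; i <= n; i++)
        if (id[i] in slim) print $1, id[i], slim[id[i]]
    }
  ' "$3" "$2" "$1"
}

# Hypothetical toy inputs.
tmp=$(mktemp -d)
printf 'geneA\tsp|ACC1|X\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-50\t200\n' > "$tmp/blast.tab"
printf 'ACC1\tGO:0000002; GO:9999999\n' > "$tmp/acc2go.tab"
printf 'GO:0000002\tcell organization and biogenesis\tP\n' > "$tmp/go2slim.tab"

slim_out=$(go_slim_join "$tmp/blast.tab" "$tmp/acc2go.tab" "$tmp/go2slim.tab")
printf '%s\n' "$slim_out"
rm -rf "$tmp"
```

GO IDs without an entry in the slim table (here the placeholder `GO:9999999`) are dropped, so each output line pairs one query with one GO Slim term.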

The file with GO Slim information:

head Blastquery-GOslim.tab
NC_035786.1:35005823-35053811   GO:0000002  cell organization and biogenesis    P
NC_035784.1:81157412-81161455   GO:0000002  cell organization and biogenesis    P
NC_035785.1:6920706-6928297 GO:0000002  cell organization and biogenesis    P
NC_035788.1:25841487-25842908   GO:0000002  cell organization and biogenesis    P
NC_035788.1:84032938-84038358   GO:0000002  cell organization and biogenesis    P
NC_035788.1:84034120-84038279   GO:0000002  cell organization and biogenesis    P
NC_035788.1:65057880-65082026   GO:0000002  cell organization and biogenesis    P
NC_035781.1:55490481-55502256   GO:0000002  cell organization and biogenesis    P
NC_035782.1:1147125-1157055 GO:0000002  cell organization and biogenesis    P
NC_035784.1:87478898-87496133   GO:0000002  cell organization and biogenesis    P

Resource Development: Consensus Gene Sequence for 91 samples

Using Combined.SNP.TRSdp5g95FnDNAmaf05.vcf.gz (31GB) (link), separate VCF files were derived for each library. (Details)

Full genome sequences were generated using the individual VCF file from each library:

find Atumefaciens/20190103_Cvirginica_vcf_splitting/*vcf.gz \
| xargs basename -s .vcf.gz | xargs -I{} /bcftools consensus \
-f GCF_002022765.2_C_virginica-3.0_genomic.fa \
20190103_Cvirginica_vcf_splitting/{}.vcf.gz \
-o /Volumes/Serine/wd/19-01-08/{}.fa

Gene-level FASTA files were then extracted for all samples:

find /Volumes/Serine/wd/19-01-08/*.fa \
| xargs basename -s .fa | xargs -I{} bedtools getfasta \
-fi /Volumes/Serine/wd/19-01-08/{}.fa \
-bed ref_C_virginica-3.0_Gnomon_gene.gff3 \
-fo /Volumes/Serine/wd/19-01-08/{}_GENE.fa
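Both per-sample commands above use the same pattern: `find` lists the input files, `basename -s` strips the directory and suffix to leave a sample name, and `xargs -I{}` substitutes that name into one command per sample. A dry run with `echo` shows the commands that would be issued without running them. This is a sketch on hypothetical files `s1.fa` and `s2.fa`; the `genes.gff3` name is also a placeholder.

```shell
# Create two empty placeholder genome files in a scratch directory.
workdir=$(mktemp -d)
touch "$workdir/s1.fa" "$workdir/s2.fa"

# Same find | basename | xargs pattern, but echo prints each command
# instead of executing bedtools.
dry_run=$(find "$workdir"/*.fa \
  | xargs basename -s .fa \
  | xargs -I{} echo bedtools getfasta -fi {}.fa -bed genes.gff3 -fo {}_GENE.fa)

printf '%s\n' "$dry_run"
rm -rf "$workdir"
```

Note that this pattern splits on whitespace, so it assumes file names without spaces.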

These FASTA files for all 91 samples are available here: both the full genome ({}.fa) and gene-level ({}_GENE.fa) sequences (jupyter notebook).


CpG Observed / Expected Ratio Calculations

This was determined for all genes in all 91 samples, and a single file with all of the data was created.
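The issue does not state which formula was used. A commonly used definition is O/E = (number of CpG / (number of C x number of G)) x (L^2 / (L - 1)), where L is the sequence length. The sketch below assumes that definition and applies it with awk to each record of a FASTA file such as a {}_GENE.fa. The function name `cpg_oe` and the toy sequences are hypothetical.

```shell
cpg_oe() {
  # Print: sequence ID, length, CpG count, CpG O/E for each FASTA record in $1.
  awk -v OFS='\t' '
    function report(    len, c, g, cg, oe) {
      len = length(seq)
      c  = gsub(/C/,  "&", seq)    # gsub returns the match count; "&" leaves seq unchanged
      g  = gsub(/G/,  "&", seq)
      cg = gsub(/CG/, "&", seq)
      if (c * g == 0 || len < 2) oe = "NA"    # ratio undefined without both C and G
      else oe = sprintf("%.4f", (cg / (c * g)) * (len * len / (len - 1)))
      print id, len, cg, oe
    }
    /^>/ { if (id != "") report(); id = substr($0, 2); seq = ""; next }
         { seq = seq toupper($0) }            # join wrapped sequence lines
    END  { if (id != "") report() }
  ' "$1"
}

# Hypothetical two-record FASTA; gene1 is wrapped over two lines.
fa=$(mktemp)
printf '>gene1\nACGTCG\nCG\n>gene2\nATATCC\n' > "$fa"
oe_table=$(cpg_oe "$fa")
printf '%s\n' "$oe_table"
rm -f "$fa"
```

For gene1 (ACGTCGCG: 3 C, 3 G, 3 CpG, length 8) this gives (3/9) x (64/7), about 3.0476; gene2 has no G, so its ratio is reported as NA. To build a single table across samples, the function could be run on each {}_GENE.fa with the sample name prepended to every line.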