Closed sr320 closed 7 years ago
Based on the goals listed in my repository, I will publish the following files as part of my class project:
- .tab file with genes differentially expressed in control vs. ocean acidification conditions and accompanying gene ontology information. Ideally, this file will also highlight variations in differential expression based on gonad sex.

The .tab file will be based on the following information that I will also publish:

- .tab file with genes differentially expressed in control vs. ocean acidification conditions
- .tab file with genes differentially expressed in male vs. female gonads
- .txt file with best matches and gene ontology information for all accessions
- .png that visually displays information in the .tab file

@sr320 what do you envision to be the sorts of files that should be published? Just raw data (or links to them) and the workflows used to generate subsequent files for analysis, or do you think the final project should include some of these analysis files?
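
As a rough illustration of how the combined .tab file described above might be assembled, here is a minimal Python sketch that joins a differential expression table with a gene ontology table. The column names (`gene_id`, `log2fc`, `go_terms`) and the demo rows are made up for illustration; the real files may be laid out differently.

```python
import csv
import io


def read_tab(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def merge_de_with_go(de_rows, go_rows, key="gene_id"):
    """Attach GO annotation to each differentially expressed gene.

    Genes without a GO match keep an empty go_terms field instead of
    being dropped, so the merged table has one row per DE gene.
    """
    go_by_gene = {row[key]: row["go_terms"] for row in go_rows}
    merged = []
    for row in de_rows:
        combined = dict(row)
        combined["go_terms"] = go_by_gene.get(row[key], "")
        merged.append(combined)
    return merged


# Tiny made-up tables standing in for the real .tab and .txt files
# (hypothetical column names and values).
de_text = "gene_id\tlog2fc\ngeneA\t1.8\ngeneB\t-2.1\n"
go_text = "gene_id\tgo_terms\ngeneA\tGO:0006950\n"

merged = merge_de_with_go(
    read_tab(io.StringIO(de_text)),
    read_tab(io.StringIO(go_text)),
)
for row in merged:
    print(row["gene_id"], row["log2fc"], row["go_terms"] or "NA", sep="\t")
```

With real data, the two `io.StringIO` objects would be replaced by opened file handles. Keeping unmatched genes makes it easy to see how many differentially expressed genes lack annotation.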
For my final project of characterizing a Pacific Oyster proteome I plan to create the final output files:
My original goals were to:
1) Identify proteins and their functions in C. gigas proteome
2) Compare an oyster proteome to that of another bivalve, the geoduck
3) Draw conclusions about differential protein expression in oysters reared at 23°C and 29°C from 2015 MS/MS data.
My final files will mostly be iPyrad output files for the "data3" assembly that I ran. This is basically the third iteration of the iPyrad run that I ended up using as my final dataset.
- .vcf is a large file with variant and read depth information for each base. I used it to derive the file data3-2.txt, which I used for the EpiRAD analysis.
- .geno is a matrix of alleles that I used for making MDS plots
- .loci is a big file that basically shows all the stacks and the actual bases
- .phy is a file type I did not use, but is basically a supermatrix of the .loci file
- .snps.map provides indexing information for all the loci and SNPs
- .str is a matrix of alleles in a format that is ideal for the program STRUCTURE, but I used it in an R package called adegenet to do discriminant analysis of principal components, which is similar to what STRUCTURE does

Several of these files have unlinked SNP counterpart files denoted .u., e.g. .u.str. I focused my analysis on just the unlinked SNPs, so the .u.geno and .u.str files are the ones I used for the ddRAD analysis.
Further information can be found here: http://ipyrad.readthedocs.io/output_formats.html
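
To make the idea of "unlinked" SNPs a bit more concrete, here is a minimal Python sketch that keeps one randomly chosen SNP per locus. This assumes that unlinked means one SNP per locus; it is not iPyrad's own code, and the `(locus_id, position)` input format and the function name `pick_unlinked` are made up for illustration.

```python
import random


def pick_unlinked(snps, seed=0):
    """Keep one randomly chosen SNP position per locus.

    snps is a list of (locus_id, position) tuples (hypothetical format).
    Returns a dict mapping each locus_id to the single position kept.
    """
    rng = random.Random(seed)  # fixed seed so the choice is repeatable
    by_locus = {}
    for locus_id, position in snps:
        by_locus.setdefault(locus_id, []).append(position)
    return {
        locus_id: rng.choice(positions)
        for locus_id, positions in by_locus.items()
    }


# Made-up example: three loci with two, one, and three SNPs respectively.
snps = [(0, 12), (0, 47), (1, 5), (2, 30), (2, 31), (2, 88)]
unlinked = pick_unlinked(snps)
print(len(unlinked), "loci, one SNP each")
```

Sampling one SNP per locus avoids treating tightly linked SNPs on the same RAD locus as independent markers in downstream analyses.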
For my final product I plan to publish four files:
- .bam file which is the output from the alignment step of Bismark mapping my sequence data to the reference genome. However, this particular file is more or less useless without the methylation extraction step of Bismark.
- .cov file which is basically a text file created during the methylation extraction step of Bismark that has information about whether or not cytosines are methylated, location on the chromosome/scaffold, and what percentage of the cytosines are methylated/unmethylated.
- .bed file which is another output from the methylation extraction step of Bismark that I can open using a viewer (IGV) so that I can actually see my methylation data.
- .tab or .txt file that I will make which will (hopefully) contain some interesting results from searching through the methylation data and using BLAST to find some heavily methylated/unmethylated genes.

If all goes well, my final product should include:
- .scafSeq file (fasta format) with a subset of DNA sequences on scaffolds >70k bp
- .tabular file with results from blasting transcriptome against >70k scaffolds & merged with Uniprot data, indexed
- .gff file with candidate transposable elements, identified via RepeatMasker, indexed
- .gff file with candidate miRNA locations, identified using the miRBase hairpin sequences, indexed
- .gff file with candidate CpG sites, located via Galaxy's EMBOSS fuzznuc online tool, indexed
- .gff file with RNASeq expression reads, indexed
- .xml file for IGV visualization

FYI @sr320 this weekend/week I'm focusing on the RNASeq step, and am a little fuzzy on how to do this to completion, but am using your Oly project repo as guidance/template.
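
For the first item in the list above (a subset of scaffolds longer than 70k bp), a minimal Python sketch of length-based FASTA filtering could look like the following. The function names and the tiny demo sequences are made up for illustration, and the demo uses a threshold of 5 bases so the example stays readable; for the real scaffolds the threshold would be 70000.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep records whose sequence is longer than min_length bases."""
    return [(h, s) for h, s in records if len(s) > min_length]


# Made-up demo input standing in for the real .scafSeq file.
demo = io.StringIO(">scaffold1\nACGTACGT\nACGT\n>scaffold2\nACG\n")
kept = filter_by_length(read_fasta(demo), min_length=5)
for header, seq in kept:
    print(header, len(seq))
```

Reading records one at a time keeps memory use low until the filter step, which matters for large assemblies.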
My final product will include:
- .sam file that contains my reference genome, built from a combination of stacks, bowtie, and blast.
- .genepop file, a matrix of every individual's genotype at every locus that can later be used in GENEPOP to calculate linkage disequilibrium, estimate Nm, among other metrics.
- .sumstats.summary.tsv file, which contains a summary of all the summary statistics for each population. This includes mean observed / expected heterozygosity across variable and all loci, as well as a measure of Fis.
- fst.tsv file, which contains Fst calculations for each pair of populations.

For my final product:
- .fasta file that contains the final assembled transcriptome using all of the sequencing files for all individuals.
- .tabular file that contains the final annotation for the transcriptome and corresponding gene ontology (GO) terms
- .tabular file containing differentially expressed genes for different comparisons
- .jpg file that visualizes the overall differential expression
- .jpg file that visualizes gene ontology enrichment results (might be a table)
- .md file that has the methods and results section for publication
For my final product, I will produce:
- .sam file for my cleaned catalog/de novo reference genome
- .genepop file which has genotypes for all loci of all individuals
- .sumstats.summary.tsv file and fst.tsv which will provide statistical estimates of Fst, Fis, etc, between cohorts (or "populations" here)
- .genepop file to visualize population differences

My final filetypes will be:
- .csv files with lowest common ancestor (LCA) metapeptide results
- .jpg files with plots showing LCA results
- .fasta files with predicted protein results
- .tab with GO terms for predicted proteins
For this week's project-progress, list out the ultimate files you will publish as part of your class product in a bulleted list. Include a brief description of the files and indicate filetype.