hputnam / Becker_E5

Check that each gene name from the Annotation file is found in your gene counts matrix file and also in your gff3 file #14

Closed hputnam closed 4 months ago

hputnam commented 4 months ago

Your annotation file should contain 27,439 genes, identified by BLASTing the protein fasta file, which clearly identifies each protein name and links it directly to the original genome paper. You therefore need to ensure that the gene ids in your gene counts matrix file match the annotation file so that the DEGs can be annotated.

The gene ids in the gff3 file also need to match exactly, both to calculate the correct gene lengths and to join the files for GO analysis.
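The check described above can be sketched as a set comparison. This is a minimal illustration in Python, not the code used in this repo (the actual analysis is in the R scripts linked in this thread). The `toy_*` IDs, the helper names `gff3_gene_ids` and `missing_from`, and the assumption that gene ids live in the `ID=` attribute of `gene` rows in the gff3 are all hypothetical.

```python
def gff3_gene_ids(lines):
    """Collect the ID= attribute of every 'gene' feature from GFF3 lines."""
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, blank lines, and rows that are not 9-column gene features.
        if line.startswith("#") or len(fields) != 9 or fields[2] != "gene":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(f.split("=", 1) for f in fields[8].split(";") if "=" in f)
        if "ID" in attrs:
            ids.add(attrs["ID"])
    return ids


def missing_from(reference_ids, other_ids):
    """Return the reference IDs that are absent from other_ids, sorted."""
    return sorted(set(reference_ids) - set(other_ids))


# Toy data only (hypothetical IDs, not taken from the real files).
annotation_ids = {"toy_g1", "toy_g2", "toy_g3"}
counts_ids = {"toy_g1", "toy_g2"}
toy_gff3_lines = [
    "##gff-version 3",
    "\t".join(["chr1", "toy", "gene", "1", "900", ".", "+", ".", "ID=toy_g1;Name=toy_g1"]),
    "\t".join(["chr1", "toy", "mRNA", "1", "900", ".", "+", ".", "ID=toy_g1.t1;Parent=toy_g1"]),
    "\t".join(["chr1", "toy", "gene", "1500", "2400", ".", "-", ".", "ID=toy_g3"]),
]
gff_ids = gff3_gene_ids(toy_gff3_lines)

print("missing from counts matrix:", missing_from(annotation_ids, counts_ids))
print("missing from gff3:", missing_from(annotation_ids, gff_ids))
```

Both lists should be empty before moving on to annotation, gene length calculation, or GO analysis. Comparing sets rather than row counts matters here because two files can each hold 27,439 entries and still disagree on individual ids.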

daniellembecker commented 4 months ago

I checked that the gene count matrix file and the gff3 file match exactly on gene_ids in the DESeq2 code, as the starting point for all analyses; see lines 185-198 for the comparison: DESeq2

However, looking back at my functional annotation assembly script, even though the output file has 27,439 unique genes, there were still unmatched gene_ids between it and the gff3. I tracked some of the unmatched genes through the script (e.g. "Pver_g408", "Pver_g3070", "Pver_novel_701_5de57afd"), which shows that the functional annotation file that was BLASTed against the original protein fasta file has naming issues. I am currently looking into where the divergence occurs and what the pattern is.
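One way to look for a pattern among unmatched ids is to group them by naming shape and to list reference ids that share a prefix with them. The following is a rough Python sketch under stated assumptions, not the approach taken in the repo's scripts: the helper names `id_shape` and `prefix_candidates` and the `toy_*` IDs are hypothetical, and the thread does not say that the real mismatch is a prefix or suffix difference.

```python
import re
from collections import Counter


def id_shape(gene_id):
    """Reduce an ID to its naming pattern by replacing each digit run with '#'."""
    return re.sub(r"\d+", "#", gene_id)


def prefix_candidates(unmatched_ids, reference_ids):
    """Map each unmatched ID to reference IDs that are a prefix of it, or vice versa.

    Candidates need manual review: 'toy_g4' is also a prefix of 'toy_g40'.
    """
    result = {}
    for uid in unmatched_ids:
        result[uid] = sorted(
            ref for ref in reference_ids
            if ref != uid and (uid.startswith(ref) or ref.startswith(uid))
        )
    return result


# Toy data only (hypothetical IDs, not taken from the real files).
unmatched = ["toy_g12.t1", "toy_novel_7"]
reference = {"toy_g12", "toy_g13"}

shape_counts = Counter(id_shape(i) for i in unmatched)
candidates = prefix_candidates(unmatched, reference)

print(shape_counts)
print(candidates)
```

If most unmatched ids fall into one shape, or each has a single prefix candidate, that points at a systematic renaming step rather than genuinely missing genes.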

hputnam commented 4 months ago

@daniellembecker check the code here for how I dealt with it: https://github.com/hputnam/Poc_RAPID/blob/main/RAnalysis/scripts/Embryo_GeneExpression.Rmd

daniellembecker commented 4 months ago

@hputnam I think your repo is private; the link isn't working and I don't see it in your repo list.

hputnam commented 4 months ago

try again

daniellembecker commented 4 months ago

All gene_ids now match after the edits on lines 188-229: code