laurahspencer / DuMOAR


Annotate transcriptome #44

Open laurahspencer opened 1 year ago

laurahspencer commented 1 year ago

Need Uniprot IDs in output for GO enrichment analysis

kubu4 commented 1 year ago

Where can I find the transcriptome?

laurahspencer commented 1 year ago

@ggoetznoaa can you get Sam the transcriptome so he can run it through his annotation pipeline?

ggoetznoaa commented 1 year ago

@laurahspencer I thought it was in the github repository? I see it in

results/salmon/transcriptome

It's the fasta.gz file.

laurahspencer commented 1 year ago

I don't see it in there (see screenshot). Perhaps you see it on your computer locally but couldn't upload it to GitHub b/c the file is too large (>100 MB)?

image

ggoetznoaa commented 1 year ago

OK, not sure exactly why it wasn't there; the file was only 35 MB. I did end up having to use Sedna to add the file, since my laptop's git install is broken somehow (I just had the laptop updated). I also got a warning saying the file was ignored because of .gitignore. Anyway, the file should be in that folder now.

kubu4 commented 1 year ago

@ggoetznoaa - @sr320 has encountered the issue with a Mac update breaking Git (and he has to deal with it again since the most recent update). This might save you some pain:

https://github.com/RobertsLab/resources/issues/360#issuecomment-417395799

kubu4 commented 1 year ago

TransDecoder/BLASTx/Trinotate annotation complete:

Notebook:

https://robertslab.github.io/sams-notebook/posts/2023/2023-11-09-Transcriptome-Annotation---M.magister-De-Novo-Transcriptome-Assembly-Using-Trinotate-on-Mox/#gene-ontology-go-annotations

Jump to the RESULTS to see the output files.

There's a dedicated GO annotations file and then a full annotation file, which contains the results of all the various tools used for annotations (e.g. BLASTp, RNAmmer, pfam, BLASTx, hmmscan, etc).

And, if you're savvy, there's also a SQLite database that has all the results.

laurahspencer commented 9 months ago

@kubu4 I'm now using the annotation report you generated for the Dungeness crab transcriptome. I'm using it to perform functional/enrichment analyses of differentially expressed genes in DAVID using Uniprot Accessions. I want to use the most comprehensive set of Uniprot Accessions possible, so I want to make sure my approach makes sense.

I see that you ran both blastx and blastp. Genes with blastx hits have Uniprot Accessions in the annotation report, which I can use directly. Other genes without blastx hits do have blastp hits; for those, the annotation report doesn't have Uniprot Accessions, but it does have Uniprot Entry Names, which I can upload to uniprot.org, pull Accession numbers for, and add to the annotation report (in R). Does this make sense? Do you have an easier way to get Uniprot Accessions for as many genes as possible?

Thanks!
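For reference, the entry-name-to-accession lookup described above could also be done offline from a UniProt FASTA file instead of the website. This is only a rough sketch in Python (not the R workflow mentioned above). It assumes the standard UniProt header layout `>db|ACCESSION|ENTRY_NAME description`. The function names `entry_name_to_accession` and `best_accession` are hypothetical, and the two hemoglobin headers are there purely for illustration.

```python
def entry_name_to_accession(fasta_lines):
    """Map UniProt entry names (e.g. 'HBA_HUMAN') to accessions using FASTA headers."""
    mapping = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        if not fields:
            continue  # skip a bare '>' with no identifier
        # Assumed identifier layout: db|ACCESSION|ENTRY_NAME
        parts = fields[0].split("|")
        if len(parts) == 3:
            _db, accession, entry_name = parts
            mapping[entry_name] = accession
    return mapping


def best_accession(blastx_accession, blastp_entry_name, mapping):
    """Prefer the blastx accession; otherwise look up the blastp entry name."""
    if blastx_accession:
        return blastx_accession
    return mapping.get(blastp_entry_name)


# Illustration only: two SwissProt-style headers with their sequences omitted.
example_fasta = [
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2",
    ">sp|P68871|HBB_HUMAN Hemoglobin subunit beta OS=Homo sapiens OX=9606 GN=HBB PE=1 SV=2",
]
mapping = entry_name_to_accession(example_fasta)
print(best_accession("", "HBB_HUMAN", mapping))
```

Genes with neither a blastx accession nor a matching blastp entry name come back as `None`, so they are easy to filter out before uploading the list to DAVID.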

kubu4 commented 9 months ago

I'll look into this.

~The shortcoming lies in the default blastp output format 6.~

~For some reason, the default set of columns differs in blastp output from other common BLAST default outputs (e.g. blastn and blastx both include subject IDs in their default outputs for format 6).~

I'll dive a bit into the Trinotate documentation (this is the software that creates that annotation report) and see if I can figure out whether a "customized" blastp output can be incorporated into the final annotation table (I suspect the answer is "yes").

If that's the case, then I'll just re-run blastp and incorporate the customized output format into a new version of that annotation table.

EDITED: Added strikethrough to incorrect info.

laurahspencer commented 9 months ago

cool, thanks for looking into it!

kubu4 commented 7 months ago

Sorry to take so long on this.

Anyway, I've figured out the issue and, possibly, how to solve it.

The issue is caused by the BLASTp database which Trinotate uses. The peptide BLAST database it's using does not contain the SwissProt IDs in the source FastA header. Thus, SPIDs aren't used for generating the final annotation report file (since they aren't present).

I believe the solution will be to create the BLASTp database myself, using the full Uni/SwissProt protein FastA. I've done a quick BLASTp test against this "custom" BLASTp database and the results contain the expected SwissProt IDs in column 2 of the output file.

I'll try to tackle this soon and report back.
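As a rough illustration of what pulling SwissProt IDs out of column 2 could look like, here is a minimal Python sketch. It assumes the default 12-column tabular BLAST output (format 6) and assumes that the subject IDs in column 2 look like `sp|ACCESSION|ENTRY_NAME`. Both are assumptions about the custom database output, not something confirmed here. The function name `accessions_from_outfmt6`, the query IDs, and the numeric values in the example rows are all made up.

```python
import csv
import io


def accessions_from_outfmt6(handle):
    """Return {query_id: accession} using the first hit listed for each query."""
    hits = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, malformed, or comment lines
        query_id, subject_id = row[0], row[1]
        # Assumed subject ID layout: sp|ACCESSION|ENTRY_NAME
        parts = subject_id.split("|")
        accession = parts[1] if len(parts) == 3 else subject_id
        hits.setdefault(query_id, accession)  # keep only the first hit per query
    return hits


# Made-up rows in format 6 column order: qseqid, sseqid, pident, length,
# mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
example_rows = [
    ["query_1", "sp|P69905|HBA_HUMAN", "98.5", "140", "2", "0", "1", "140", "2", "141", "1e-90", "280"],
    ["query_1", "sp|P68871|HBB_HUMAN", "45.0", "138", "70", "3", "1", "138", "4", "143", "1e-30", "110"],
    ["query_2", "sp|P68871|HBB_HUMAN", "99.0", "146", "1", "0", "1", "146", "2", "147", "1e-95", "295"],
]
example_output = io.StringIO("\n".join("\t".join(row) for row in example_rows))
hits = accessions_from_outfmt6(example_output)
print(hits)
```

Keeping only the first line per query relies on BLAST listing each query's hits best-first. If the output has been re-sorted, it would be safer to compare e-values or bit scores explicitly.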