maggimars / Tara-Phaeo


Functional Gene Clustering #2

Closed maggimars closed 3 years ago

maggimars commented 3 years ago

Protein-level clustering with MMseqs2 (faster) and/or OrthoFinder (more traditional).

halexand commented 3 years ago

+1 for OrthoFinder. This might be a good initial tutorial / set of best practices to read through: https://davidemms.github.io/menu/tutorials.html

maggimars commented 3 years ago

The tutorial was hugely helpful!

I had some installation issues on the HPC:

halexand commented 3 years ago

We definitely don't want you running this on your local machine :)

I have successfully built a conda env for OrthoFinder on the HPC. When an install fails like this, I find that it often works if you specify the required packages when you create the environment, e.g.: conda create -n ortho -c bioconda orthofinder

Alternatively, you can list the requirements in a YAML file and build the environment from that.

Also, I just activated my environment. If it helps: I am using OrthoFinder version 2.3.12. Might be worth specifying that version in the install?

maggimars commented 3 years ago

Your trick worked! I ran conda create -n ortho -c bioconda orthofinder. The install finished and I was able to run orthofinder within the environment.

Running the source code also worked - it finished already and I have a directory full of results.

For the record - I installed v2.5.1

maggimars commented 3 years ago

I made an UpSet plot from OrthoFinder's Orthogroups.GeneCount.tsv output file. Relatively few orthogroups are present in all of the genomes/transcriptomes. What stands out most is that the P. cordata RCC1383 transcriptome is very small and is missing many orthogroups that are present in all or most of the other genomes/transcriptomes. RCC1383 is actually a synonym of CCMP3104, so maybe we should exclude this "strain" or combine it with the CCMP3104 transcriptome?

It is not visible in this plot, but when I ordered the plot by frequency rather than degree, it was apparent that a relatively high number (~300) of orthogroups were represented only in the P. antarctica genome/transcriptomes (all three) and not in any other species. It may be interesting to look at the functions of this batch of orthogroups.

[Figure: UpSet plot of orthogroup presence across the genomes/transcriptomes]
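For anyone wanting to pull out the same sets without the plot, here is a minimal sketch of how the "core" orthogroups and the P. antarctica-only orthogroups could be extracted from a gene count table. It assumes a tab-separated layout with the orthogroup ID in the first column, one column per genome/transcriptome, and an optional final "Total" column. The sample names and counts below are made-up toy values, not results from this project, and the function names are hypothetical helpers.

```python
import csv
import io


def read_gene_counts(handle):
    """Parse a tab-separated orthogroup-by-sample count table.

    Assumes the first column holds the orthogroup ID and that a column
    named "Total", if present, is a row sum rather than a sample.
    Returns {orthogroup: {sample: count}}.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    id_col = reader.fieldnames[0]
    samples = [c for c in reader.fieldnames[1:] if c != "Total"]
    return {row[id_col]: {s: int(row[s]) for s in samples} for row in reader}


def core_orthogroups(counts, exclude=()):
    """Orthogroups with at least one gene in every sample not in `exclude`."""
    return sorted(
        og for og, row in counts.items()
        if all(n > 0 for s, n in row.items() if s not in exclude)
    )


def exclusive_orthogroups(counts, group):
    """Orthogroups present in every sample of `group` and absent elsewhere."""
    group = set(group)
    return sorted(
        og for og, row in counts.items()
        if all((n > 0) == (s in group) for s, n in row.items())
    )


# Toy table with hypothetical sample names (illustration only).
TOY_TABLE = (
    "Orthogroup\tPant_1\tPant_2\tPcor_RCC1383\tPglo\tTotal\n"
    "OG0000000\t2\t1\t1\t3\t7\n"
    "OG0000001\t1\t1\t0\t1\t3\n"
    "OG0000002\t1\t2\t0\t0\t3\n"
    "OG0000003\t0\t0\t1\t0\t1\n"
)

counts = read_gene_counts(io.StringIO(TOY_TABLE))
core_all = core_orthogroups(counts)
core_without_small = core_orthogroups(counts, exclude={"Pcor_RCC1383"})
pant_only = exclusive_orthogroups(counts, {"Pant_1", "Pant_2"})
print(len(core_all), len(core_without_small), pant_only)
```

With a real table, the `io.StringIO` handle would be replaced by `open(path, newline="")`, and the resulting orthogroup IDs could be used to look up member sequences for functional annotation.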

halexand commented 3 years ago

Ah, love this kind of plot.

First thought: yes, totally agree. It seems that RCC1383 should be chucked. Without it you suddenly have ~1800 core genes, which seems more on par with other groups. I don't think you should combine it; that complicates the story.

I have a set of 12 variably complete Phaeo MAGs. One is estimated to be 70% complete... might be worth including here: SAO-all-DCM-0-8-5-00_bin-181 (newly named TOPAZ_SAD1_E003). It is isolated from the Southern Atlantic DCM (0.8-5 um size fraction). The rest are much lower completeness, and I don't think it makes sense to include them.

I can send you the protein coding regions if you think it might be worth taking a look at.
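One way to make the "drop it or keep it" decision less ad hoc, for RCC1383 or for a partially complete MAG, is a leave-one-out check: recompute the core size with each sample removed in turn and see which sample constrains the core the most. The sketch below assumes orthogroup presence has already been reduced to a mapping from orthogroup ID to the set of samples containing it, and that every sample appears in at least one orthogroup. The names and data are hypothetical toy values.

```python
def core_size(presence, samples):
    """Number of orthogroups present in every sample in `samples`."""
    samples = set(samples)
    return sum(1 for present_in in presence.values() if samples <= present_in)


def leave_one_out_core(presence):
    """Return (baseline core size, {sample: core gained by dropping it})."""
    all_samples = set().union(*presence.values())
    baseline = core_size(presence, all_samples)
    gains = {
        s: core_size(presence, all_samples - {s}) - baseline
        for s in all_samples
    }
    return baseline, gains


# Toy presence/absence data (illustration only, not project results).
presence = {
    "OG1": {"A", "B", "C", "small"},
    "OG2": {"A", "B", "C"},
    "OG3": {"A", "B", "C"},
    "OG4": {"A", "B"},
    "OG5": {"small"},
}

baseline, gains = leave_one_out_core(presence)
# Samples sorted so that the one limiting the core the most comes first.
ranked = sorted(gains, key=gains.get, reverse=True)
print(baseline, ranked[0], gains[ranked[0]])
```

A sample whose removal adds far more core orthogroups than any other is a candidate for exclusion, although a small or incomplete assembly is only one possible explanation for that pattern.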

halexand commented 3 years ago

See /vortexfs1/omics/alexander/share/formaggi for the genomic contigs, predicted coding sequences, and amino acid sequences :)

halexand commented 3 years ago

A bit more food for thought. This paper seems relevant: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0097801

In particular, Figure 1 seems relevant to this section of the project. Notably, they don't detail the parameters they used for clustering with OrthoMCL, so I have no real idea how to compare it to what you are doing here. But they identify ~3k core proteins across the groups surveyed.


halexand commented 3 years ago

Also, I found this paper interesting: https://www.ncbi.nlm.nih.gov/books/NBK558824/. Not that we are necessarily dealing with a pangenome situation, but some of the open vs. closed pangenome concepts are interesting. I also think the section on chlorophytes might be broadly relevant.