Closed maggimars closed 3 years ago
+1 for OrthoFinder : this might be a good initial tutorial / best practices to read through: https://davidemms.github.io/menu/tutorials.html
The tutorial was hugely helpful!
I had some installation issues on the HPC:
conda install -c bioconda orthofinder
led to a long list of package conflicts and ultimately a failed installation.
We definitely don't want you running this on your local machine :)
I have successfully built a conda env for orthofinder on the HPC. Sometimes when this happens I find that it will work if you define the install requirements when you build the environment, e.g.:
conda create -n ortho -c bioconda orthofinder
Or, by creating a yaml file with the requirements and building from that.
Also, I just activated my environment. If it helps: I am using OrthoFinder version 2.3.12. Might be worth specifying that version in the install?
Your trick worked! I ran
conda create -n ortho -c bioconda orthofinder
The install finished and I was able to run orthofinder within the environment.
Running the source code also worked - it finished already and I have a directory full of results.
For the record - I installed v2.5.1
I made an UpSet plot with the Orthogroups.GeneCount.tsv output file from OrthoFinder. Relatively few orthogroups are present in all of the genomes/transcriptomes. The thing that stands out the most is that the P. cordata RCC1383 transcriptome is very small and is missing many orthogroups that are present in all or most of the other genomes/transcriptomes. RCC1383 is actually a synonym of CCMP3104, so maybe we should exclude this "strain" or combine it into the CCMP3104 transcriptome? It is not visible in this plot, but when I ordered the plot by frequency rather than degree it was apparent that a relatively high number (~300) of orthogroups were represented only in the P. antarctica genome/transcriptomes (all three) and not in any other species. These may be an interesting batch of orthogroups to look up functions for.
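In case it is useful, here is a minimal sketch of how the numbers behind an UpSet plot could be tallied straight from the gene count table. It assumes the table is tab-separated with an `Orthogroup` column, one count column per input genome/transcriptome, and a final `Total` column. The sample names and counts in `DEMO` are made up for illustration and are not from this project.

```python
import csv
import io
from collections import Counter

def membership_counts(tsv_text):
    """Count orthogroups per presence/absence pattern (one UpSet bar each)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Every column except the ID and the row total is treated as a sample.
    samples = [c for c in reader.fieldnames if c not in ("Orthogroup", "Total")]
    patterns = Counter()
    for row in reader:
        present = frozenset(s for s in samples if int(row[s]) > 0)
        patterns[present] += 1
    return samples, patterns

def exclusive_to(patterns, group):
    """Orthogroups present in every sample of `group` and in no other sample."""
    return patterns.get(frozenset(group), 0)

# Hypothetical toy table standing in for the real OrthoFinder output.
DEMO = "\n".join([
    "Orthogroup\tPant_A\tPant_B\tPglo\tTotal",
    "OG0000000\t2\t1\t3\t6",
    "OG0000001\t1\t1\t0\t2",
    "OG0000002\t1\t2\t0\t3",
    "OG0000003\t0\t0\t4\t4",
])

samples, patterns = membership_counts(DEMO)
print("core orthogroups:", patterns[frozenset(samples)])
print("only in Pant_A + Pant_B:", exclusive_to(patterns, ["Pant_A", "Pant_B"]))
```

For the real file, the text could be read with `open(path).read()` and the three P. antarctica column names passed to `exclusive_to` to pull out the count of species-specific orthogroups.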
Ah, love this kind of plot.
First thought, yes, totally agree. It seems that RCC1383 should be chucked. Without it you suddenly have ~1800 core genes, which seems more on par with other groups. I don't think you should combine it... it complicates the story.
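A quick way to check how much one sparse sample shrinks the core set is to recount core orthogroups with that column left out. This is only a sketch: it assumes a tab-separated gene count table with an `Orthogroup` column, one column per sample, and a `Total` column, and the sample names and counts in `DEMO` are invented for illustration.

```python
import csv
import io

def core_count(tsv_text, exclude=()):
    """Count orthogroups with at least one gene in every retained sample."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Drop the ID column, the row total, and any samples the caller excludes.
    skip = {"Orthogroup", "Total", *exclude}
    samples = [c for c in reader.fieldnames if c not in skip]
    return sum(all(int(row[s]) > 0 for s in samples) for row in reader)

# Hypothetical toy table: sparse_C is missing most orthogroups.
DEMO = "\n".join([
    "Orthogroup\tfull_A\tfull_B\tsparse_C\tTotal",
    "OG0000000\t1\t1\t1\t3",
    "OG0000001\t2\t1\t0\t3",
    "OG0000002\t1\t3\t0\t4",
    "OG0000003\t0\t1\t1\t2",
])

print("core with all samples:", core_count(DEMO))
print("core without sparse_C:", core_count(DEMO, exclude=["sparse_C"]))
```

On the real table, passing the RCC1383 column name in `exclude` would give the with/without comparison directly, without rerunning OrthoFinder. Note that this only recounts existing orthogroups; rerunning the clustering without the sample could change the groupings themselves.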
I have a set of 12 variably complete Phaeo MAGs. One is estimated to be 70% complete... might be worth including here: SAO-all-DCM-0-8-5-00_bin-181 (newly named TOPAZ_SAD1_E003). It is isolated from the Southern Atlantic DCM (0.8-5 um size fraction). The rest are much lower completeness, and I don't think it makes sense to include them.
I can send you the protein coding regions if you think it might be worth taking a look at.
See /vortexfs1/omics/alexander/share/formaggi for the genomic contigs, predicted coding sequences, and amino acid sequences :)
A bit more food for thought: this paper seems relevant: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0097801
Figure 1 in particular applies to this section of the project. Notably, they don't detail what parameters they used in the clustering with OrthoMCL, so there is no real way to compare it to what you are doing here. But they identify ~3k core proteins across the groups surveyed.
Also, I found this paper interesting: https://www.ncbi.nlm.nih.gov/books/NBK558824/. Not that we are necessarily dealing with a pangenome situation, but some of the open vs. closed pangenome concepts are interesting. I also think the section on chlorophytes might be broadly relevant.
Protein-level clustering with MMseqs2 (faster) and/or OrthoFinder (more traditional)