steineggerlab / ufcg

UFCG: Universal Fungal Core Genes
https://ufcg.steineggerlab.com
GNU General Public License v3.0

how to change busco databases? #10

Open jungleblack007 opened 1 year ago

jungleblack007 commented 1 year ago

For example, I want to use agaricales_odb10 as the reference database to pick single-copy orthologs. How can I switch from fungi_odb10 to agaricales_odb10?

jungleblack007 commented 1 year ago

Another question: how can I calculate GSI values and label the branches with them?

endixk commented 1 year ago

To use the agaricales_odb10 database, you will have to download the ODB profiles and process them into a form that the UFCG pipeline can accept.

For this, please run the following commands on your system (this may take a while):

# Download and unzip the agaricales_odb10 database
wget -q "https://busco-data.ezlab.org/v4/data/lineages/agaricales_odb10.2020-08-05.tar.gz"
tar xzf agaricales_odb10.2020-08-05.tar.gz
gzip -d agaricales_odb10/refseq_db.faa.gz

# Prepare model and sequence databases for the UFCG pipeline
cd agaricales_odb10/
ls prfl/ | cut -d. -f1 > gene_list
sed -z 's/\n/,/g;s/,$/\n/' gene_list > gene_set
mkdir -p model/pro/ seq/pro/
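# For each gene, copy its profile as an HMM and extract its reference sequences
# (grep -A1 keeps each matching header line plus the one line that follows it)
while read -r I; do
    cp "prfl/$I.prfl" "model/pro/$I.hmm"
    grep -PA1 --no-group-separator "^>$I" refseq_db.faa > "seq/pro/$I.fa"
done < gene_list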
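
Before moving on, it may be worth checking that every gene in gene_list ended up with both a model file and a non-empty sequence file. The function below is only a minimal sketch of such a check: `check_gene_db` is a hypothetical helper name, not part of UFCG, and it assumes the gene_list, model/pro/ and seq/pro/ layout created by the commands above.

```shell
# Hypothetical helper (not part of UFCG): verify the layout built above.
# Usage: check_gene_db [DIR]   (DIR defaults to the current directory)
check_gene_db() {
    local dir="${1:-.}" missing=0 gene
    while read -r gene; do
        # Each gene needs a non-empty model file ...
        if [ ! -s "$dir/model/pro/$gene.hmm" ]; then
            echo "missing model: $gene"
            missing=$((missing + 1))
        fi
        # ... and a non-empty FASTA file (empty means grep matched nothing)
        if [ ! -s "$dir/seq/pro/$gene.fa" ]; then
            echo "missing sequences: $gene"
            missing=$((missing + 1))
        fi
    done < "$dir/gene_list"
    echo "problems found: $missing"
    # Return success only when nothing is missing
    [ "$missing" -eq 0 ]
}
```

Running `check_gene_db .` from inside agaricales_odb10/ prints one line per missing or empty file, followed by a summary count.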

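After running the commands above, the following command will extract the agaricales_odb10 gene set from your sequence(s). The paths are relative, so run it from inside agaricales_odb10/ or adjust model/, seq/ and gene_set accordingly: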

ufcg profile --modelpath model/ --seqpath seq/ -s $(cat gene_set) -i /path/to/input -o /path/to/output <options>

endixk commented 1 year ago

For the second question, the output of the ufcg tree module includes a Newick file named concatenated_gsi_[N].nwk, which is exactly the GSI-labeled tree you are looking for. [N] is the total number of genes that were considered when calculating the indices.
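
To take a quick look at the labels without opening a tree viewer, something like the following could work. It is a minimal sketch that assumes the GSI values are stored as ordinary Newick internal-node labels, that is, the text directly after each closing parenthesis. `list_node_labels` is a hypothetical helper name, not a UFCG command.

```shell
# Hypothetical helper: print every internal-node label of a Newick file, one per line.
# Assumption: labels sit right after a closing parenthesis, as in standard Newick.
list_node_labels() {
    # Match ")" followed by characters up to the next Newick delimiter,
    # then drop the leading ")" so only the label remains
    grep -oE '\)[^:,;()]+' "$1" | tr -d ')'
}
```

Call it as `list_node_labels /path/to/output/concatenated_gsi_[N].nwk`, with [N] replaced by the number in your actual file name.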

jungleblack007 commented 1 year ago

wow, thank you for your detailed answer, it's so great! I am trying now.