jsoghigian / orthoset_construction

Tools and resources for the construction of ortholog sets usable with the software Orthograph
GNU General Public License v3.0

Inquiry about orthoset_construction error! #3

Open woojunbang opened 5 months ago

woojunbang commented 5 months ago

Hello. This is Woo Jun Bang, and I'm trying to use this program to build an Orthograph reference data set!

As a test, I ran 'sh ortho_dl.sh culicidae 7157 0.9 0.9' and got the output below:

sh ortho_dl.sh culicidae 7157 0.9 0.9
/Users/woojunbang/program/orthoset_construction
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1201k    0 1201k    0     0   213k      0 --:--:--  0:00:05 --:--:--  253k
Bad RequestBad RequestBad RequestBad Request

(This error message keeps repeating.)

I got the same message for 'sh ortho_dl.sh coloeoptera 7041 0.8 0.8'.

So I checked the URL format against the OrthoDB user guide at "https://www.ezlab.org/orthodb_userguide.html#standalone-orthologer-software".

However, I found no differences from the URLs used in your "ortho_dl.sh" script for OrthoDB v10.
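When the URL format looks right but the server still answers "Bad Request", it can help to make curl fail loudly on HTTP errors instead of passing the error body through. The following is a minimal sketch, not part of ortho_dl.sh: the helper name `fetch_or_report` is hypothetical, and it relies only on curl's `-f` (fail on HTTP errors), `-sS` (quiet, but still show errors), and `-o` options.

```shell
# Hypothetical helper (not part of ortho_dl.sh): fetch a URL into a file and
# report a failure instead of saving or printing the server's error body.
fetch_or_report() {
    url=$1
    out=$2
    # -f: exit non-zero on HTTP errors (status >= 400)
    # -sS: hide the progress meter but still print error messages
    if curl -fsS -o "$out" "$url"; then
        echo "ok: $url"
    else
        status=$?
        echo "failed (curl exit $status): $url" >&2
        return 1
    fi
}
```

Calling it with one orthogroup URL and an output file name, before starting the full loop, shows whether a single request is accepted.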

I changed only line 35 of "ortho_dl.sh" to try to fix this, and then got the error below. The script itself follows the error output:

/Users/woojunbang/program/orthoset_construction
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1417k    0 1417k    0     0   220k      0 --:--:--  0:00:06 --:--:--  290k
ortho_dl.sh: line 35: unexpected EOF while looking for matching `''
ortho_dl.sh: line 39: syntax error: unexpected end of file

#!/bin/bash

SCRIPTPATH="$( cd "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
echo $SCRIPTPATH

# This script is part of the ortholog construction pipeline written by J. Soghigian.
# It was last edited on 2023-02-28.
#
# This script will download all the fasta files belonging to a taxonomic level that meet user-specified thresholds of inclusion for universality and single-copy nature.
#
# Usage:
#
# sh ortho_dl.sh prefix_name level universality level_of_single_copy
#
# Prefix name refers to the prefix used in downloading and construction of folders, and can be the name of the taxonomic group, e.g. Diptera. Level refers to the internal taxonomic identifier used by OrthoDB, e.g. 7147 for Diptera. Universality is the presence of the ortholog across the genomes on OrthoDB at that taxonomic level, and level of single copy refers to orthologs that are single copy in at least that percentage of genomes. In other words,
#
# sh ortho_dl.sh diptera 7147 0.9 0.9
#
# will include only orthologs found in 90% of the Diptera genomes, and of those, we include only orthologs that are single copy in at least 90% of genomes. It is important to note that this WILL download duplicated genes (we take care of them later). The set will have the prefix diptera.

# First, we will define some variables. We will start with the ortholog database prefix. This is the ortholog set name we might use later for e.g. Orthograph, but the exact name is arbitrary.

ogprefix=$1

# We will now use curl to download a list of fasta file IDs for a given taxonomic level (level=7147) and species/set of species (7147). Consider adjusting the universal/single copy settings as desired - here, universality (presence in genomes) is set to 0.9, and threshold for single copy is also set to 0.9. This means that of all the genomes at this taxonomic level, we include only orthologs found in 90% of the genomes, and of those, we include only orthologs that are single copy in at least 90% of genomes. It is important to note that this WILL download duplicated genes (we take care of them later).

# Note: if you are targeting a set of orthologs >10k, you'll need to adjust the limit we set as well.

level=$2
uni=$3
sc=$4

curl -o ${ogprefix}.uni0.9single0.9.fasta "https://data.orthodb.org/current/search?query=&level=${level}&species=${level}&universal=${uni}&singlecopy=${sc}&take=100000"

# We will now process this file so that it can be fed into a loop. This will allow us to download each fasta file individually for each ortholog.

cat ${ogprefix}.uni0.9single0.9.fasta | awk -F"[" '{print $2}' | awk -F"]" '{print $1}' | sed 's/"//g' | perl -pe 's/, /\n/g' > ${ogprefix}.listoffasta

# This is now a file that contains OrthoDB IDs for orthologs at a given taxonomic level. E.g., 10359at7203 is Orthogroup 10359 at taxonomic level 7203. This list corresponds to the specifications we used in the curl command above; e.g., the orthogroups contained herein are present in 90% of genomes at that taxonomic level and 90% single copy in those genomes. So with this identifier, we can now download this orthogroup as a fasta file.

# To begin, we create a folder to store these orthologs.

mkdir ${ogprefix}_orthologs

# Now we loop over the aforementioned list of fasta files and download each orthogroup's fasta file. Note that this URL may change as OrthoDB changes their URLs. Consult OrthoDB for more information.

for line in `cat ${ogprefix}.listoffasta`; do curl 'https://data.orthodb.org/current/fasta?id='${line}' -o ${ogprefix}_orthologs/${line}.fasta'; sleep 2; done

rm ${ogprefix}.listoffasta
rm ${ogprefix}.uni0.9single0.9.fasta
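In the download loop as posted, the closing single quote comes after `.fasta`, so the shell hands `-o` and the output path to curl as part of the URL string rather than as separate arguments. A URL containing spaces could plausibly be what the server rejects as "Bad Request", although that is an inference from the pasted text and not a confirmed diagnosis. Below is a minimal sketch of the same loop with the quoting adjusted. The function name `download_orthogroups` and the `DRY_RUN` switch are hypothetical additions for illustration, and the URL pattern is copied from the script above, so it may need updating if OrthoDB changes its URLs.

```shell
# Sketch of the download loop with each curl argument quoted separately.
# DRY_RUN is a hypothetical switch: when set to 1, commands are printed
# instead of executed, so they can be inspected before any download.
download_orthogroups() {
    ogprefix=$1   # prefix used for the output folder, e.g. culicidae
    list=$2       # file with one OrthoDB orthogroup ID per line

    [ "${DRY_RUN:-0}" = 1 ] || mkdir -p "${ogprefix}_orthologs"

    while IFS= read -r id; do
        [ -n "$id" ] || continue   # skip blank lines
        url="https://data.orthodb.org/current/fasta?id=${id}"
        out="${ogprefix}_orthologs/${id}.fasta"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "curl -o $out $url"
        else
            curl -o "$out" "$url"
            sleep 2   # same pause between requests as in the posted script
        fi
    done < "$list"
}
```

Running `DRY_RUN=1 download_orthogroups culicidae culicidae.listoffasta` prints one curl command per ID without contacting the server. Reading the list with `while read` also avoids relying on word splitting of the `cat` output.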

Could you let me know how to resolve this issue?

By the way, your phylogenomics paper has been incredibly insightful and has helped me greatly!

Thank you.

jsoghigian commented 5 months ago

Hi Woo Jun Bang,

Thanks for the report - I'll take a look at the scripts and see what needs to be updated / fixed and get back to you.

And thanks for the kind comment on the paper - glad it's been helpful!

jsoghigian commented 4 months ago

Hey Woo Jun, sorry for the slow response here - I had quite a bit of grading to get done. I think I addressed this issue with today's update to ortho_dl.sh. I ran the pipeline with sh ortho_dl diptera 1.0 1.0 for speed, and both scripts worked. I'm running your exact culicidae code now; it is still processing the orthologs, but it appears to be working correctly based on preliminary output. Could you check today's code updates and let me know if it is working for you? :)
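When checking an updated run, one quick sanity test is to look for downloaded files that hold an error body instead of sequences. The sketch below assumes that every valid file contains at least one FASTA header line starting with `>`. The function name `list_non_fasta` is hypothetical and not part of the repository.

```shell
# Hypothetical check (not part of the repository): print the path of every
# .fasta file in a directory that has no line beginning with ">".
list_non_fasta() {
    dir=$1
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue        # directory has no .fasta files
        grep -q '^>' "$f" || echo "$f"
    done
}
```

For the test run in this thread, `list_non_fasta culicidae_orthologs` would print nothing if every download looks like FASTA.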