harmslab / topiary

Python framework for doing ancestral sequence reconstruction
MIT License

Problem with pulling proteome from NCBI--solution found! #44

Open ani-sch opened 3 months ago

ani-sch commented 3 months ago

Hello! Thank you for this great software and your time!

### EDIT/UPDATE 2 (solved!): We found a solution to this problem. You can build a local BLAST database that is used specifically for the reciprocal BLAST step. There are references and guides for doing this scattered through the topiary docs and GitHub, but you may have to dig a little. First, find and download the proteome files for the species in your seed dataframe. I was able to find the protein.faa.gz files by searching the NCBI Datasets site, https://ncbi.nlm.nih.gov/datasets/. Then build your local database with the makeblastdb command (run it with -help if needed; there is more information online as well). I had the best luck using cat to combine the proteome files first, then using the combined file as the input. Once the database is set up, start the pipeline: run the topiary-seed-to-alignment command and include the --local_recip_blast_db /path/to/databasename.faa argument. Running topiary-seed-to-alignment --help is useful for setting this up. Here are a few links with relevant info:

- topiary.ncbi.blast.recip API reference: https://topiary-asr.readthedocs.io/en/latest/topiary.ncbi.blast.html#module-topiary.ncbi.blast.recip
- GitHub commit (you may need to copy/paste this one into your browser, sorry): https://github.com/harmslab/topiary/commit/468a6d72bbdb58a1d312f068feb8e02d9facfb34
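For anyone who wants to script the steps above, here is a rough Python sketch of the same workflow. It is not taken from the topiary docs: the helper names (`combine_proteomes`, `makeblastdb_command`, `seed_to_alignment_command`) and the file names in the usage comments are made up for illustration, and passing the seed file as a positional argument is an assumption you should check against `topiary-seed-to-alignment --help`. The sketch decompresses and combines the downloaded protein.faa.gz files (the equivalent of the cat step), then builds the two command lines without running them.

```python
import gzip
from pathlib import Path


def combine_proteomes(gz_files, combined_fasta):
    """Decompress gzipped FASTA files and concatenate them into one FASTA file."""
    combined_fasta = Path(combined_fasta)
    with combined_fasta.open("wb") as out:
        for gz_file in gz_files:
            with gzip.open(gz_file, "rb") as src:
                data = src.read()
            out.write(data)
            # Guard against a file that lacks a trailing newline, so the next
            # header line does not get glued onto the previous sequence.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return combined_fasta


def makeblastdb_command(combined_fasta):
    """Build (but do not run) a makeblastdb call for a protein database."""
    fasta = str(combined_fasta)
    # -out is set to the FASTA path so the database name matches the file name.
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", fasta]


def seed_to_alignment_command(seed_file, recip_db):
    """Build (but do not run) the topiary call that uses the local database.

    Assumption: the seed dataframe is passed as a positional argument.
    """
    return ["topiary-seed-to-alignment", str(seed_file),
            "--local_recip_blast_db", str(recip_db)]


# Example usage (hypothetical paths; requires BLAST+ and topiary on your PATH):
#   import subprocess
#   files = sorted(Path("proteomes").glob("*.faa.gz"))
#   db = combine_proteomes(files, "proteomes/databasename.faa")
#   subprocess.run(makeblastdb_command(db), check=True)
#   subprocess.run(seed_to_alignment_command("seed-dataframe.csv", db), check=True)
```

Building the commands as lists and only running them once you have checked them makes it easier to confirm the paths are right before starting a long pipeline run.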

### EDIT/UPDATE: In the docs, it says users can specify the source of sequences (using --blast_xml, --ncbi_blast_db, and --local_blast_db). However, I can't tell whether those options apply only to building the sequence dataset (before doing reciprocal BLAST), or whether they can also be used to build a database for the reciprocal BLAST step specifically. If the latter is possible, we think that could solve the problem, since we could build a database containing the unretrievable proteomes. But we are unsure whether doing so would interfere with building, or limit, the sequence dataset (pre-reciprocal BLAST). ###

### original post: I am beginning an ASR project using this software, but am running into an issue in the seed-to-alignment phase. I have a seed dataset and am able to run the first command in the pipeline. The BLAST query seems successful, but after the "Doing reciprocal BLAST" step I get errors (text file with the error message attached). It seems the location of the Homo sapiens proteome has changed: the error output gives the full path to where it expects the file to be, but following that path turns up no file.

My main question is: what is the best course of action in this situation? I really can't remove this species from my seed dataset (or set it to false) because it is a crucial species to include for my purposes. I'm assuming there's a way to get the proper proteome and provide it to topiary, but I'm unsure of the best way to do that...

Thank you for your time and help! topiary-error-may16.txt

cfreye commented 3 months ago

I have been encountering the same issue. Any help would be greatly appreciated. Thank you!

ani-sch commented 3 months ago

> I have been encountering the same issue. Any help would be greatly appreciated. Thank you!

I'm not sure if you are still having this problem, but we found a solution! I updated my original post outlining what we did. If any additional explanation would be helpful, let me know! :)

cfreye commented 3 months ago

@ani-sch this is very helpful, thank you so much for sharing! I was able to resolve this issue on my end as well based on your suggestions.