Open nick-youngblut opened 4 years ago
Thank you for the suggestion. I've updated StrainPro to version v0.9.4. It will read the env variable (StrainPro_DIR=/path/StrainPro/bin) to find all StrainPro's tools. And I also change -dump to specify the taxonomy path (-dump /TaxonomyPath). Please let me know if there are any problems while running StrainPro. Thank you!
Thanks! I just tested both feature updates and they both work.
Just curious: why use StrainPro_DIR
for setting the path to the executables? Most users will likely add /their/path/to/StrainPro/bin/
to their PATH
, so StrainPro_DIR
would be redundant.
@nick-youngblut It's more convenient to have a variable indicating the path of StrainPro. If I only parse $PATH, I couldn't be sure users will use StrainPro as the directory name. In that case, StrainPro will not be able to locate the executable's path.
If I only parse $PATH, I couldn't be sure users will use StrainPro as the directory name
Why would you need to parse $PATH? Why not just see if StrainPro-rep
is in the PATH? For python this simply involves:
from distutils.spawn import find_executable
if find_executable('cat') is None:
raise OSError('"cat" executable is not in your PATH')
With v0.9.4
I'm getting the following error when running StrainPro-build
with -dump
:
Load taxonomy information.
Error! File: ./bin/scripts/taxonomy/nodes.dmp is not accesible
I'm using the directory path to the taxdump files, which is not ./bin/scripts/taxonomy/
I'll study how to find the path of a program in C/C++. In the latest version of StrainPro, "-dump" is followed by a dump file path. For example, run StrainPro with "-dump ./taxonomy", then StrainPro will look for nodes.dmp and merged.dmp in the folder of ./taxonomy. However, if you don't specify the path, it will look for the StrainPro_Root_Dir/taxonomy as default. For example, if your StrainPro is installed in /path/StrainPro, then the dmp file path will be /path/StrainPro/taxonomy.
if you don't specify the path, it will look for the StrainPro_Root_Dir/taxonomy as default.
but I did provide the path via -dump /path/to/the/taxdump/
. Is there a problem with using an absolute path?
I'll study how to find the path of a program in C/C++
The downside of using C/C++
: everything is harder than with python or R. One option would be to use whereami
@hsinnan75 I just wanted to ping you about the -dump /path/to/the/taxdump/
issue, as explained above. Any progress on fixing that issue?
Sorry, I forgot to fix this issue. I'll do it asap and let you know.
Hi, I just found a bug. I forgot to include "-dump
Thanks for the fix! That worked, but now striainpro-map stalls when I run it. Here's the command:
DUMPDIR=`/path/to/by/custom/taxdump/`
DBDIR=`/path/to/db/created/by/strainpro-build`
export StrainPro_DIR=`dirname ./bin/scripts/StrainPro/StrainPro-map`
./bin/scripts/StrainPro/StrainPro-map -t 8 -dump $DUMPDIR -i $DBDIR -o output.txt -f /path/to/read/file/file.fq.gz
I get the following messages within a few seconds, and then strainpro-map just stalls:
Load taxonomy information.
Load BWT index: /path/to/StrainPro/ref_idx/0 (1/2)
2000000 reads have been processed (1.20% mapped) in 6 seconds
Load BWT index: /path/to/StrainPro/ref_idx/1 (2/2)
180000 reads have been aligned in 1 seconds
It might be worth it to use continuous integration to help identify bugs in the StrainPro code. This is very simple with Travis-CI or Circle-CI. For travis-CI, you'd just need to create a .travis.yml
similar to https://github.com/nick-youngblut/gtdb_to_taxdump/blob/master/.travis.yml
I'll say the bug should be the identification of LCA for two taxIDs. If they don't have any common ancestor, the identification of LCA will stall. Could you tell me where I can download your taxonomy files and your test data? Thank you!
The taxonomy can be found at http://ftp.tue.mpg.de/ebio/projects/struo/GTDB_release89/taxdump/
Can you modify the code to either directly error-out if the LCA cannot be found or at least timeout after a certain amount of time searching for the LCA?
I just updated StrainPro. Please try again and I'll test with your taxonomy files.
I just updated StrainPro. Please try again and I'll test with your taxonomy files.
Do you mean that you added an error-out or time-out?
@nick-youngblut It was a bug in StrainPro-Map. It did not exit the LCA function call when it found a common ancestor. Sorry for the mistake.
Your else break;
change didn't seem to fix the issue. The strainpro-map jobs seem to still be stalling. Note: I did recreate the database via strainpro-build before retrying strainpro-map
You're right. I already checked the two taxids in the while loop. I did not pay attention. Could you give me your read file and database? I was wondering why it stalls. Thank you!
Let me know if that's not enough reads to reproduce the issue
Thank you. And I also need the reference sequences.
My refs: refs.fna.gz
Thank you! I've tried with your reference sequences and your read data. It works fine but it did not output any prediction. I think the reads are not enough to reproduce the issue.
Hi, I found the customized taxonomy dump file (nodes.dmp) is not complete. The taxid of the first genome in refs.fna is 310298 which cannot be found in nodes.dmp. There is no information for that taxid. Therefore, StrainPro-map stalls when it tries to find the LCA of 310298 and another taxid. Do you have any suggestions how do I handle such cases?
Thanks for finding the bug! I used the wrong taxdump file. Those taxIDs are sub-species that I added to a different taxdump.
It would probably be best if strainpro-map just threw in error if the taxdump file is not complete.
Thank you! I've updated StrainPro to v0.9.7. It will check the completeness of the taxonomy dump files and check if every taxid in the reference database could be found in the dump files.
Thanks! That worked. A couple of notes about the output table format:
TaxID #Read_count #Depth #Relative_abundance #Confidence_score
#
is considered a comment character by many parsers, so this could cause some problems for users to read in the table (with column names)
@TaxRank:subspecies
@TaxRank:species
821 82 8 22.16 0.082000
853 288 23 77.84 0.235487
@TaxRank:genus
[...]
Instead of the @TaxRank:
lines, why not just have a column named "TaxRank"? This would make the table a standard tab-delimited table that would be much easier to parse. Users can then easily parse the table by tax rank via tidyverse (dplyr) or other standard means.
Dear Dr. Youngblut,
That was a very good suggestion. I'll modify the output format to meet the standard tab-delimited table.
Best regards,
Hsin-Nan
On Mon, Feb 17, 2020 at 1:34 AM Nick Youngblut notifications@github.com wrote:
Thanks! That worked. A couple of notes about the output table format:
TaxID #Read_count #Depth #Relative_abundance #Confidence_score
is considered a comment character by many parsers, so this could cause
some problems for users to read in the table (with column names)
@TaxRank:subspecies @TaxRank:species 821 82 8 22.16 0.082000 853 288 23 77.84 0.235487 @TaxRank:genus [...]
Instead of the @TaxRank: lines, why not just have a column named "TaxRank"? This would make the table a standard tab-delimited table that would be much easier to parse. Users can then easily parse the table by tax rank via tidyverse (dplyr) or other standard means.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/hsinnan75/StrainPro/issues/5?email_source=notifications&email_token=ACOKJKON3VAVPQ2WCOCCSKDRDF2KXA5CNFSM4KN3BLIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEL4ND2Y#issuecomment-586732011, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACOKJKJ4OXKDEFJNSDOTGRTRDF2KXANCNFSM4KN3BLIA .
I've modified the output format. Please update StrainPro to v0.9.8. Thank you!
Thanks for modifying the format! While the new output will parse as a tsv, it is technically tab+space delimited, since you are using spaces to align columns. The user will have to strip off all spaces from the ends of the string-based values in each column (and then convert from character to numeric as necessary). This will cause confusion from a lot of users
Thanks for the suggestion. I've update StrainPro (still version 0.9.8). The output format is tsv.
Thanks!
One general question about the output: how is one supposed to interpret the confidence scores? I don't see any mention of confidence scores in the bioRxiv paper, and the README only states:
Confidence_score is the confidence score of that prediction.
What is a "good" confidence score? What confidence score cutoff was used for all of your evaluations in the bioRxiv paper?
For StrainPro v0.9.2, it appears that when calling
StrainPro-build
, that executable requiresStrainPro-rep
to be located atbin/StrainPro-rep
relative to the user's current working directory. Thus the user is required to always callStrainPro-build
from the base StrainPro directory.Why not just provide instructions for adding
/path/to/clone/of/strainpro/StrainPro/bin/
to the user's directory and remove this hard-coded path requirement forbin/StrainPro-rep
inbin/StrainPro-build
? That is what thePATH
env variable is for, and it would makebin/StrainPro-build
a lot more flexible.