hsinnan75 / StrainPro

MIT License
7 stars 3 forks source link

Cannot access file: bin/StrainPro-rep #5

Open nick-youngblut opened 4 years ago

nick-youngblut commented 4 years ago

For StrainPro v0.9.2, it appears that when calling StrainPro-build, that executable requires StrainPro-rep to be located at bin/StrainPro-rep relative to the user's current working directory. Thus the user is required to always call StrainPro-build from the base StrainPro directory.

Why not just provide instructions for adding /path/to/clone/of/strainpro/StrainPro/bin/ to the user's directory and remove this hard-coded path requirement for bin/StrainPro-rep in bin/StrainPro-build? That is what the PATH env variable is for, and it would make bin/StrainPro-build a lot more flexible.

hsinnan75 commented 4 years ago

Thank you for the suggestion. I've updated StrainPro to version v0.9.4. It will read the env variable (StrainPro_DIR=/path/StrainPro/bin) to find all StrainPro's tools. And I also change -dump to specify the taxonomy path (-dump /TaxonomyPath). Please let me know if there are any problems while running StrainPro. Thank you!

nick-youngblut commented 4 years ago

Thanks! I just tested both feature updates and they both work.

Just curious: why use StrainPro_DIR for setting the path to the executables? Most users will likely add /their/path/to/StrainPro/bin/ to their PATH, so StrainPro_DIR would be redundant.

hsinnan75 commented 4 years ago

@nick-youngblut It's more convenient to have a variable indicating the path of StrainPro. If I only parse $PATH, I couldn't be sure users will use StrainPro as the directory name. In that case, StrainPro will not be able to locate the executable's path.

nick-youngblut commented 4 years ago

If I only parse $PATH, I couldn't be sure users will use StrainPro as the directory name

Why would you need to parse $PATH? Why not just see if StrainPro-rep is in the PATH? For python this simply involves:

from distutils.spawn import find_executable
if find_executable('cat') is None:
    raise OSError('"cat" executable is not in your PATH')
nick-youngblut commented 4 years ago

With v0.9.4 I'm getting the following error when running StrainPro-build with -dump:

Load taxonomy information.
Error! File: ./bin/scripts/taxonomy/nodes.dmp is not accesible

I'm using the directory path to the taxdump files, which is not ./bin/scripts/taxonomy/

hsinnan75 commented 4 years ago

I'll study how to find the path of a program in C/C++. In the latest version of StrainPro, "-dump" is followed by a dump file path. For example, run StrainPro with "-dump ./taxonomy", then StrainPro will look for nodes.dmp and merged.dmp in the folder of ./taxonomy. However, if you don't specify the path, it will look for the StrainPro_Root_Dir/taxonomy as default. For example, if your StrainPro is installed in /path/StrainPro, then the dmp file path will be /path/StrainPro/taxonomy.

nick-youngblut commented 4 years ago

if you don't specify the path, it will look for the StrainPro_Root_Dir/taxonomy as default.

but I did provide the path via -dump /path/to/the/taxdump/. Is there a problem with using an absolute path?

I'll study how to find the path of a program in C/C++

The downside of using C/C++: everything is harder than with python or R. One option would be to use whereami

nick-youngblut commented 4 years ago

@hsinnan75 I just wanted to ping you about the -dump /path/to/the/taxdump/ issue, as explained above. Any progress on fixing that issue?

hsinnan75 commented 4 years ago

Sorry, I forgot to fix this issue. I'll do it asap and let you know.

hsinnan75 commented 4 years ago

Hi, I just found a bug. I forgot to include "-dump " when StrainPro-build asks StrainPro-rep to find representative sequences. Sorry for the delay again. And please show me the message if it does not work again. Thank you!

nick-youngblut commented 4 years ago

Thanks for the fix! That worked, but now striainpro-map stalls when I run it. Here's the command:

DUMPDIR=`/path/to/by/custom/taxdump/`
DBDIR=`/path/to/db/created/by/strainpro-build`
export StrainPro_DIR=`dirname ./bin/scripts/StrainPro/StrainPro-map`

./bin/scripts/StrainPro/StrainPro-map -t 8            -dump $DUMPDIR            -i $DBDIR -o output.txt   -f /path/to/read/file/file.fq.gz

I get the following messages within a few seconds, and then strainpro-map just stalls:

Load taxonomy information.
Load BWT index: /path/to/StrainPro/ref_idx/0 (1/2)
    2000000 reads have been processed (1.20% mapped) in 6 seconds
Load BWT index: /path/to/StrainPro/ref_idx/1 (2/2)
    180000 reads have been aligned in 1 seconds
nick-youngblut commented 4 years ago

It might be worth it to use continuous integration to help identify bugs in the StrainPro code. This is very simple with Travis-CI or Circle-CI. For travis-CI, you'd just need to create a .travis.yml similar to https://github.com/nick-youngblut/gtdb_to_taxdump/blob/master/.travis.yml

hsinnan75 commented 4 years ago

I'll say the bug should be the identification of LCA for two taxIDs. If they don't have any common ancestor, the identification of LCA will stall. Could you tell me where I can download your taxonomy files and your test data? Thank you!

nick-youngblut commented 4 years ago

The taxonomy can be found at http://ftp.tue.mpg.de/ebio/projects/struo/GTDB_release89/taxdump/

Can you modify the code to either directly error-out if the LCA cannot be found or at least timeout after a certain amount of time searching for the LCA?

hsinnan75 commented 4 years ago

I just updated StrainPro. Please try again and I'll test with your taxonomy files.

nick-youngblut commented 4 years ago

I just updated StrainPro. Please try again and I'll test with your taxonomy files.

Do you mean that you added an error-out or time-out?

hsinnan75 commented 4 years ago

@nick-youngblut It was a bug in StrainPro-Map. It did not exit the LCA function call when it found a common ancestor. Sorry for the mistake.

nick-youngblut commented 4 years ago

Your else break; change didn't seem to fix the issue. The strainpro-map jobs seem to still be stalling. Note: I did recreate the database via strainpro-build before retrying strainpro-map

hsinnan75 commented 4 years ago

You're right. I already checked the two taxids in the while loop. I did not pay attention. Could you give me your read file and database? I was wondering why it stalls. Thank you!

nick-youngblut commented 4 years ago

reads.fq.gz

Let me know if that's not enough reads to reproduce the issue

hsinnan75 commented 4 years ago

Thank you. And I also need the reference sequences.

nick-youngblut commented 4 years ago

My refs: refs.fna.gz

hsinnan75 commented 4 years ago

Thank you! I've tried with your reference sequences and your read data. It works fine but it did not output any prediction. I think the reads are not enough to reproduce the issue.

hsinnan75 commented 4 years ago

Hi, I found the customized taxonomy dump file (nodes.dmp) is not complete. The taxid of the first genome in refs.fna is 310298 which cannot be found in nodes.dmp. There is no information for that taxid. Therefore, StrainPro-map stalls when it tries to find the LCA of 310298 and another taxid. Do you have any suggestions how do I handle such cases?

nick-youngblut commented 4 years ago

Thanks for finding the bug! I used the wrong taxdump file. Those taxIDs are sub-species that I added to a different taxdump.

It would probably be best if strainpro-map just threw in error if the taxdump file is not complete.

hsinnan75 commented 4 years ago

Thank you! I've updated StrainPro to v0.9.7. It will check the completeness of the taxonomy dump files and check if every taxid in the reference database could be found in the dump files.

nick-youngblut commented 4 years ago

Thanks! That worked. A couple of notes about the output table format:

TaxID #Read_count #Depth #Relative_abundance #Confidence_score

# is considered a comment character by many parsers, so this could cause some problems for users to read in the table (with column names)

@TaxRank:subspecies
@TaxRank:species
821                     82                      8                       22.16                   0.082000
853                     288                     23                      77.84                   0.235487
@TaxRank:genus
[...]

Instead of the @TaxRank: lines, why not just have a column named "TaxRank"? This would make the table a standard tab-delimited table that would be much easier to parse. Users can then easily parse the table by tax rank via tidyverse (dplyr) or other standard means.

hsinnan75 commented 4 years ago

Dear Dr. Youngblut,

That was a very good suggestion. I'll modify the output format to meet the standard tab-delimited table.

Best regards,

Hsin-Nan

On Mon, Feb 17, 2020 at 1:34 AM Nick Youngblut notifications@github.com wrote:

Thanks! That worked. A couple of notes about the output table format:

TaxID #Read_count #Depth #Relative_abundance #Confidence_score

is considered a comment character by many parsers, so this could cause

some problems for users to read in the table (with column names)

@TaxRank:subspecies @TaxRank:species 821 82 8 22.16 0.082000 853 288 23 77.84 0.235487 @TaxRank:genus [...]

Instead of the @TaxRank: lines, why not just have a column named "TaxRank"? This would make the table a standard tab-delimited table that would be much easier to parse. Users can then easily parse the table by tax rank via tidyverse (dplyr) or other standard means.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/hsinnan75/StrainPro/issues/5?email_source=notifications&email_token=ACOKJKON3VAVPQ2WCOCCSKDRDF2KXA5CNFSM4KN3BLIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEL4ND2Y#issuecomment-586732011, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACOKJKJ4OXKDEFJNSDOTGRTRDF2KXANCNFSM4KN3BLIA .

hsinnan75 commented 4 years ago

I've modified the output format. Please update StrainPro to v0.9.8. Thank you!

nick-youngblut commented 4 years ago

Thanks for modifying the format! While the new output will parse as a tsv, it is technically tab+space delimited, since you are using spaces to align columns. The user will have to strip off all spaces from the ends of the string-based values in each column (and then convert from character to numeric as necessary). This will cause confusion from a lot of users

hsinnan75 commented 4 years ago

Thanks for the suggestion. I've update StrainPro (still version 0.9.8). The output format is tsv.

nick-youngblut commented 4 years ago

Thanks!

One general question about the output: how is one supposed to interpret the confidence scores? I don't see any mention of confidence scores in the bioRxiv paper, and the README only states:

Confidence_score is the confidence score of that prediction.

What is a "good" confidence score? What confidence score cutoff was used for all of your evaluations in the bioRxiv paper?