marbl / harvest

Other
51 stars 11 forks source link

Problem running parsnp #3

Closed ejfresch closed 10 years ago

ejfresch commented 10 years ago

First of all...thank you very much for this software. I have a problem running parsnp on Linux (I am using the precompiled package). I cannot get rid of this error:

lfreschi@katak:~/sw/harvest/all/harvest-Linux64-v1.0.1> ./parsnp -v -d gen/ -r PA14.fasta Warning: Cannot determine OS, defaulting to linux |--Parsnp v1.0--| For detailed documentation please see --> http://harvest.readthedocs.org/en/latest sh: symbol lookup error: /scratch/_MEIKYhGiu/libreadline.so.5: undefined symbol: PC sh: symbol lookup error: /scratch/_MEIKYhGiu/libreadline.so.5: undefined symbol: PC sh: symbol lookup error: /scratch/_MEIKYhGiu/libreadline.so.5: undefined symbol: PC sh: symbol lookup error: /scratch/_MEIKYhGiu/libreadline.so.5: undefined symbol: PC sh: symbol lookup error: /scratch/_MEIKYhGiu/libreadline.so.5: undefined symbol: PC sh: symbol lookup error: /scratch/_MEIKYhGiu/libreadline.so.5: undefined symbol: PC


SETTINGS: |-aligner: libMUSCLE |-seqdir: gen/ |-outdir: /project/rclevesq/users/lfreschi/sw/harvest/all/harvest-Linux64-v1.0.1/P_2014_10_20_120717242134 |-OS: linux |-threads: 32


-->Reading Genome (asm, fasta) files from gen/.. |->[OK] -->Reading Genbank file(s) for reference (.gbk) .. |->[WARNING]: no genbank file provided for reference annotations, skipping.. -->Calculating MUMi.. /bin/bash: symbol lookup error: /scratch/_MEIKYhGiu/libreadline.so.5: undefined symbol: PC ERROR The following command failed:

/scratch/_MEIKYhGiu/parsnp /project/rclevesq/users/lfreschi/sw/harvest/all/harvest-Linux64-v1.0.1/P_2014_10_20_120717242134/all_mumi.ini Please veryify input data and restart Parsnp. If the problem persists please contact the Parsnp development team. ERROR

I set the TMP, TMPDIR and TEMP environmental variables and I am sure that on /scratch there is enough space. From the documentation it is not clear to me what I should do.

UPDATE: I tried to install the Mac version of parsnp on my laptop and, well, it works perfectly. However, it would be really important for me to be able to run the analyses on Linux. Could you please give some hints about how to solve this problem?

treangen commented 10 years ago

hi,

I tried to install the Mac version of parsnp on my laptop and, well, it works perfectly.

good to know it is working as it should on OSX

However, it would be really important for me to be able to run the analyses on Linux.

could you provide a few additional details about your Linux distribution/OS ? Looks like it probably is related to a well known libncurses vs libtermcap issue (https://bugzilla.redhat.com/show_bug.cgi?id=499837). Building Parsnp from source should resolve the error, but I will try to rebuild the binary explicitly linking to ncurses & send it to you to see if this helps.

ejfresch commented 10 years ago

thank you very much for your answer.

could you provide a few additional details about your Linux distribution/OS ?

Sure. The server has a SUSE Linux Enterprise Server 11 linux distribution. The kernel version is 3.0.101-0.35-default

Building Parsnp from source should resolve the error, but I will try to rebuild the binary explicitly linking to ncurses & send it to you to see if this helps.

I am really thankful. Yes, this would definitively help me.

ejfresch commented 10 years ago

I also tried to compile the software on Linux:

git clone https://github.com/marbl/parsnp.git parsnp_src
cd parsnp_src/muscle/
./autogen.sh
./configure --prefix=`pwd`
make install
cd ..
./autogen.sh
./configure --prefix=`pwd`
make install
export PARSNPDIR=`pwd`

I did not get any error, but when I run the program I get this message:

./parsnp -r ../../../../tasks/comparison_genomes_PA14/PA14refVSseq/PA14.fasta -d ../../../../tasks/tree_harvest/run4/ -g ../../../../tasks/comparison_genomes_PA14/PA14refVSseq/NC_008463.gbk
 Cannot open reference file !

The reference is a fasta file and the path is correct:

lfreschi@katak:~/sw/harvest/parsnp_src/bin> head ../../../../tasks/comparison_genomes_PA14/PA14refVSseq/PA14.fasta
>gi|116048575|ref|NC_008463.1| Pseudomonas aeruginosa UCBPP-PA14 chromosome, complete genome
TTTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGCAGGTCAATTTCCAACCAGCGATGACGTAATAG
ATAGATACAAGGAAGTCATTTTTCTTTTAAAGGATAGAAACGGTTAATGCTCTTGGGACGGCGCTTTTCT
GTGCATAACTCGACGAAGCCCAGCAACTGCGTGTTTCTCCGGCAGGCAAAAGGTTGTCGAGAACCGGTGT
CGAGGCTGTTTCCTTCCTGAGCGAAGCCTGGGGATGAACGAGATGGTTATCCACAGCGGTTTTTTCCACA
CGGCTGTGCGCAGGGATGTCCCCCCTTCAAAGCAAGGGTTATCCACAAAGTCCAGGACGACCGTCCGTCG
GCCTGCCTGCTTTTATTAAGGTCTTGATTTGCTTGGGGCCTCAGCGCATCGGCATGTGGATAAGTCCGGC
CCGTCCGGCTACAATAGGCGCTTATTTCGTTGTGCCGCCTTTCCAATCTTTGGGGGATATCCGTGTCCGT
GGAACTTTGGCAGCAGTGCGTGGATCTTCTCCGCGATGAGCTGCCGTCCCAACAATTCAACACCTGGATC
CGTCCCTTGCAGGTCGAAGCCGAAGGCGACGAATTGCGTGTGTATGCACCCAACCGTTTCGTCCTCGATT

If you think it may help I can post the messages I get for the different steps of the installation.

treangen commented 10 years ago

either '-r' or '-g' options should be specified, but not both as the reference fasta file is parsed out from the genbank (gbk) file. can you try again with only -g specified ?

ejfresch commented 10 years ago

Thank you very much for your reply.

either '-r' or '-g' options should be specified, but not both as the reference fasta file is parsed out from the genbank (gbk) file. can you try again with only -g specified ?

Sure. Here is the output:

lfreschi@katak:~/sw/harvest/parsnp_src/bin> ./parsnp -g ../../../../tasks/comparison_genomes_PA14/PA14refVSseq/NC_008463.gbk -d ../../../../tasks/tree_harvest/run4/
 Cannot open reference file !

I also check that the path and the file are ok:

lfreschi@katak:~/sw/harvest/parsnp_src/bin> head ../../../../tasks/comparison_genomes_PA14/PA14refVSseq/NC_008463.gbk
LOCUS                            6537648 bp    dna     circular UNK
DEFINITION  Pseudomonas aeruginosa UCBPP-PA14 chromosome, complete genome.
ACCESSION   NC_008463
COMMENT     Product name confidence:
            Class 1: Function experimentally demonstrated in same species.
            Class 2: Function of highly similar gene experimentally
            demonstrated in another organism (and gene context consistent in
            terms of pathways its involved in, if known).
            Class 3: Function proposed based on presence of conserved amino
            acid motif, structural feature or limited sequence similarity to
treangen commented 10 years ago

Looks like a write permission related issue, when parsing the GenBank file a fasta file is written to the input GenBank file directory, but it doesn't exist due to an error (likely read-only dir). You can confirm this by looking for a NC_008463.gbk.fna file in:

../../../../tasks/comparison_genomes_PA14/PA14refVSseq/

1) If indeed the case, you can resolve this by copying the reference gbk file to a different (non read-only) directory, or alternatively by using the -r option to indicate a fasta reference file. I'll update Parsnp to output a much more descriptive error message.

2) If this is not the case, can you paste the line after "[Reference]" in the INI file (located in the output directory) ? This will report where it is expecting the reference fasta file to be located on disk.

ejfresch commented 10 years ago

I first checked the scenario (1):

lfreschi@katak:~/sw/harvest/parsnp_src/bin> chmod 777 ../../../../tasks/comparison_genomes_PA14/PA14refVSseq/
lfreschi@katak:~/sw/harvest/parsnp_src/bin> ls -l ../../../../tasks/comparison_genomes_PA14/
total 8
drwxr-xr-x 3 lfreschi rclevesq 4096 Oct 20 10:03 PA14refVSmvfr
drwxrwxrwx 2 lfreschi rclevesq 4096 Oct 17 14:30 PA14refVSseq
lfreschi@katak:~/sw/harvest/parsnp_src/bin> ./parsnp -g ../../../../tasks/comparison_genomes_PA14/PA14refVSseq/NC_008463.gbk -d ../../../../tasks/tree_harvest/run4/
 Cannot open reference file !

and

lfreschi@katak:~/sw/harvest/parsnp_src/bin> ./parsnp -r ../../../../tasks/comparison_genomes_PA14/PA14refVSseq/PA14.fasta -d ../../../../tasks/tree_harvest/run4/
 Cannot open reference file !

For the scenario (2): I do not get any INI file and if I just run the program without any parameter this is what I get:

lfreschi@katak:~/sw/harvest/parsnp_src/bin> ./parsnp
ERROR: No ini file specified!
treangen commented 10 years ago

I should have noticed this sooner, to run parsnp, you will need to use:

python Parsnp.py -g <input.gbk> -d <genome dir>

You were running the parsnpAligner binary, not the Parsnp.py python run script. Please disregard my earlier suggestions as I failed to noticed this; the run instructions are slightly different since this is compiled from source.

Also, you will need the following Python packages installed to run:

numpy, multiprocessing

and optionally:

dendropy

Also, you may need to link the parsnpAligner binary to the same location as Parsnp.py (if install did not do this), via:

cd ~/sw/harvest/parsnp_src
ln -s ./bin/parsnpA_linux parsnpA

In parallel, I will try to send you the compiled binary, but please keep me posted if you experience further issues etc.

ejfresch commented 10 years ago

Hello, thank you very much. Things are starting to work. Here is my output now:

lfreschi@katak:~/tasks/tree_harvest> python ../../sw/harvest/parsnp_src/Parsnp.py -g NC_008463.gbk -d run4/ -p10
|--Parsnp v1.0--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
**********************************************************************************************
SETTINGS:
|-aligner:      libMUSCLE
|-seqdir:       run4/
|-outdir:       /project/rclevesq/users/lfreschi/tasks/tree_harvest/P_2014_10_23_084402102215
|-OS:           Linux
|-threads:      10
**********************************************************************************************
-->Reading Genome (asm, fasta) files from run4/..
  |->[OK]
-->Reading Genbank file(s) for reference (.gbk) NC_008463.gbk..
  |->[OK]
-->Calculating MUMi..
  |->[OK]
-->Running Parsnp multi-MUM search and libMUSCLE aligner..
  |->[OK]
-->Determining repetitive regions..
  |->[SKIP]
-->Running PhiPack on LCBs to detect recombination..
  |->[SKIP]
-->Reconstructing core genome phylogeny..
  |->[OK]
-->Creating Gingr input file..
  |->[OK]
-->Calculating wall clock time..
  |->Aligned 95 genomes in 21.62 minutes

<<Parsnp finished! All output available in /project/rclevesq/users/lfreschi/tasks/tree_harvest/P_2014_10_23_084402102215>>

Validating output directory contents...
        1)parsnp.tree:          newick format tree                      [OK]
        2)parsnp.vcf:           VCF format SNP calls                    [OK]
        3)parsnp.ggr:           harvest input file for gingr (GUI)      [OK]
        4)parsnp.xmfa:          XMFA formatted multi-alignment          [OK]

The doubts I have now are: (1) why the software is skipping some steps of the analysis? (2) there is no parsnpA_linux file in my bin folder:

lfreschi@katak:~/sw/harvest/parsnp_src/bin> ls
fasttree_linux  fasttree_osx  harvest_linux  harvest_osx  parsnp  Profile_linux  Profile_osx

Is it ok?

ejfresch commented 10 years ago

When I have a lot of genomes, the program takes 7 genomes and builds a tree with them. How can I get the tree of all sequences?

Example of the program output (>200 genomes):

lfreschi@katak:~/tasks/tree_harvest> python ../../sw/harvest/parsnp_src/Parsnp.py -g PA14.gbk -d db_strains_pseudo -v
|--Parsnp v1.0--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
**********************************************************************************************
SETTINGS:
|-aligner:      libMUSCLE
|-seqdir:       db_strains_pseudo
|-outdir:       /project/rclevesq/users/lfreschi/tasks/tree_harvest/P_2014_10_24_141642412016
|-OS:           Linux
|-threads:      32
**********************************************************************************************
-->Reading Genome (asm, fasta) files from db_strains_pseudo..
  |->[OK]
-->Reading Genbank file(s) for reference (.gbk) PA14.gbk..
  |->[OK]
-->Calculating MUMi..

*****************************************************

parsnpAligner:: rapid whole genome SNP typing

*****************************************************

ParSNP: Preparing to construct global multiple alignment framework

Step 1: Preparing to verify and process input sequences...
Calculating mumi distances..

        Step 2a: constructing compressed suffix graph...
        Step 2b: Calculting pairwise MUMi distances...
  |->[OK]
RECRUITED GENOMES:

        Ook6Oqu7.fasta
        dei1Iequ.fasta
        aeHogh7r.fasta
        dei3Cief.fasta
        fie7OWay.fasta
        air3Xa9i.fasta

-->Running Parsnp multi-MUM search and libMUSCLE aligner..

*****************************************************

parsnpAligner:: rapid whole genome SNP typing

*****************************************************

ParSNP: Preparing to construct global multiple alignment framework

Step 1: Preparing to verify and process input sequences...
Step 2: Searching for initial MUM anchors...

        Step 2a: constructing compressed suffix graph...
        Step 2b: performing initial search for exact matches in the sequences...
Step 3: Performing recursive MUM search between MUM anchors...
Step 4: Filtering spurious matches...
Step 5: Creating and verifying final LCBs...
Step 6: Calculating and creating inter-LCB regions...
Step 7: Writing output files & aligning LCBs...
ParSNP: Finished core genome alignment
  |->[OK]
-->Determining repetitive regions..
  |->[SKIP]
-->Running PhiPack on LCBs to detect recombination..
  |->[SKIP]
-->Reconstructing core genome phylogeny..
FastTree Version 2.1.5 SSE3, OpenMP (60 threads)
Alignment: /project/rclevesq/users/lfreschi/tasks/tree_harvest/P_2014_10_24_141642412016/parsnp.snps.mblocks
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 100
Search: Exhaustive (slow) +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: no
ML Model: Jukes-Cantor, CAT approximation with 20 rate categories
Ignored unknown character X (seen 8 times)
Initial topology in 0.05 seconds
Refining topology: 11 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
      0.10 seconds: SPR round   1 of   2, 1 of 12 nodes
      0.41 seconds: ME NNI round 4 of 11, 1 of 5 splits
      0.75 seconds: ME NNI round 7 of 11, 1 of 5 splits
Total branch-length 3.048 after 0.79 sec
      0.96 seconds: ML NNI round 1 of 6, 1 of 5 splits
ML-NNI round 1: LogLk = -118702.562 NNIs 2 max delta 0.09 Time 1.22
      1.23 seconds: Site likelihoods with rate category 1 of 20
      1.34 seconds: Site likelihoods with rate category 10 of 20
      1.45 seconds: Site likelihoods with rate category 19 of 20
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.639 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
ML-NNI round 2: LogLk = -117933.860 NNIs 1 max delta 0.00 Time 1.69
Turning off heuristics for final round of ML NNIs (converged)
      1.69 seconds: ML NNI round 3 of 6, 1 of 5 splits
ML-NNI round 3: LogLk = -117933.860 NNIs 0 max delta 0.00 Time 1.86 (final)
      1.86 seconds: ML Lengths 1 of 5 splits
Optimize all lengths: LogLk = -117933.860 Time 2.01
      2.41 seconds: Site likelihoods with rate category 1 of 20
      2.52 seconds: Site likelihoods with rate category 11 of 20
      2.64 seconds: Optimizing alpha round 1
      3.14 seconds: Optimizing alpha round 2
Gamma(20) LogLk = -118772.234 alpha = 9.986 rescaling lengths by 1.121
Total time: 3.43 seconds Unique: 7/7 Bad splits: 0/4
  |->[OK]
-->Creating Gingr input file..
  |->[OK]
-->Calculating wall clock time..
  |->Aligned 7 genomes in 8.33 minutes

<<Parsnp finished! All output available in /project/rclevesq/users/lfreschi/tasks/tree_harvest/P_2014_10_24_141642412016>>

Validating output directory contents...
        1)parsnp.tree:          newick format tree                      [OK]
        2)parsnp.vcf:           VCF format SNP calls                    [OK]
        3)parsnp.ggr:           harvest input file for gingr (GUI)      [OK]
        4)parsnp.xmfa:          XMFA formatted multi-alignment          [OK]
treangen commented 10 years ago

By default, Parsnp employs a conservative approach to genome recruitment from the provided database. In short, it looks at the collection of MUMi distances with respect to the selected reference genome and automatically determine a cutoff based on the distribution of values. This in practice results in Parsnp recruiting only the most closely related genomes (highly similar strains), and can result in the exclusion of several of the provided genomes (as in your case with your Pseudomonas genome collection). To force the alignment of all of the genomes in the provided directory (+/- 30% reference genome size), add the to the command-line parameters:

-c

Which indicates the database is 'curated' and should be trusted. However, this may result in under -alignment of the set of interest depending on the divergence levels within the provided collection.

An alternative option would be to increase the maximum MUMi value allowed, via:

-U 0.05

This should result in additional genomes being recruited without forcing the alignment of all of the genomes in your database.

Regarding your two earlier points, both are ok. PhiPack is disabled by default (enable with -x) and the binary is correctly named.

treangen commented 10 years ago

Closing due to resolution of original issue.