limin321 commented 5 years ago

Hello, I am teaching myself how to run panX using the TestSet. Here is the command I ran, exactly following the instructions:

./panX.py -fn data/TestSet/ -sl TestSet -t 32 > TestSet.log 2> TestSet.err

However, I couldn't get the expected results. Here is the error:

Traceback (most recent call last):
  File "./panX.py", line 287, in <module>
    myPangenome.process_clusters()
  File "/Users/dklabuser/limin/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
    myClusterCollector.estimate_raw_core_diversity()
  File "/Users/dklabuser/limin/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
    self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
  File "/Users/dklabuser/limin/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
    calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
  File "/Users/dklabuser/limin/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
    with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: '/Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

When I run panX on my own bacterial strains, using *.gbk files produced by prokka, I get exactly the same problem. Comparing against the step-by-step tutorial, the problem seems to start in either step 5 or step 6.

Could anyone help me solve this issue? I would really appreciate it. It is hard for bioinformatics beginners to tackle all these problems. Thank you so much.

rneher commented 5 years ago

Please make sure you have all required dependencies installed: FastTree, RAxML, mcl, etc. If it still doesn't work, please provide the log file. Best, Richard

limin321 commented 5 years ago

Hi rneher, thank you for the quick response. I have all the dependencies installed. I used "conda install -c bioconda raxml", and it reports that the packages are installed. However, when I run raxml, it says "command not found". According to the following link, though, I think RAxML is already installed: http://www.metagenomics.wiki/tools/phylogenetic-tree/construction/raxml

After checking that all dependencies work, I ran "./panX.py -fn data/TestSet -sl TestSet -t 32 > TestSet.log 2> TestSet.err" again and got the same result as yesterday; it still failed. Here are the log file contents:

Running panX in main folder: /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/
====== step01: strain list successfully loaded
====== starting step03: extract sequences from GenBank file
====== time for step03: 0.01 minutes (0.56 seconds)

====== starting step04: extract metadata from GenBank file
====== time for step04: 0.01 minutes (0.36 seconds)

====== starting step05: cluster proteins
diamond inputfile: reference.faa
diamond build index (makedb): 0.00 minutes (0.04 seconds)
command line record: /Users/dklabuser/miniconda2/envs/panX/bin/diamond makedb -p 32 --in /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/reference.faa -d /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/nr_reference> /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/diamond_makedb_reference.log 2>&1
diamond alignment (blastp): 0.00 minutes (0.11 seconds)
diamond_max_target_seqs used: 600
command line record: /Users/dklabuser/miniconda2/envs/panX/bin/diamond blastp --sensitive -p 32 -e 0.001 --id 0 --query-cover 0 --subject-cover 0 -k 600 -d /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/nr_reference -f 6 qseqid sseqid bitscore -q /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/reference.faa -o /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/query_matches.m8 -t ./ > /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/diamond_blastp_reference.log 2>&1
command line mcl: mcl /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/filtered_hits.abc --abc -o /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/allclusters.tsv -I 1.5 -te 32 > /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/mcl.log 2>&1
mcl runtime: 0.00 minutes (0.01 seconds)

====== time for step05: 0.00 minutes (0.19 seconds)

====== starting step06: align genes in geneCluster by mafft and build gene trees

Thank you so much. Looking forward to your answer.

rneher commented 5 years ago

What is inside these two files?

/Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/diamond_blastp_reference.log
/Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/mcl.log

limin321 commented 5 years ago

Here is what is inside:

/Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/diamond_blastp_reference.log

diamond v0.9.24.125 | by Benjamin Buchfink buchfink@gmail.com
Licensed under the GNU GPL https://www.gnu.org/licenses/gpl.txt
Check http://github.com/bbuchfink/diamond for updates.

CPU threads: 32

Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: ./
Opening the database... [6.6e-05s]

Target sequences to report alignments for: 600

Opening the input file... [5.7e-05s]
Opening the output file... [0.000136s]
Loading query sequences... [0.004388s]
Masking queries... [0.029102s]
Building query seed set... [0.000313s]
Algorithm: Double-indexed
Building query histograms... [0.021124s]
Allocating buffers... [0.000515s]
Loading reference sequences... [0.001095s]
Building reference histograms... [0.020022s]
Allocating buffers... [0.000508s]
Initializing temporary storage... Too many open files [0.024015s]
Error: Error opening file .//diamond-tmp-UGcaP8

/Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/mcl.log

[mclxIOstreamIn] no assignments yield void/empty matrix
[mcl] new tab created
___ [mclAlgorithmStart] attempting to cluster the void
[mcl] pid 53992
 ite chaos time hom(avg,lo,hi) m-ie m-ex i-ex fmv
 1 0.00 0.00 0.00/340282346638528859811704183484516925440.00/0.00 0.00 0.00 0.00 -2147483648
[mcl] jury pruning marks: <100,100,100>, out of 100
[mcl] jury pruning synopsis: <100.0 or really really really good> (cf -scheme, -do log)
[mcl] output is in /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/allclusters.tsv
[mcl] 0 clusters found
[mcl] output is in /Users/dklabuser/limin/pan-genome-analysis/data/TestSet/protein_faa/diamond_matches/allclusters.tsv

Please cite:
Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
( http://www.library.uu.nl/digiarchief/dip/diss/1895620/full.pdf or http://micans.org/mcl/lit/svdthesis.pdf.gz)
OR
Stijn van Dongen, A cluster algorithm for graphs. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000.
( http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z or http://micans.org/mcl/lit/INS-R0010.ps.Z)

iferres commented 5 years ago

Hi, I'm having the same issue. I think it happens because no core genes were found, so this refinement step fails. I'm benchmarking PanX (and other pangenome reconstruction tools), and I get this issue when I challenge PanX with highly divergent pangenomes. I was wondering if you could improve the error handling here or suggest an ad hoc solution. I want to include PanX in the analysis, but I need a final gene clustering to evaluate it. Thanks!!

fbaumdicker commented 5 years ago

Hi Ignacio, for pangenomes with high diversity there are a couple of parameters which might improve the clustering:

The default diamond e-value cutoff might be too low to report enough hits for the subsequent clustering step. You could try higher cutoffs with the option -dme. Alternatively, it is also possible to provide blast all-against-all results with the -bp option.
You can lower the mcl inflation parameter with the -imcl option to get larger mcl clusters.
There is a soft core option in panX, where you can define the threshold for a gene to be considered part of the core genome. E.g. -cg 0.8 will consider gene clusters as core if the gene appears in more than 80% of your genomes.
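
As a rough illustration of how these three options combine on one command line, here is a small Python sketch that only assembles and prints the argument list; it does not run panX. The flags `-dme`, `-imcl`, and `-cg` and the value `-cg 0.8` come from this thread. The values `0.01` and `1.2` and the helper name `build_panx_command` are hypothetical placeholders chosen to be looser than the `-e 0.001` and `-I 1.5` visible in the log above, not recommendations.

```python
import shlex


def build_panx_command(folder, species, threads=32,
                       evalue=None, inflation=None, core_cutoff=None):
    """Assemble a panX argument list; options left as None are not added."""
    cmd = ["./panX.py", "-fn", folder, "-sl", species, "-t", str(threads)]
    if evalue is not None:
        cmd += ["-dme", str(evalue)]        # diamond e-value cutoff
    if inflation is not None:
        cmd += ["-imcl", str(inflation)]    # mcl inflation parameter
    if core_cutoff is not None:
        cmd += ["-cg", str(core_cutoff)]    # soft core genome threshold
    return cmd


# Hypothetical values for -dme and -imcl; -cg 0.8 is the example from the thread.
cmd = build_panx_command("data/TestSet/", "TestSet",
                         evalue=0.01, inflation=1.2, core_cutoff=0.8)
print(" ".join(shlex.quote(part) for part in cmd))
```

Changing one option at a time makes it easier to see which of them actually affects the number of core clusters.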

rneher commented 5 years ago

@limin321 terribly sorry I forgot to answer here. your diamond run failed and no alignments were produced. Does diamond run by itself?
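
The diamond log above stops at `Too many open files`, which usually points to the per-process limit on open file descriptors rather than to a missing dependency. Below is a minimal sketch for inspecting that limit from Python on a Unix-like system. Whether raising the limit, or running with fewer threads than `-t 32`, fixes this particular run is an assumption that has not been verified in this thread; the function names are illustrative.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to the word 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = open_file_limits()
print("soft limit:", describe(soft))
print("hard limit:", describe(hard))
```

In a shell, `ulimit -n` reports the same soft limit.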

@iferres: do you have the exact same issue that diamond fails? or is the problem further down the road with clustering of diamond output or the core genome analysis? Do you have any output in the geneCluster directory? does the protein_faa/diamond_matches/allclusters.tsv contain output?

I fully agree that we need better error handling. There are in fact a lot of things that I'd like to improve....
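
One form such error handling could take is a check that the intermediate files from step 5 exist and are non-empty before step 6 starts. The sketch below is not part of panX. The file names are the ones that appear in the log earlier in this thread, and the function name `empty_or_missing` is hypothetical.

```python
import os
import tempfile

# File names as they appear in the step05 log earlier in this thread.
EXPECTED_OUTPUTS = ["query_matches.m8", "filtered_hits.abc", "allclusters.tsv"]


def empty_or_missing(folder, filenames=EXPECTED_OUTPUTS):
    """Return the names of files that are absent from folder or have size zero."""
    problems = []
    for name in filenames:
        path = os.path.join(folder, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(name)
    return problems


# Demonstration on a throwaway directory: one file with content,
# one empty file, and one file that is never created.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "query_matches.m8"), "w") as handle:
        handle.write("geneA\tgeneB\t100.0\n")
    open(os.path.join(demo_dir, "allclusters.tsv"), "w").close()
    demo_result = empty_or_missing(demo_dir)

print(demo_result)
```

On a real run, `folder` would be the `protein_faa/diamond_matches` directory of the dataset.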

iferres commented 5 years ago

I'm having this error:

  File "/pan-genome-analysis/panX.py", line 287, in <module>
    myPangenome.process_clusters()
  File "/pan-genome-analysis/scripts/pangenome_computation.py", line 180, in process_clusters
    myClusterCollector.estimate_raw_core_diversity()
  File "/pan-genome-analysis/scripts/cluster_collective_processing.py", line 17, in estimate_raw_core_diversity
    self.folders_dict, self.strain_list, self.threads, self.core_genome_threshold, self.factor_core_diversity, self.species)
  File "/pan-genome-analysis/scripts/sf_core_diversity.py", line 102, in estimate_core_gene_diversity
    calculated_core_diversity=tmp_average_core_diversity(tmp_core_seq_path)
  File "/pan-genome-analysis/scripts/sf_core_diversity.py", line 42, in tmp_average_core_diversity
    with open(file_path+'tmp_core_diversity.txt', 'r') as tmp_core_diversity_file:
IOError: [Errno 2] No such file or directory: '/export/home/iferres/Desktop/benchmark/panx/dataset/protein_faa/diamond_matches/tmp_core/tmp_core_diversity.txt'

...but only when I run PanX against a highly diverse pangenome.

The file protein_faa/diamond_matches/allclusters.tsv does exist. Should I use that one? I was expecting to use the allclusters_final.tsv file, which does not exist in the cases where I get the above error.

Nice piece of software btw. Thanks!

rneher commented 5 years ago

Ok. your problem is that it didn't identify any core genes. That is not uncommon when using very diverse sequences. Did you try running with -cg 0.8 (https://github.com/neherlab/pan-genome-analysis/blob/master/advanced_options.md#core-genome-cutoff)?

you should be able to just pick things up at step 6, no need to rerun the diamond and clustering steps.

iferres commented 5 years ago

Great, I will try changing this parameter and the ones suggested by @fbaumdicker . Thanks again!

davised commented 2 years ago

I just got this same error.

Would it be possible to add a check to see if there are any core proteins before attempting to continue?
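
A minimal sketch of what such a check might look like, written as a standalone function rather than as a patch to panX. It assumes that each line of the cluster file lists the genes of one cluster separated by whitespace, and that each gene ID has the form `strain|locus_tag`; both are assumptions about the file format. The coverage criterion is a simplification, not panX's own definition of a core gene, and the name `count_core_clusters` is hypothetical.

```python
def count_core_clusters(cluster_lines, n_strains, core_cutoff=1.0):
    """Count clusters whose genes cover at least core_cutoff of all strains.

    Assumes one cluster per line, whitespace-separated gene IDs,
    and gene IDs of the form 'strain|locus_tag'.
    """
    n_core = 0
    for line in cluster_lines:
        genes = line.split()
        if not genes:
            continue  # skip blank lines
        strains = {gene.split("|", 1)[0] for gene in genes}
        if len(strains) >= core_cutoff * n_strains:
            n_core += 1
    return n_core


# Toy input with three strains (s1, s2, s3) and three clusters.
toy_clusters = [
    "s1|g1\ts2|g1\ts3|g1",   # present in all three strains
    "s1|g2\ts2|g2",          # present in two of three strains
    "s3|g9",                 # singleton
]

strict_core = count_core_clusters(toy_clusters, n_strains=3)
soft_core = count_core_clusters(toy_clusters, n_strains=3, core_cutoff=0.6)
print("strict core clusters:", strict_core)
print("soft core clusters:", soft_core)

if count_core_clusters([], n_strains=3) == 0:
    print("no core clusters found; a pipeline could stop here with a clear message")
```

Stopping with an explicit message at this point would replace the IOError on tmp_core_diversity.txt with something that names the actual cause.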

neherlab / pan-genome-analysis

Fail to run TestSet using panX #23
