vinuesa / get_phylomarkers

A pipeline to select optimal markers for microbial phylogenomics and species tree estimation using the multispecies coalescent and concatenation approaches

Remove different number of sequences from output #3

Closed camilo-monge closed 5 years ago

camilo-monge commented 5 years ago

Hello, I am getting the same issue as https://github.com/vinuesa/get_phylomarkers/issues/2. My question is, is there a way to identify which fasta files generate the error given their number, so I can manually remove them instead of having to run the pipe again? Or can I rerun the pipe using -e without having to perform the blasts again? I have limited computing power, so I do not want to wait another week.

Just to clarify, I'm getting this error:

`>>> ERROR: Input FASTA files do not contain the same number of sequences...`

`107 126 128 132 143 145 146 153 154 155 156 157 158 165 234 78 79 80 81 83 85 87 88 91 92 95`

But my fastas have names from 97420 to 100927.

Thank you so much

vinuesa commented 5 years ago

Hi Camilo, thanks for your feedback.

The current version of GET_PHYLOMARKERS assumes that the input clusters (pairs of faa and fna files generated by GET_HOMOLOGUES) have the same number of sequences, 1 per input genome.

The output you display above indicates that this is clearly not the case. Could you please verify this by running this simple command within the directory holding your input cluster files: `grep -c '>' *.fna && grep -c '>' *.faa`?
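If you want to see directly which files are the odd ones out, a small helper along these lines may help. It is only a sketch: `find_odd_clusters` is a hypothetical function name, not part of GET_PHYLOMARKERS, and it assumes that the numbers printed after the error are per-file sequence counts rather than file names, and that every sequence header starts with `>` at the beginning of a line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GET_PHYLOMARKERS): list FASTA files whose
# number of sequences differs from the expected number of genomes.
# Usage: find_odd_clusters EXPECTED_COUNT FILE...
find_odd_clusters() {
    local expected="$1"
    shift
    local f n
    for f in "$@"; do
        [ -f "$f" ] || continue                 # skip globs that matched nothing
        n=$(grep -c '^>' "$f" || true)          # count header lines; prints 0 if none
        if [ "$n" -ne "$expected" ]; then
            printf '%s\t%s\n' "$f" "$n"         # file name, tab, sequence count
        fi
    done
}

# Example call; 12 is a placeholder for your own number of genomes:
# find_odd_clusters 12 *.fna *.faa
```

Each line of output is a file name and its sequence count, so an empty output means all files contain the expected number of sequences.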

I guess the problem is that you have not filtered your set of homologues down to the orthologous clusters. This can be conveniently done with the compare_clusters.pl script from the GET_HOMOLOGUES suite, as detailed in the GET_HOMOLOGUES manual or the GET_PHYLOMARKERS tutorial: https://github.com/vinuesa/get_phylomarkers/blob/master/docs/GET_PHYLOMARKERS_manual.md#get_phylomarkers-tutorial

Please try a line similar to the following one:

`compare_clusters.pl -d ./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_12taxa_algBDBHe1,./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algCOGe0,./KlebsiellapneumoniaeplasmidpNDM-KNNC019153_f0_0taxa_algOMCLe0 -o core_BCM -t 12`

using your own comma-separated list of directory names, and replacing the 12 in -t 12 with the number of genomes in your dataset.

Then run this line a second time, adding the -n flag, so that you also get the corresponding *.fna files.
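As a rough illustration of the two runs described above, the following sketch builds the comma-separated directory list and prints both command lines, without and with -n, so you can check them before executing anything. The function names `join_dirs` and `print_compare_cmds` are hypothetical helpers; only the -d, -o, -t and -n flags come from the command shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: they only print the commands, they do not run them.

# Join all arguments with commas, the format used for compare_clusters.pl -d.
join_dirs() {
    local IFS=,
    printf '%s\n' "$*"
}

# Print the two compare_clusters.pl calls: the first without -n,
# the second with -n to also obtain the corresponding *.fna files.
# Usage: print_compare_cmds N_GENOMES OUT_DIR DIR...
print_compare_cmds() {
    local n_genomes="$1"
    local out_dir="$2"
    shift 2
    local dirs
    dirs=$(join_dirs "$@")
    echo "compare_clusters.pl -d $dirs -o $out_dir -t $n_genomes"
    echo "compare_clusters.pl -d $dirs -o $out_dir -t $n_genomes -n"
}

# Example call with placeholder directory names:
# print_compare_cmds 12 core_BCM ./dir_algBDBHe1 ./dir_algCOGe0 ./dir_algOMCLe0
```

Once the printed lines look right, they can be copied to the prompt and run from the directory that holds the GET_HOMOLOGUES output directories.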

Hope this helps and fixes your problem. Don't hesitate to ask for more feedback, should you need it.

camilo-monge commented 5 years ago

Thanks for the help sir. I see the problem now, thanks to the command that you gave me. As I understand it, comparing clusters requires running several clustering algorithms; is that strictly necessary to run the application properly? I have limited computing resources, so I was trying to avoid that.

vinuesa commented 5 years ago

You are welcome Camilo! Contact me if you encounter any further problems with your runs. Good luck!

vinuesa commented 5 years ago

You do not need to run the different clustering algorithms. That is only required if you want to compute consensus clusters, i.e., those found by all three algorithms (BDBH, COGtriangles and OMCL) currently implemented in GET_HOMOLOGUES. Depending on the goals of your analyses, you could run the fast (heuristic) BDBH algorithm, invoked with `get_homologues.pl -b`, but only if you want to quickly get a core genome, for example to run GET_PHYLOMARKERS. This option is not useful for pan-genome analyses. In general, I'd recommend running `get_homologues.pl -M -t 0`, which uses OMCL (-M) to compute clusters of all sizes (-t 0). This allows you to compute both pan-genomes and core genomes, the latter by running `compare_clusters.pl -d ./your_f0_0taxa_algOMCLe0 -o core_M -t <your_number_of_genomes>`, as explained above.

Hope this helps