gtonkinhill / panstripe

post processing of bacterial pangenome gene presence/absence matrices
GNU General Public License v2.0
Long processing times? #17

Closed geboro closed 3 weeks ago

geboro commented 1 month ago

Hello!

I did not find any reference to expected processing times. I have seven bacterial genomes (3.9 Mb), and panstripe(pa, tree) has been running for 16 hrs on a server with 32 Xeon CPUs (64-bit) and 512 GB RAM. I restarted the job a couple of times after ~6 hrs, and it displayed several "glm.fit: algorithm did not converge" warnings, which led me to believe it might actually be running correctly.

Is this normal? Do you have a rough estimate of how long processing takes?

Cheers and thanks for panaroo/panstripe!

gtonkinhill commented 1 month ago

Hi,

That's strange. Panstripe would normally run in a matter of minutes on much larger datasets. It sounds like there might be an error causing the process to hang. What are the dimensions of your presence/absence matrix?

geboro commented 1 month ago

Hello again,

The dimensions of the presence/absence matrix are 7 x 4216, and I generated it with panaroo. Most of the columns are '1' in all species, since according to panaroo 3245 of them are core genes.
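For readers who want to check the same numbers on their own data, here is a minimal Python sketch that reports the matrix dimensions and the core-gene count. It assumes a tab-separated table with one row per gene cluster, one 0/1 column per genome, and the gene name in the first column (the layout I would expect from Panaroo's gene_presence_absence.Rtab, but verify against your file). The function name summarise_presence_absence and the toy table are hypothetical, and "core" here simply means present in every genome.

```python
import csv
import io


def summarise_presence_absence(rtab_text):
    """Return (n_genomes, n_genes, n_core) for a gene-by-genome 0/1 table.

    Assumes tab-separated text with a header row, the gene name in the
    first column and one 0/1 column per genome (an assumed layout).
    """
    reader = csv.reader(io.StringIO(rtab_text), delimiter="\t")
    header = next(reader)
    n_genomes = len(header) - 1  # first column holds the gene name
    n_genes = 0
    n_core = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        calls = [int(value) for value in row[1:]]
        n_genes += 1
        # A gene counts as core here only if every genome carries it.
        if sum(calls) == n_genomes:
            n_core += 1
    return n_genomes, n_genes, n_core


# Toy table (made up for illustration): 3 genomes, 4 gene clusters.
toy_rtab = (
    "Gene\tg1\tg2\tg3\n"
    "geneA\t1\t1\t1\n"
    "geneB\t1\t0\t1\n"
    "geneC\t1\t1\t1\n"
    "geneD\t0\t0\t1\n"
)

n_genomes, n_genes, n_core = summarise_presence_absence(toy_rtab)
print(f"{n_genomes} genomes x {n_genes} genes, {n_core} core, "
      f"{n_genes - n_core} accessory")
```

With the figures quoted in this thread (4216 gene clusters, 3245 of them core), 971 accessory genes remain across 7 genomes, which is the part of the matrix the model actually has to work with.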

I stopped the process after two days, and warnings() returned 50 instances of the same message: glm.fit: algorithm did not converge.

Thanks a lot!

geboro commented 1 month ago

OK, I re-ran it with fewer bootstraps. Just specifying the number of bootstraps with nboot=100 made the run finish, but with a new error that I guess is the culprit:

Error in value[[3L]](cond) : 
  Panstripe model fit failed! This can sometime be caused by unusual branch lengths.
Setting fit_method='glmmTMB' or family='quasipoisson' or 'gaussian' often provides a more stable fit to difficult datasets
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error in value[[3L]](cond) : 
  Panstripe model fit failed! This can sometime be caused by unusual branch lengths.
Setting fit_method='glmmTMB' or family='quasipoisson' or 'gaussian' often provides a more stable fit to difficult datasets
In addition: Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: algorithm did not converge 
3: glm.fit: algorithm did not converge 
4: glm.fit: algorithm did not converge 

So I re-ran it using fit_method='glmmTMB', but it failed again with this error:

Timing stopped at: 0.004 0 0.005
Error in (function (start, objective, gradient = NULL, hessian = NULL,  : 
  NA/NaN gradient evaluation
In addition: There were 11 warnings (use warnings() to see them)
Timing stopped at: 0.005 0 0.004
Error in (function (start, objective, gradient = NULL, hessian = NULL,  : 
  NA/NaN gradient evaluation
Timing stopped at: 0.005 0 0.005
Error in (function (start, objective, gradient = NULL, hessian = NULL,  : 
  NA/NaN gradient evaluation
Timing stopped at: 0.005 0 0.005
Error in (function (start, objective, gradient = NULL, hessian = NULL,  : 
  NA/NaN gradient evaluation
Timing stopped at: 0.004 0 0.005
Error in (function (start, objective, gradient = NULL, hessian = NULL,  : 
  NA/NaN gradient evaluation
Timing stopped at: 0.004 0 0.005
Error in (function (start, objective, gradient = NULL, hessian = NULL,  : 
  NA/NaN gradient evaluation
Error in statistic(data, i[r, ], ...) : 
  Model fitting failed to converge in bootstrap replicate!

Finally, I used family='gaussian', as suggested in the documentation, and this time it finished successfully.

In conclusion, I guess there are too few species and they are too closely related, so there are too few genes with differential abundance across species to support the statistical analyses?

Cheers

gtonkinhill commented 1 month ago

Hi,

Yes, 7 genomes is not likely to be sufficient. I would use descriptive statistics instead for a dataset of this size. Perhaps a histogram (or 'U-plot') of the size of each gene cluster would be suitable.
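As a rough illustration of the descriptive summary suggested above, the following Python sketch counts how many genomes carry each gene cluster and prints the counts as a text histogram. It is not part of panstripe: the function names cluster_size_counts and text_histogram and the toy matrix are hypothetical, and the input is assumed to be a genomes-by-genes 0/1 matrix held as a list of lists.

```python
from collections import Counter


def cluster_size_counts(matrix):
    """Count gene clusters by the number of genomes that carry them.

    matrix: list of rows, one per genome, each a list of 0/1 calls
    with one entry per gene cluster (assumed layout).
    """
    n_genes = len(matrix[0])
    sizes = [sum(row[j] for row in matrix) for j in range(n_genes)]
    return Counter(sizes)


def text_histogram(counts, n_genomes, width=40):
    """Render one bar per cluster size from 1 to n_genomes."""
    peak = max(counts.values())
    lines = []
    for k in range(1, n_genomes + 1):
        n = counts.get(k, 0)
        bar = "#" * round(width * n / peak)
        lines.append(f"{k:>3} | {bar} {n}")
    return "\n".join(lines)


# Toy matrix (made up for illustration): 3 genomes x 5 gene clusters.
toy = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
]

counts = cluster_size_counts(toy)
print(text_histogram(counts, n_genomes=len(toy)))
```

In a typical pangenome the bars are tallest at the two ends (genes found in a single genome and genes found in all of them), which gives the plot its "U" shape.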