steineggerlab / ufcg

UFCG: Universal Fungal Core Genes
https://ufcg.steineggerlab.com
GNU General Public License v3.0

Java OutOfBoundsException on some genomes #8

Open BrigidaGallone opened 1 year ago

BrigidaGallone commented 1 year ago

Hello,

I am using the latest version of the pipeline (v1.0.3) and testing it on a large group of genomes from NCBI (very variable in quality), using the profile module with both the PRO and BUSCO sets. For some genomes I got a Java exception, as follows:

[Screenshot of the Java exception, 2023-02-24]

In this case, the genome GCA_017580835.1 completed PRO profiling without errors, but BUSCO profiling failed. A few genomes failed with the same error during PRO profiling: GCA_900068945.1, GCA_900068915.1, GCA_900069095.1, GCA_900068965.1, GCA_900068985.1, GCA_018221805.1, GCA_900068975.1, GCA_900068955.1.

What does the error mean and do you have any idea about what is causing it?

Thanks a lot for your help and the amazing pipeline!

Best, Brigida

endixk commented 1 year ago

Dear Brigida,

Thank you for reaching out! Based on the error message you provided, it appears that the result file of the fastBlockSearch run is corrupted or improperly formatted. You can run the pipeline single-threaded with the --dev option to identify the problematic sub-command.

I attempted to reproduce the error using the assemblies associated with the accession numbers you provided. However, I was able to successfully run both BUSCO profiling of GCA_017580835.1 and PRO profiling of GCA_900068945.1 without any errors on my system. If possible, please provide the assembly files that caused the issue for further investigation.

One feature these assemblies share is a large number of extremely short DNA contigs. This significantly slowed the computation on my system and may be the cause of the error you reported. My hypothesis is that discarding FASTA entries shorter than a given length threshold (e.g., 1,000 bp) may resolve this issue.
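
To test that hypothesis, the short contigs can be filtered out before profiling. The following is only a minimal sketch in plain Python, not part of UFCG: the function name `filter_fasta`, its `min_len` parameter, and the 60-column line wrapping are all assumptions made for illustration.

```python
def filter_fasta(src, dst, min_len=1000):
    """Copy FASTA records from src to dst, dropping sequences shorter than min_len.

    src and dst are open text file objects. Returns (kept, dropped) record counts.
    Hypothetical helper for illustration; UFCG does not provide this function.
    """
    kept = dropped = 0
    header, chunks = None, []

    def flush():
        # Write the record collected so far if it meets the length threshold.
        nonlocal kept, dropped
        if header is None:
            return
        seq = "".join(chunks)
        if len(seq) >= min_len:
            dst.write(header + "\n")
            # Re-wrap the sequence at 60 columns (an arbitrary choice).
            for i in range(0, len(seq), 60):
                dst.write(seq[i:i + 60] + "\n")
            kept += 1
        else:
            dropped += 1

    for line in src:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            flush()  # finish the previous record before starting a new one
            header, chunks = line, []
        else:
            chunks.append(line)
    flush()  # do not forget the last record
    return kept, dropped


# Example usage (file names are placeholders):
# with open("assembly.fna") as src, open("assembly.filtered.fna", "w") as dst:
#     kept, dropped = filter_fasta(src, dst, min_len=1000)
```

If the filtered assembly profiles cleanly while the original fails, that would support the short-contig explanation; if it still fails, the cause likely lies elsewhere.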