jtlovell / GENESPACE

"result would exceed 2^31-1 bytes" #170

Open CEPHAS-01 opened 1 month ago

CEPHAS-01 commented 1 month ago

Hi and thanks for this beautiful comparative genomics tool.

I was trying out GENESPACE on our HPC system using the human and sheep assemblies from NCBI, but ran into the following error when running parse_annotations: "result would exceed 2^31-1 bytes".

I have checked and I am sure that the machine is a 64-bit architecture. Any suggestions on how to resolve this?

Temitayo

LovellHAGSC commented 1 month ago

huh - funny you should mention this ... I just broke the dev version of DEEPSPACE with this same error. This happens when trying to generate an integer larger than 2^31-1 ... for example, the position coordinate of a sequence > ~2.1Gb. I can't imagine how this would happen with GENESPACE though. Can you print the exact error and the step it came at?
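For reference, the 2^31-1 limit mentioned here can be seen directly in base R. This is a minimal sketch and is not taken from GENESPACE or DEEPSPACE; the variable names are illustrative only.

```r
# R stores integers as 32-bit signed values, so the largest one is 2^31 - 1.
max_int <- .Machine$integer.max
print(max_int)

# Adding 1L overflows: R returns NA (with a warning) rather than wrapping.
overflowed <- suppressWarnings(max_int + 1L)
print(is.na(overflowed))

# Doubles represent whole numbers exactly up to 2^53, so a coordinate past
# ~2.1 Gb can be held as numeric instead of integer.
big_pos <- as.numeric(max_int) + 1
print(big_pos == 2^31)
```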

CEPHAS-01 commented 4 weeks ago

Oh I see

The parse annotation step produced the error.

parsedPaths <- parse_annotations(
  rawGenomeRepo = "genespace/source",
  genomeDirs = c("human", "sheep"),
  genomeIDs = c("human", "sheep"),
  gffString = "gff",
  faString = "fasta",
  genespaceWd = "genespace/workspace")

Error in paste(fa[1:100], collapse = "") : result would exceed 2^31-1 bytes
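The message itself comes from base R: a single character string cannot hold more than 2^31-1 bytes, so paste(..., collapse = "") fails when the joined result would be larger than that. A rough way to check in advance whether a character vector would fit is sketched below. The helper names collapsed_bytes and fits_in_one_string are hypothetical and are not part of GENESPACE, and the short fa vector is stand-in data.

```r
# Hypothetical helper: number of bytes a collapse with sep = "" would need.
# The per-element counts are converted to double before summing so that the
# sum itself cannot overflow the integer range.
collapsed_bytes <- function(x) {
  sum(as.numeric(nchar(x, type = "bytes")))
}

# Hypothetical helper: TRUE if the collapsed result fits in one R string.
fits_in_one_string <- function(x) {
  collapsed_bytes(x) <= .Machine$integer.max
}

# Stand-in data: three short protein-like strings of 10 characters each.
fa <- c("MKTAYIAKQR", "QISFVKSHFS", "RQLEERLGLI")
print(collapsed_bytes(fa))
print(fits_in_one_string(fa))
```

If a check like this fails on the first 100 entries of a file, those entries are probably whole chromosomes rather than protein sequences.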

The genomes I am working with are quite large - human ~3GB and sheep ~2.8GB

Perhaps some of the data types need to be changed to increase the storage range.

LovellHAGSC commented 4 weeks ago

I don't think that's it ... unless all the chromosomes got concatenated. Pine broke it, and it has several chromosomes that are as large as the entire Hg38 human genome.

CEPHAS-01 commented 4 weeks ago

The chromosomes were not concatenated. I used the genome as downloaded from NCBI.

LovellHAGSC commented 4 weeks ago

Can you post the URLs to the files you downloaded from NCBI?

CEPHAS-01 commented 4 weeks ago

Sure.

Human genome and protein sequence from here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/
[GCF_000001405.40_GRCh38.p14_genomic.fna.gz]
[GCF_000001405.40_GRCh38.p14_protein.faa.gz]

Sheep genome and protein sequence from here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/772/045/GCF_016772045.2_ARS-UI_Ramb_v3.0/
[GCF_016772045.2_ARS-UI_Ramb_v3.0_genomic.fna.gz]
[GCF_016772045.2_ARS-UI_Ramb_v3.0_protein.faa.gz]

LovellHAGSC commented 4 weeks ago

Did you try to pass parse_annotations these files? You want the translated_cds.faa.gz and genomic.gff.gz. See: https://htmlpreview.github.io/?https://github.com/jtlovell/tutorials/blob/main/genespaceGuide.html

CEPHAS-01 commented 4 weeks ago

Yes, the parse_annotations stage produced the error. I was using the protein.faa.gz and not the translated_cds.faa.gz. Perhaps this is the reason. Stepping away from my desk shortly, I will test it with translated_cds.faa.gz and give you feedback. Thanks!

LovellHAGSC commented 4 weeks ago

It should give a more informative error than that if you gave it the protein fa ... that one just doesn't parse right. I was wondering if you fed the genomic.fna.gz as a gff.
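The kind of mix-up discussed in this thread (a genomic FASTA sitting where a GFF or a peptide FASTA is expected) can be checked before calling parse_annotations by peeking at the first lines of each file. The sketch below uses only base R. guess_file_type is a hypothetical helper, not a GENESPACE function, and the 0.9 nucleotide-fraction cutoff is an arbitrary assumption.

```r
# Hypothetical helper: guess whether a (possibly gzipped) file is a GFF,
# a nucleotide FASTA, or a protein FASTA from its first n lines.
guess_file_type <- function(path, n = 50) {
  con <- gzfile(path, open = "rt")   # gzfile also reads uncompressed files
  on.exit(close(con))
  lines <- readLines(con, n = n, warn = FALSE)
  lines <- lines[nzchar(lines)]
  if (length(lines) == 0) return("empty")

  if (startsWith(lines[1], ">")) {
    # FASTA: look at the residues to tell nucleotide from protein.
    seq_lines <- lines[!startsWith(lines, ">")]
    residues <- toupper(paste(seq_lines, collapse = ""))
    if (!nzchar(residues)) return("fasta (unknown alphabet)")
    frac_nt <- nchar(gsub("[^ACGTN]", "", residues)) / nchar(residues)
    # Assumed cutoff: mostly A/C/G/T/N means nucleotide sequence.
    return(if (frac_nt > 0.9) "nucleotide fasta" else "protein fasta")
  }

  # GFF: non-comment lines carry nine tab-separated columns.
  body <- lines[!startsWith(lines, "#")]
  n_cols <- lengths(strsplit(body, "\t", fixed = TRUE))
  if (length(body) > 0 && all(n_cols == 9)) return("gff")
  "unknown"
}

# Small stand-in files written to a temporary location.
tmp_gff <- tempfile(fileext = ".gff")
writeLines(c("##gff-version 3",
             "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene1"), tmp_gff)
tmp_faa <- tempfile(fileext = ".faa")
writeLines(c(">prot1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"), tmp_faa)
tmp_fna <- tempfile(fileext = ".fna")
writeLines(c(">chr1", "ACGTACGTNNACGTTTGACA"), tmp_fna)

print(guess_file_type(tmp_gff))
print(guess_file_type(tmp_faa))
print(guess_file_type(tmp_fna))
```

A file reported as "nucleotide fasta" where a GFF or peptide file is expected would point to the wrong download being used.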