makovalab-psu / DiscoverY

K-mer based classifier for Y-contig identification from Whole Genome Assemblies
MIT License

memory #14

Open RAWWiberg opened 2 years ago

RAWWiberg commented 2 years ago

Hi, this is probably related to issues #13 and #12, but since neither has a helpful answer, I have opened a new one. I am running DiscoverY on what is probably a large dataset (though the genome size is ~1.2 Gb, so smaller than human), and I keep getting a Python MemoryError:

Started DiscoverY
Mode female+male
Using default of k=25 and input folder='data'
Shortlisting Y-contigs
Need to make Bloom Filter of k-mers from female
Done creating bloom filter
Generating a dictionary from kmers in kmers_from_male_reads
Traceback (most recent call last):
  File "discoverY.py", line 70, in <module>
    main()
  File "discoverY.py", line 65, in main
    classify_ctgs.classify_ctgs(k_size, bloom_filt, bf_capacity, female_kmers, mode)
  File "/scratch/24769731/DiscoverY/scripts/classify_ctgs.py", line 143, in classify_ctgs
    classify_fm_male_mode(kmer_size, female_kmers_bf)
  File "/scratch/24769731/DiscoverY/scripts/classify_ctgs.py", line 52, in classify_fm_male_mode
    kmer_abundance_dict_from_male = kmers.make_dict_from_kmer_abundance(reads_kmers, kmer_size)
  File "/scratch/24769731/DiscoverY/scripts/kmers.py", line 44, in make_dict_from_kmer_abundance
    kmer_dicts[line[:kmer_size]] = current_abundance
MemoryError

The input files end up quite large:

1.7G female.bloom
23G female_kmers
191G kmers_from_male_reads
1.2G male_contigs.fasta

However, I requested compute resources with 1 TB of RAM, and the usage stats say that the job uses a maximum of 600 GB.
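
For context, here is a rough back-of-envelope estimate of how much memory a plain Python dict built from kmers_from_male_reads might need. It is only a sketch under assumptions that are not confirmed by the DiscoverY source: each line of the file is taken to be a 25-mer plus a separator, a short count, and a newline (about 28 bytes), and the per-entry cost is a lower bound (the key string object plus three 8-byte table fields) that ignores hash-table slack and the value objects. The function names are hypothetical, and the exact figures vary with Python version and platform.

```python
import sys

# Assumed inputs (taken from the file listing above; the line layout is a guess).
FILE_SIZE_BYTES = 191 * 1024**3      # 191G kmers_from_male_reads
KMER_SIZE = 25                       # default k reported in the log
ASSUMED_BYTES_PER_LINE = KMER_SIZE + 3  # k-mer + separator + 1-digit count + newline


def estimate_entries(file_size_bytes, bytes_per_line):
    """Approximate number of k-mer lines (dict entries) in the file."""
    return file_size_bytes // bytes_per_line


def per_entry_lower_bound(kmer_size):
    """Lower bound on bytes per dict entry.

    Counts the key string object plus 8 bytes each for the stored hash,
    key pointer, and value pointer. Hash-table slack and the integer
    value objects are ignored, so real usage is expected to be higher.
    """
    key_bytes = sys.getsizeof("A" * kmer_size)
    return key_bytes + 3 * 8


def estimate_dict_gib(file_size_bytes, kmer_size, bytes_per_line):
    """Lower-bound estimate, in GiB, for a dict holding every k-mer."""
    entries = estimate_entries(file_size_bytes, bytes_per_line)
    return entries * per_entry_lower_bound(kmer_size) / float(1024**3)


if __name__ == "__main__":
    n = estimate_entries(FILE_SIZE_BYTES, ASSUMED_BYTES_PER_LINE)
    gib = estimate_dict_gib(FILE_SIZE_BYTES, KMER_SIZE, ASSUMED_BYTES_PER_LINE)
    print("approx. entries: {:,}".format(n))
    print("lower-bound dict size: {:.0f} GiB".format(gib))
```

If these assumptions are roughly right, the file holds several billion k-mers and the dict alone would need several hundred GiB before any overhead, which would be consistent with the job reaching about 600 GB and then failing when the dict tries to grow further.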

Any help would be appreciated.