makovalab-psu / DiscoverY

K-mer based classifier for Y-contig identification from Whole Genome Assemblies
MIT License

specifying bloom filter size/overflowerror #10

Open eocampbe opened 4 years ago

eocampbe commented 4 years ago

Hi there,

I am relatively new to Python and am trying to run discoverY.py in female+male mode using the male_contigs.fasta, kmers_from_male_reads, and female reference assembly (female.fasta) files. I am running Python 3.7.4, and all the dependencies are installed properly. I created the kmers_from_male_reads file using DSK as per the readme file, and the command I used to run discoverY.py is:

python discoverY.py --mode female+male --kmer_size 25

When I run this, I get this output:

    Started DiscoverY
    Mode female+male
    Using default of k=25 and input folder='data'
    Please set bloom filter size before running this program
    Shortlisting Y-contigs
    Need to make Bloom Filter of k-mers from female
    Traceback (most recent call last):
      File "./discoverY.py", line 59, in <module>
        main()
      File "./discoverY.py", line 54, in main
        classify_ctgs.classify_ctgs(k_size, bloom_filt, female_kmers, mode)
      File "/lustre04/scratch/eocampbe/DiscoverY/scripts/classify_ctgs.py", line 142, in classify_ctgs
        female_kmers_bf = getbloomFilter(bf, fem_kmers, kmer_size)
      File "/lustre04/scratch/eocampbe/DiscoverY/scripts/classify_ctgs.py", line 20, in getbloomFilter
        female_kmers_bf = BloomFilter(bf_size, .001, bf_filename)
      File "src/pybloomfilter.pyx", line 87, in pybloomfilter.BloomFilter.__cinit__
    OverflowError: value too large to convert to int

I'm finding it difficult to determine how I might fix this issue. For instance, is the line "Please set bloom filter size before running this program" the source of this error? I can't figure out how I would specify bloom filter size, as there appears to be no option to do so and I can't find any documentation about this in the readme file. Or, is this primarily a memory issue, indicated by the OverflowError? Any help you could give me would be much appreciated!

md5sam commented 4 years ago

Hi @eocampbe ,

This is likely an issue with bloom filter size. I have just now merged a Pull Request submitted by @rsharris which lets you specify the bloom filter size, and this might be useful to you.

To do so, please first perform a git pull to get the latest version of DiscoverY. Then see lines 18-20 of discoverY.py, which show how to specify the bloom filter size using the command line argument "--female_bloom_capacity".

rsharris commented 4 years ago

@eocampbe IIRC, you'll want to specify a bloom filter size that is about the expected length of your genome, minus repeats, i.e. roughly the number of distinct k-mers you expect in your input data. The only downside of setting it too high is that it will use more memory.

I think the default value was about 3G, which relates to the human genome size (but doesn't adjust downward for repeat content). And the corresponding bloom filter data structure was something like 5G bytes.
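To make the memory trade-off concrete, here is a rough sketch of how filter size scales with capacity. It uses the textbook formula for an optimally sized bloom filter (m = -n ln p / (ln 2)^2 bits) and the 0.001 error rate that appears in the traceback above. The function name bloom_filter_bytes is made up for illustration, and pybloomfilter may allocate somewhat differently, so treat the numbers as estimates only.

```python
import math


def bloom_filter_bytes(capacity, error_rate=0.001):
    """Estimate the size in bytes of an optimally sized bloom filter.

    Hypothetical helper, not part of DiscoverY or pybloomfilter.
    Textbook formula: m = -n * ln(p) / (ln 2)^2 bits for n items
    at false-positive rate p.
    """
    bits = -capacity * math.log(error_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


if __name__ == "__main__":
    # Capacity of ~3 billion distinct k-mers (human-sized genome).
    print(f"{bloom_filter_bytes(3_000_000_000) / 1e9:.1f} GB")
    # Capacity of ~214 million distinct k-mers (a much smaller genome).
    print(f"{bloom_filter_bytes(214_000_000) / 1e6:.0f} MB")
```

Under these assumptions a capacity of 3G works out to about 5.4 GB, in line with the "something like 5G bytes" figure, while a capacity around 214 million needs only a few hundred MB.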

eocampbe commented 4 years ago

Thank you @md5sam and @rsharris, this is very helpful!

The female genome size I'm working with is ~214 Mb, so I set that value using the --female_bloom_capacity argument, and it seems to be running now.

eocampbe commented 4 years ago

Hi again @md5sam and @rsharris,

I am now getting another issue when I try to run discoverY.py. When I use the basic command using either a female bloom filter I created OR the example data provided, like this: python3 ./discoverY.py --mode female+male --female_bloom

I get the following error:

    File "./discoverY.py", line 69, in <module>
      main()
    File "./discoverY.py", line 64, in main
      classify_ctgs.classify_ctgs(k_size, bloom_filt, bf_capacity, female_kmers, mode)
    UnboundLocalError: local variable 'bf_capacity' referenced before assignment

Any ideas as to what might be causing this?

rsharris commented 4 years ago

I'm sorry, that was my mistake.

I'll make a correction to my fork and issue a pull request.

I'm not the owner of this repo, though. So, if you want to get up and running right away, the change will be to add "bf_capacity = None" after line 43 in discoverY.py, so that it looks like this:

    if not args['kmer_size']:
        k_size = 25
        bf_capacity = None
    else:

You'd need to be sure to use 8 spaces in front of "bf_capacity", not tab characters.
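In case it helps to see why that one line matters, here is a minimal, self-contained sketch of the same kind of bug. It is not the actual discoverY.py code: the function names and the args dictionary layout are made up for illustration. The point is only that a local variable assigned in one branch but not the other raises UnboundLocalError when the unassigned branch runs.

```python
def parse_broken(args):
    """Hypothetical illustration: bf_capacity is set in only one branch."""
    if not args.get("kmer_size"):
        k_size = 25
    else:
        k_size = int(args["kmer_size"])
        bf_capacity = args.get("female_bloom_capacity")
    # Raises UnboundLocalError whenever the 'if' branch was taken.
    return k_size, bf_capacity


def parse_fixed(args):
    """Same logic, but every branch assigns bf_capacity."""
    if not args.get("kmer_size"):
        k_size = 25
        bf_capacity = None
    else:
        k_size = int(args["kmer_size"])
        bf_capacity = args.get("female_bloom_capacity")
    return k_size, bf_capacity


if __name__ == "__main__":
    print(parse_fixed({}))
    try:
        parse_broken({})
    except UnboundLocalError as err:
        print("broken version failed:", err)
```

Assigning a default such as None in every branch (or once before the if) means the later code can always read the variable and decide what to do when no capacity was given.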

eocampbe commented 4 years ago

Great, thanks! I've added that line and it seems to be working now.

md5sam commented 4 years ago

Thanks @rsharris, I've now merged your PR.