iqbal-lab / cortex

reference free variant assembly

run_calls.pl line 2137 error #29

Closed Srividhya-Sainath closed 2 years ago

Srividhya-Sainath commented 2 years ago

Hi,

I am working with a few E. coli strains and want to do reference-free variant calling using cortex. Here is my command:

perl run_calls.pl --first_kmer 31 --fastaq_index /group/bioinf_ecoli_kmer/cortex/INDEX --auto_cleaning yes --genome_size 4000000 --bc yes --pd no --outdir ./results/ --outvcf result_trial1 --ploidy 1 --ref Absent --mem_height 18 --mem_width 100 --do_union yes --workflow joint --logfile logfile_trial1.txt --apply_pop_classifier --vcftools_dir /home/bioinf/vidhy/anaconda3/pkgs/vcftools-0.1.16-he513fc3_4/

I get the following error:

Unable to build /group/bioinf_ecoli_kmer/cortex/scripts/calling/results/binaries/uncleaned/31/SRR14272538.unclean.kmer31.ctx at run_calls.pl line 2137

My Index file:

SRR14272538 . /group/bioinf/cortex/raw/SRR14272538_1.fastq /group/bioin/cortex/raw/SRR14272538_2.fastq
SRR14272623 . /group/bioinf/cortex/raw/SRR14272623_1.fastq /group/bioinf/cortex/raw/SRR14272623_2.fastq
SRR14272622 . /group/bioinf/cortex/raw/SRR14272622_1.fastq /group/bioinf/cortex/raw/SRR14272622_2.fastq

This is relatively new for me. I would be grateful if you could help me with what I am missing here.

Thank you

iqbal-lab commented 2 years ago

If you look in the binaries/uncleaned/31 directory, and look in the log file there, is there an error at the end? Could you attach that file here please?

Srividhya-Sainath commented 2 years ago

Thank you for the quick response. Unfortunately, I don't have access to the file right now. But the message mentioned something along the lines of memory use for the hash table, and asked whether we had used a quality filter to reduce the memory footprint.
I have a total of 103 E. coli strains, and the genome size varies. What would you suggest for calculating the correct --mem_width and --mem_height?

iqbal-lab commented 2 years ago

First we need to estimate how many kmers you have in 103 E. coli. Due to its open pan-genome, it's more than just those implied by a 5Mb genome plus SNPs. Let's say for now that if we concatenated all the genes from the 103 genomes, the result would be 15Mb long. So let's guess 15 million kmers, and guess 15 million kmers due to sequencing errors.

This means we should choose mem_height and mem_width such that 2^mem_height × mem_width is about 15 million.
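
As a rough sketch of the sizing rule above, the Python snippet below computes the hash table capacity 2^mem_height × mem_width and finds the smallest mem_height that covers a target kmer count for a given mem_width. The function names are illustrative and not part of cortex, and fixing mem_width at 100 is only an assumption taken from the command earlier in this thread.

```python
def hash_capacity(mem_height: int, mem_width: int) -> int:
    """Number of kmers the hash table can hold: 2^mem_height * mem_width."""
    return (2 ** mem_height) * mem_width


def min_mem_height(target_kmers: int, mem_width: int) -> int:
    """Smallest mem_height such that 2^mem_height * mem_width >= target_kmers."""
    # Number of buckets needed, rounded up.
    buckets_needed = -(-target_kmers // mem_width)
    # Smallest h with 2^h >= buckets_needed, using integer arithmetic only.
    return (buckets_needed - 1).bit_length()


if __name__ == "__main__":
    target = 15_000_000  # kmer estimate from the discussion above
    width = 100          # assumed mem_width, as in the original command
    height = min_mem_height(target, width)
    print(height, hash_capacity(height, width))
```

With these inputs the sketch lands on mem_height 18, which gives a capacity of about 26 million kmers, somewhat above the 15 million estimate.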

How much RAM will that need? See section 7 of the manual for details. The formula is

8+5C+1 bytes per kmer, where C is the number of samples, here 103.

8+5×103+1=524 bytes per kmer. Multiply by 15 million kmers, and that makes about 7.9×10^9 bytes, i.e. roughly 8 GB.
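
To double-check the arithmetic, here is a small Python sketch of the memory estimate. It uses only the formula quoted in this thread (8 + 5C + 1 bytes per kmer); the function names are hypothetical, and the 15 million kmer figure is the guess made above rather than a measured value. Evaluating the formula for C = 103 gives 524 bytes per kmer.

```python
def bytes_per_kmer(num_samples: int) -> int:
    """Per-kmer memory from the formula quoted above: 8 + 5*C + 1 bytes."""
    return 8 + 5 * num_samples + 1


def total_bytes(num_kmers: int, num_samples: int) -> int:
    """Estimated memory for num_kmers kmers across num_samples samples."""
    return num_kmers * bytes_per_kmer(num_samples)


if __name__ == "__main__":
    samples = 103        # number of E. coli strains in this thread
    kmers = 15_000_000   # guessed kmer count from the discussion above
    per_kmer = bytes_per_kmer(samples)
    total = total_bytes(kmers, samples)
    print(f"{per_kmer} bytes per kmer, {total / 1e9:.2f} x 10^9 bytes in total")
```

Changing `samples` or `kmers` shows how quickly the requirement grows: the per-kmer cost is linear in the number of samples, so the total scales with both.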

iqbal-lab commented 2 years ago

did this work out ok @vidhya-sai ?

Srividhya-Sainath commented 2 years ago

Hi, Yes this helped and I was able to make it work. Thank you.