iqbal-lab / cortex

reference free variant assembly
32 stars 13 forks source link

Running the run_calls wrapper with different numbers of samples produces different SNPs #17

Open bioinfo11 opened 6 years ago

bioinfo11 commented 6 years ago

Hi Zam,

I have three bacterial samples that I want to use to look for SNPs against an assembled reference. To do this, I have tried two methods using the run_calls wrapper. The first was to run each sample against the reference individually. The second, was to run all three samples together against the reference.

The command I used was:

/path/to/CORTEX_release_v1.0.5.17/scripts/calling/run_calls.pl --first_kmer 31 --fastaq_index /path/to/index --auto_cleaning yes --bc yes --pd no --outdir output --outvcf outputvcf --ploidy 1 --stampy_hash stampy_refhash --stampy_bin /path/to/stampy-1.0.31/stampy.py --list_ref_fasta list_ref_fasta --refbindir /path/to/working_directory --genome_size 2800000 --qthresh 5 --mem_height 17 --mem_width 1000 --vcftools_dir /path/to/vcftools_0.1.11 --do_union yes --ref CoordinatesOnly --workflow joint --logfile logfile_output log_1.txt

When I compared the two methods using the 'wk_flow_J_RefCO_FINALcombined_BC_calls_at_all_k.decomp.vcf' I was expecting to see the same SNPs/positions called for each sample, regardless of the method used, however this was not the case. The only difference between the two methods was the index files used:

For running samples individually (e.g. for sample 1):

cat index sample1 . /path /to/sample1_R1 /path /to/sample1_R2

For running samples together:

cat index sample1 . /path /to/sample1_R1 /path /to/sample1_R2 sample2 . /path /to/sample2_R1 /path /to/sample2_R2 sample3 . /path /to/sample3_R1 /path /to/sample3_R2

Am I doing something wrong?

Thanks

iqbal-lab commented 6 years ago

When you pass in just one sample, and use the joint workflow with CoordinatesOnly, you are asking it to find bubbles in the graph of just one sample. So by definition all you can spot are paralogs, or heterozygous sites.

If you want to catch times when the sample differs from the reference, use

--ref CoordinatesAndInCalling Check the figure here: https://www.ncbi.nlm.nih.gov/pubmed/23172865

Am I making sense? Have I missed something?