snayfach / MIDAS

An integrated pipeline for estimating strain-level genomic variation from metagenomic data
http://dx.doi.org/10.1101/gr.201863.115
GNU General Public License v3.0
124 stars 52 forks source link

Interpreting output from strain_tracking.py track_markers #106

Open dadahan opened 4 years ago

dadahan commented 4 years ago

I am running the strain tracking pipeline and am curious if someone is familiar with results interpretation. I could not find the relevant info on this repo or the original pub. I am using a slightly modified version of MIDAS that incorporates the recent IGGSearch database.

Scripts run until output:

run_iggsearch.py search \
    --m1 "${FASTQ}/read1.fastq.gz" \
    --m2 "${FASTQ}/read2.fastq.gz" \
    --db_dir "${IGG_DB_PATH}" \
    --outdir "${LOCAL_OUTPUT}/${SAMPLE_NAME}/iggsearch" \
    --threads "${coreNum}"

MIDAS-IGGdb/run.sh snps \
  "${LOCAL_OUTPUT}/${SAMPLE_NAME}" \
 -1 "${FASTQ}/read1.fastq.gz" \
 -2 "${FASTQ}/read2.fastq.gz" \
 -t "${coreNum}" \
 --dbtoc ${IGG_DB_PATH}/metadata/species_info.tsv \
 -d ${IGG_DB_PATH} 

#SAMPLES is a directory with all output samples from MIDAS-IGGdb/run.sh snps

MIDAS-IGGdb/run.sh merge snps "${LOCAL_OUTPUT}" \
 -i "${SAMPLES}" \
 -t dir \
 --threads "${coreNum}" \
 --core_sites

# example of strain_tracking on one OTU

strain_tracking.py id_markers --indir OTU-04863 --out OTU-04863_id_markers --samples mom1,mom2

strain_tracking.py track_markers --indir OTU-04863 --out OTU-04863_marker_sharing --markers OTU-04863_id_markers

Resulting output:

sample1 sample2 count1  count2  count_both      count_either
mom1 mom2 12426   8167    4       20589
mom1 infant1  12426   11158   11125   12459
mom1 infant2  12426   11588   0       24014
mom2 infant1  8167    11158   10      19315
mom2 infant2  8167    11588   8138    11617

The output when running track_markers reported:

0 sample pairs processed

This warning is really throwing me off. I used the --core_sites flag as suggested in #57 .

I do not understand the sample pairs processed output. My interpretation is that mom1 and infant1 share 11125/12459 (89%) of their marker alleles and mom2 and infant2 share 8138/11617 (70%) of their marker alleles and are therefore strain sharing this specie and mom1 and infant2 and mom2 and infant1 share less than 5% of their marker alleles and are therefore not strain sharing this specie. Is this the correct interpretation? Please let me know if I can elaborate anywhere. Thank you in advance.

zhaoc1 commented 4 years ago

I believe the output from track_markers just indicate the progress of sharing-alleles-compute. And I think your understanding about the actual output TSV file is correct.