leylabmpi / SynTracker

SynTracker is a pipeline, designated to determine the biological relatedness of conspecific strains (microbial strains of the same species) using genome synteny.
24 stars 5 forks source link

No avg_synteny_scores files are generated #9

Open asierFernandezP opened 4 days ago

asierFernandezP commented 4 days ago

Dear all,

First of all thanks for developing this tool. I am currently trying to run it using a set of reference viral genomes (identified from gut metagenomes) and the metagenomic contigs from multiple samples (as the target genomes).

The command that I used is the following:

# Run SynTracker with default parameters 
python /scratch/hb-llnext/conda_envs/SynTracker_env/SynTracker/syntracker.py \
    -target ${contig_dir} \
    -ref ${reference_genomes_dir} \
    -out ${output_dir} \
    --identity 97 \
    --coverage 70 \
    -length 5000

The example output in the log file for one of the genomes is:

Processing reference genome GPD_576
Found central regions. They are located in: /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/central_regions/
Running BLAST search for each region of the central regions...
All processes in batch number 1 finished successfully

BLAST search for all the regions finished successfully

Removing the temporary folder /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/tmp/

Starting synteny analysis for genome GPD_576

Running the following Rscript command:
Rscript /scratch/hb-llnext/conda_envs/SynTracker_env/SynTracker/syntracker_R_scripts/SynTracker.R GPD_576 /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/combined_targets/sample_names_dictionary.tab /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/blastdbcmd_output/ /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/final_output/ /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/summary_output/ /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/R_temp/ /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/R_intermediate_objects/ 1 False 128 /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/SynTracker_log.txt

Found 12 regions that haven't been processed yet

Going to run synteny analysis using Decipher for 12 regions...

Synteny analysis for region GPD_576_45000_46000 finished successfully

Synteny scores calculation for region GPD_576_45000_46000 finished successfully

Synteny analysis for region GPD_576_15000_16000 finished successfully

Synteny scores calculation for region GPD_576_15000_16000 finished successfully

Synteny analysis for region GPD_576_20000_21000 finished successfully

Synteny scores calculation for region GPD_576_20000_21000 finished successfully

Synteny analysis for region GPD_576_5000_6000 finished successfully

Synteny scores calculation for region GPD_576_5000_6000 finished successfully

Synteny analysis for region GPD_576_10000_11000 finished successfully

Synteny scores calculation for region GPD_576_10000_11000 finished successfully

Synteny analysis for region GPD_576_40000_41000 finished successfully

Synteny scores calculation for region GPD_576_40000_41000 finished successfully

Synteny analysis for region GPD_576_30000_31000 finished successfully

Synteny scores calculation for region GPD_576_30000_31000 finished successfully

Synteny analysis for region GPD_576_55000_56000 finished successfully

Synteny scores calculation for region GPD_576_55000_56000 finished successfully

Synteny analysis for region GPD_576_25000_26000 finished successfully

Synteny scores calculation for region GPD_576_25000_26000 finished successfully

Synteny analysis for region GPD_576_50000_51000 finished successfully

Synteny scores calculation for region GPD_576_50000_51000 finished successfully

Synteny analysis for region GPD_576_35000_36000 finished successfully

Synteny scores calculation for region GPD_576_35000_36000 finished successfully

Synteny analysis for region GPD_576_0_1000 finished successfully

Synteny scores calculation for region GPD_576_0_1000 finished successfully

Number of processed regions (not necessarily valid:) 12 

Total number of processed regions in current run and previous runs (not necessarily valid:) 12 

The processing of genome GPD_576 completed successfully

The tool seems to run without any problems but:

There seems to be a problem when computaing the average scores (although no error is reported). Would you recommend changing/adding any parameters?

In this case, as a test, I used 100 reference viral genomes (and ~400 assemblies) and only 12 of the genome output folders have actually computed results (I guess most of the viruses are simply not present in any sample or only in 1 of them and no comparison is possible - is that correct?)

Best, Asier

henav commented 2 days ago

Hi Asier, Thanks for using SynTracker!

I’ll start the second issue – yes you are correct. Most viruses are not abundant enough to be compared across samples. Probably these less abundant viruses were only detected in one (or less) samples.

As for your first issue – not getting any of the “avg_synteny_scores[subsampling_length]_regions.csv” files: the reason for that is that the total length, even with the lowest subsampling value (i.e., number of regions), is still longer than your viral genomes. Therefore, even if a genome is detected and compared, it is excluded from the final output, as the number of available regions is still too low. In those cases, when analyzing very short genomes, the[--avg_all] flag should be used. Then, the tool will generate another type of final per-genome files, based on all available regions. 12 in the example you posted. Thanks for directing me to that, We will make this option a default.

If running the tool again will take too much time, please let me know and I'll help with an alternative solution. Cheers, Hagay

asierFernandezP commented 11 hours ago

Thank you Hagay!

I will run then adding the -avg_all option. Regarding the time, as I have quite a lot of metagenomic samples and viral genomes, it takes indeed a long time to run. Would it be possible to run it in batches (e.g. instead of 1000 viral genomes against all metagenomes, compare batches of 100 genomes vs all metagenomes to speed up the process, and then merge the results?). As far as I understand, no comparisons are made between different viral genomes so this should not affect the final results. Am I correct?