leylabmpi / SynTracker

SynTracker is a pipeline designed to determine the biological relatedness of conspecific strains (microbial strains of the same species) using genome synteny.

No avg_synteny_scores files are generated #9

Closed asierFernandezP closed 1 month ago

asierFernandezP commented 1 month ago

Dear all,

First of all thanks for developing this tool. I am currently trying to run it using a set of reference viral genomes (identified from gut metagenomes) and the metagenomic contigs from multiple samples (as the target genomes).

The command that I used is the following:

# Run SynTracker with default parameters 
python /scratch/hb-llnext/conda_envs/SynTracker_env/SynTracker/syntracker.py \
    -target ${contig_dir} \
    -ref ${reference_genomes_dir} \
    -out ${output_dir} \
    --identity 97 \
    --coverage 70 \
    -length 5000

The example output in the log file for one of the genomes is:

Processing reference genome GPD_576
Found central regions. They are located in: /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/central_regions/
Running BLAST search for each region of the central regions...
All processes in batch number 1 finished successfully

BLAST search for all the regions finished successfully

Removing the temporary folder /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/tmp/

Starting synteny analysis for genome GPD_576

Running the following Rscript command:
Rscript /scratch/hb-llnext/conda_envs/SynTracker_env/SynTracker/syntracker_R_scripts/SynTracker.R GPD_576 /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/combined_targets/sample_names_dictionary.tab /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/blastdbcmd_output/ /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/final_output/ /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/summary_output/ /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/R_temp/ /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/GPD_576/R_intermediate_objects/ 1 False 128 /scratch/hb-llnext/PentaChiliadal/DOWNSTREAM_ANALYSIS/8_STRAIN_TRANSMISSION/SYNTRACKER/SYNTRACKER_RESULTS/SynTracker_log.txt

Found 12 regions that haven't been processed yet

Going to run synteny analysis using Decipher for 12 regions...

Synteny analysis for region GPD_576_45000_46000 finished successfully

Synteny scores calculation for region GPD_576_45000_46000 finished successfully

Synteny analysis for region GPD_576_15000_16000 finished successfully

Synteny scores calculation for region GPD_576_15000_16000 finished successfully

Synteny analysis for region GPD_576_20000_21000 finished successfully

Synteny scores calculation for region GPD_576_20000_21000 finished successfully

Synteny analysis for region GPD_576_5000_6000 finished successfully

Synteny scores calculation for region GPD_576_5000_6000 finished successfully

Synteny analysis for region GPD_576_10000_11000 finished successfully

Synteny scores calculation for region GPD_576_10000_11000 finished successfully

Synteny analysis for region GPD_576_40000_41000 finished successfully

Synteny scores calculation for region GPD_576_40000_41000 finished successfully

Synteny analysis for region GPD_576_30000_31000 finished successfully

Synteny scores calculation for region GPD_576_30000_31000 finished successfully

Synteny analysis for region GPD_576_55000_56000 finished successfully

Synteny scores calculation for region GPD_576_55000_56000 finished successfully

Synteny analysis for region GPD_576_25000_26000 finished successfully

Synteny scores calculation for region GPD_576_25000_26000 finished successfully

Synteny analysis for region GPD_576_50000_51000 finished successfully

Synteny scores calculation for region GPD_576_50000_51000 finished successfully

Synteny analysis for region GPD_576_35000_36000 finished successfully

Synteny scores calculation for region GPD_576_35000_36000 finished successfully

Synteny analysis for region GPD_576_0_1000 finished successfully

Synteny scores calculation for region GPD_576_0_1000 finished successfully

Number of processed regions (not necessarily valid:) 12 

Total number of processed regions in current run and previous runs (not necessarily valid:) 12 

The processing of genome GPD_576 completed successfully

The tool seems to run without any problems, but:

1. There seems to be a problem when computing the average scores (although no error is reported). Would you recommend changing/adding any parameters?

2. In this case, as a test, I used 100 reference viral genomes (and ~400 assemblies), and only 12 of the genome output folders actually contain computed results. I guess most of the viruses are simply not present in any sample, or present in only one, so no comparison is possible - is that correct?

Best, Asier

henav commented 1 month ago

Hi Asier, Thanks for using SynTracker!

I'll start with the second issue - yes, you are correct. Most viruses are not abundant enough to be compared across samples; these less abundant viruses were probably detected in only one sample (or none).

As for your first issue - not getting any of the "avg_synteny_scores[subsampling_length]_regions.csv" files: the reason is that even the lowest subsampling value (i.e., the smallest number of sampled regions) adds up to a total length that is still longer than your viral genomes. Therefore, even if a genome is detected and compared, it is excluded from the final output, because the number of available regions is too low. In such cases, when analyzing very short genomes, the [--avg_all] flag should be used. The tool will then generate another type of final per-genome file, based on all available regions (12 in the example you posted). Thanks for directing me to that - we will make this option a default.
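For reference, a minimal sketch of what the rerun could look like, assuming the [--avg_all] flag spelling from this comment and reusing the paths and thresholds from the original command (a hedged example, not prescribed usage):

# Rerun SynTracker with --avg_all so that per-genome averages are computed
# over all available regions (flag taken from the comment above;
# paths and thresholds reused from the original run)
python /scratch/hb-llnext/conda_envs/SynTracker_env/SynTracker/syntracker.py \
    -target ${contig_dir} \
    -ref ${reference_genomes_dir} \
    -out ${output_dir} \
    --identity 97 \
    --coverage 70 \
    -length 5000 \
    --avg_all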

If running the tool again would take too much time, please let me know and I'll help with an alternative solution. Cheers, Hagay

asierFernandezP commented 1 month ago

Thank you Hagay!

Adding the --avg_all option does indeed seem to solve this issue!

Regarding the time: as I have quite a lot of metagenomic samples and viral genomes, it does indeed take a long time to run. I was considering running it in batches (e.g., instead of 1000 viral genomes against all metagenomes, compare batches of 100 genomes vs all metagenomes to speed up the process, and then merge the results). As far as I understand, splitting the reference genomes into batches would not affect the results, right? However, this process would still be quite slow, as the BLAST DB would need to be generated again for each of the batches. Is there a better way to do this?

henav commented 1 month ago

Hi Asier, Great, I'm happy that the issue is solved! In terms of batching - yes, this is a very good solution, and it should be executed just as you described: take batches of reference genomes and run them against the entire collection of samples. For the typical user (i.e., analyzing bacterial genomes) it wouldn't necessarily improve the performance a lot, but in your case (short genomes), without going into too much detail, I expect to see a significant improvement. Cheers, Hagay
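A minimal sketch of such a batched run, assuming one FASTA file per reference genome with a .fasta extension (an assumption for illustration) and using only the flags already shown in this thread:

# Split the reference genomes into batches of 100
# (assumes one .fasta file per genome; adjust the pattern to your naming)
batch_size=100
i=0
batch=0
for genome in ${reference_genomes_dir}/*.fasta; do
    if (( i % batch_size == 0 )); then
        batch=$(( batch + 1 ))
        mkdir -p batches/batch_${batch}
    fi
    cp "$genome" batches/batch_${batch}/
    i=$(( i + 1 ))
done

# Run SynTracker once per batch against the full set of target contigs,
# writing each batch to its own output directory
for dir in batches/batch_*; do
    python /scratch/hb-llnext/conda_envs/SynTracker_env/SynTracker/syntracker.py \
        -target ${contig_dir} \
        -ref "$dir" \
        -out ${output_dir}/$(basename "$dir") \
        --identity 97 \
        --coverage 70 \
        -length 5000 \
        --avg_all
done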

asierFernandezP commented 1 month ago

Hi Hagay,

As I mentioned, when running it in batches there is the problem that the BLAST DB would need to be generated for all the genomes in each separate job/batch, which is not feasible in terms of space/time when dealing with a large number of genomes, as in my case. Could this DB be generated in advance and passed as an argument, instead of having to generate it every time?

henav commented 1 month ago

Hi, Yes, the size of multiple BLAST DBs could indeed be a limitation. We are working on a solution for that, but with the current version it is not possible to use a pre-generated BLAST DB. Sorry about that...

henav commented 3 weeks ago

Hi Asier,

Following our discussion, we created a new version of SynTracker. Please take a look at it: it allows creating the BLAST DB only once and then running the tool in separate runs without recreating the database. I think that in your case it would really help.

Cheers, Hagay

asierFernandezP commented 3 weeks ago

Thanks! It is indeed really helpful.

Best, Asier