Closed asierFernandezP closed 1 month ago
Hi Asier, Thanks for using SynTracker!
I’ll start with the second issue: yes, you are correct. Most viruses are not abundant enough to be compared across samples. These less abundant viruses were probably detected in only one sample, or in none.
As for your first issue, not getting any of the “avg_synteny_scores[subsampling_length]_regions.csv” files: the reason is that the total length covered by even the lowest subsampling value (i.e., number of regions) is still longer than your viral genomes. Therefore, even if a genome is detected and compared, it is excluded from the final output, as the number of available regions is still too low. In such cases, when analyzing very short genomes, the --avg_all flag should be used. The tool will then generate another type of final per-genome file, based on all available regions (12 in the example you posted). Thanks for pointing this out; we will make this option the default.
If running the tool again will take too much time, please let me know and I'll help with an alternative solution. Cheers, Hagay
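To make the exclusion described above concrete, here is a minimal sketch of the logic as explained in this reply. It is not SynTracker's actual code: the function name and the list of subsampling lengths are assumptions chosen for illustration, and the value 12 simply mirrors the number quoted in the reply.

```python
def subsampling_outputs(available_regions, subsampling_lengths):
    """Return the subsampling lengths a genome has enough regions for.

    An empty list means no per-subsampling-length output would be produced,
    which is the situation described for very short genomes.
    """
    return [n for n in subsampling_lengths if available_regions >= n]


# Hypothetical subsampling lengths (number of regions); an assumption,
# not values taken from the tool.
ASSUMED_LENGTHS = [40, 60, 80, 100, 200]

short_genome = subsampling_outputs(12, ASSUMED_LENGTHS)
longer_genome = subsampling_outputs(90, ASSUMED_LENGTHS)

print(short_genome)   # []
print(longer_genome)  # [40, 60, 80]
```

Under these assumptions, a genome with 12 available regions qualifies for none of the subsampling lengths, which is why only an average over all available regions can yield a result for it.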
Thank you Hagay!
Running it with the --avg_all option does indeed seem to solve this issue!
Regarding the time, as I have quite a lot of metagenomic samples and viral genomes, it does indeed take a long time to run. I was considering running it in batches (e.g., instead of 1000 viral genomes against all metagenomes, comparing batches of 100 genomes vs. all metagenomes to speed up the process, and then merging the results). As far as I understand, splitting the reference genomes into batches would not affect the results, right? However, this process would still be quite slow, as the BLAST DB would need to be generated again for each of the batches. Is there a better way to do this?
Hi Asier, Great, I'm happy that the issue is solved! In terms of batching: yes, this is a very good solution, and it should be executed just as you described. Take batches of reference genomes, and run them against the entire collection of samples. For the typical user (i.e., analyzing bacterial genomes) it wouldn't necessarily improve the performance a lot, but in your case (short genomes), without going into too much detail, I expect to see a significant improvement. Cheers, Hagay
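The batching approach discussed above could be sketched roughly as follows. This is an illustrative helper, not part of SynTracker: the function names, the file names, and the assumption that per-batch result tables are CSV files sharing one header are all hypothetical. Running the tool on each batch is deliberately left out, since the exact command is not shown in this thread.

```python
import csv


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def merge_csvs(csv_paths, merged_path):
    """Concatenate CSV files that share a header, writing the header once."""
    header = None
    with open(merged_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in csv_paths:
            with open(path, newline="") as handle:
                reader = csv.reader(handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {path}")
                writer.writerows(reader)


# Hypothetical reference genome file names: 1000 genomes in batches of 100.
genomes = [f"genome_{i}.fasta" for i in range(1000)]
batches = make_batches(genomes, 100)
print(len(batches), len(batches[0]))  # 10 100
```

Each batch would then be run against the full set of samples, and the per-batch tables combined with something like `merge_csvs` afterwards.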
Hi Hagay,
As I mentioned, when running it in batches there is the problem that the BLAST DB would need to be generated for all the genomes in each separate job/batch, which is not feasible in terms of space and time when dealing with a large number of genomes, as is the case for me. Could this DB be generated in advance and passed as an argument instead of having to be generated every time?
Hi, Yes, the size of multiple BLAST DBs could indeed be a limitation. We are working on a solution to that, but with the current version it is not possible to use a pre-generated BLAST DB. Sorry about that...
Hi Asier,
Following our discussion, we created a new version of SynTracker. Please take a look at it, as it allows creating the BLAST DB only once and then running the tool in separate runs without recreating the database. I think it would really help in your case.
Cheers, Hagay
Thanks! It is indeed really helpful.
Best, Asier
Dear all,
First of all thanks for developing this tool. I am currently trying to run it using a set of reference viral genomes (identified from gut metagenomes) and the metagenomic contigs from multiple samples (as the target genomes).
The command that I used is the following:
The example output in the log file for one of the genomes is:
The tool seems to run without any problems but:
There seems to be a problem when computing the average scores (although no error is reported). Would you recommend changing/adding any parameters?
In this case, as a test, I used 100 reference viral genomes (and ~400 assemblies), and only 12 of the genome output folders actually contain computed results. I guess most of the viruses are simply not present in any sample, or are present in only 1 of them so that no comparison is possible. Is that correct?
Best, Asier