alxsimon opened this issue 3 years ago
Hi!
What is the size of your dataset? it seems more likely it is too large for your ram. Could you try to subsample to 500 contigs with phylopreprocess?
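For illustration only — this is not phylopreprocess itself (I'm not reproducing its flags here), just a minimal Python sketch of what a random subsample to 500 contigs amounts to, with a toy dict standing in for a parsed assembly:

```python
import random

def subsample(records, n, seed=0):
    """Pick n records at random (keys sorted first so the draw is reproducible)."""
    rng = random.Random(seed)
    keep = rng.sample(sorted(records), k=min(n, len(records)))
    return {name: records[name] for name in keep}

# Toy stand-in for a parsed assembly: 10 contigs instead of 100k
toy = {f"contig_{i}": "ACGT" * 25 for i in range(10)}
subset = subsample(toy, 4)
print(len(subset))  # 4
```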
Best, Ludovic
My genome is around 2 Gbp with 100k scaffolds, and I have 512 GB of RAM.
You are right about the RAM limitation; a subsample of 500 scaffolds works (with some deprecation messages).
Would the command work without the --large option? I tried it, but it wasn't finished after 24h, so I abandoned that run.
Thanks, Alexis
I suspected as much. The genome size itself would not be an issue even if it were several times larger, but 100k scaffolds is uncharted territory for phyloligo. I have obtained results for up to 1000-2000 scaffolds over a few days on a terabyte machine, although RAM was not the issue for me so much as computation time. Remember that this implementation is an n*n comparison, so 100k² is a lot to store and process. I doubt removing --large would help.
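To put rough numbers on that n*n point (my own back-of-the-envelope, not phyloligo internals): a dense float64 distance matrix grows quadratically, so going from 2k to 100k scaffolds costs 2500x the storage:

```python
def matrix_gib(n, bytes_per_cell=8):
    """Size in GiB of a dense n x n matrix of float64 distances."""
    return n * n * bytes_per_cell / 1024**3

print(f"{matrix_gib(2_000):.3f} GiB")    # ~0.030 GiB for 2k scaffolds
print(f"{matrix_gib(100_000):.1f} GiB")  # ~74.5 GiB for 100k scaffolds
```

And that is only the matrix itself, before any of the clustering done on top of it.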
On the biological side, 2 Gbp in 100k scaffolds might suggest high repeat content such as transposable elements, or problems with coverage or assembly.
On your 500-scaffold run, you might already be able to distinguish contaminants. If not, try several random subsamplings. I could help, if needed, to manually identify potential contaminants or species neighborhoods with the database of our previous server GOHTAM, which is now unfortunately offline.
If you do find contaminants, my advice is to gather their sequences and tag them (we call these prototypes). Then scan the whole assembly 1000 contigs at a time (only 100 launches, not too tough), and whenever new scaffolds cluster with your tagged contaminant sequences, bin them out.
I'm afraid this large-genome procedure and prototype approach were never implemented as an automated pipeline; if you feel like contributing, be my guest! Ideally, prototype binning (aggregating new scaffolds) should be done iteratively, turning the n*n comparison into a more feasible n*m, where m is the number of distinct prototypes, corresponding to the number of species. On our data we found evidence for hybrid scaffolds, which defeats the safety of that shortcut, hence the fallback to subsampling and an iterative approach.
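A hypothetical sketch of that iterative n*m idea — nothing here is phyloligo's actual API; the profile function, distance, and threshold are all illustrative stand-ins. Each incoming scaffold is compared only against the m prototype profiles, and scaffolds close to no prototype stay unassigned:

```python
from collections import Counter

def kmer_profile(seq, k=4):
    """K-mer frequency profile of a sequence (toy version, forward strand only)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def distance(p, q):
    """L1 (Manhattan) distance between two frequency profiles."""
    keys = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

def assign(scaffold_seq, prototypes, threshold=0.5):
    """Closest prototype label, or None if no prototype is close enough."""
    prof = kmer_profile(scaffold_seq)
    label, best = None, threshold
    for name, proto in prototypes.items():
        d = distance(prof, proto)
        if d < best:
            label, best = name, d
    return label

# Toy prototypes: one "host" and one "conta" compositional signature
prototypes = {"host": kmer_profile("ACGT" * 300),
              "conta": kmer_profile("AATT" * 300)}
print(assign("ACGT" * 100, prototypes))  # host
```

Each batch of 1000 scaffolds only costs 1000*m comparisons this way, instead of growing the all-vs-all matrix.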
cheers, Ludovic
Thank you.
To give a little more context, I am working on the assembly of 3 species of mussels. Two of them do not have much contamination, and their genome size is as expected: around 1.6 Gbp split into around 60k scaffolds (you may be right about TEs and repeats, which are particularly abundant in molluscs).
The third assembly gave me a size of 2 Gbp and 100k scaffolds, as I said, and it is clear from the Blobtoolkit analysis that I have a large amount of contamination, amounting to around 400 Mb of sequence.
Would this be a viable approach?:
Yes of course! This is the intended purpose. contalocate will work even if the sampling of contaminants is partial or incomplete, or only a small fraction of the whole. The idea is that k-mer composition is somewhat conserved within a species, so you could probably identify most if not all of the contaminant with a prototype containing less than 5-10% of the contaminant's sequences (depending on the species, of course).
Given your numbers, you expect 40k contaminant scaffolds out of 100k, so I bet your 500-scaffold random subsample already contains enough contaminant sequence to build a prototype, bearing in mind that you might have several contaminants and might have to subdivide them into several prototypes.
On the 500, relaunch without --large so you can use contaselect.R, which can be helpful for extracting prototypes interactively.
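A quick sanity check of those numbers (my arithmetic, assuming the ~40% contamination estimate above): a random draw of 500 from 100k scaffolds should contain about 200 contaminant scaffolds, and the chance of getting too few to work with is negligible:

```python
from math import comb

n_total, n_conta, sample = 100_000, 40_000, 500
expected = sample * n_conta / n_total
print(expected)  # 200.0 contaminant scaffolds expected

# Binomial approximation of the hypergeometric draw:
# probability of seeing fewer than 50 contaminants in the subsample
p = n_conta / n_total
p_lt_50 = sum(comb(sample, k) * p**k * (1 - p)**(sample - k) for k in range(50))
print(f"P(<50 contaminants) = {p_lt_50:.1e}")
```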
Let me know, it always gets interesting!
Thank you for your help, I will try that.
Hi, I wanted to try the phyloligo method on my data but encountered the following error.
I installed it through conda.
I obtain the error when using `--large memmap`.