alxsimon opened this issue 3 years ago
Hi!
What is the size of your dataset? it seems more likely it is too large for your ram. Could you try to subsample to 500 contigs with phylopreprocess?
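For illustration only — this is not phylopreprocess itself (I'm not reproducing its flags here), just a minimal Python sketch of what a random subsample to 500 contigs amounts to, with a toy dict standing in for a parsed assembly:

```python
import random

def subsample(records, n, seed=0):
    """Pick n records at random (keys sorted first so the draw is reproducible)."""
    rng = random.Random(seed)
    keep = rng.sample(sorted(records), k=min(n, len(records)))
    return {name: records[name] for name in keep}

# Toy stand-in for a parsed assembly: 10 contigs instead of 100k
toy = {f"contig_{i}": "ACGT" * 25 for i in range(10)}
subset = subsample(toy, 4)
print(len(subset))  # 4
```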
Best, Ludovic
My genome is around 2 Gbp with 100k scaffolds, and I have 512 GB of RAM.
You are right about the RAM limitation; a subsample of 500 scaffolds works (with some deprecation messages).
Would the command work without the --large option? I tried it, but it wasn't finished after 24h, so I abandoned that run.
Thanks, Alexis
I suspected as much. The genome size itself would not be an issue even if it were several times larger, but 100k scaffolds is uncharted territory for phyloligo. I have obtained results for up to 1000-2000 scaffolds over a few days on a terabyte machine, although RAM was not the issue for me so much as computation time. Remember that this implementation is an n*n comparison, so 100k² is a lot to store and process. I doubt removing --large would help.
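To put rough numbers on that n*n point (my own back-of-the-envelope, not phyloligo internals): a dense float64 distance matrix grows quadratically, so going from 2k to 100k scaffolds costs 2500x the storage:

```python
def matrix_gib(n, bytes_per_cell=8):
    """Size in GiB of a dense n x n matrix of float64 distances."""
    return n * n * bytes_per_cell / 1024**3

print(f"{matrix_gib(2_000):.3f} GiB")    # ~0.030 GiB for 2k scaffolds
print(f"{matrix_gib(100_000):.1f} GiB")  # ~74.5 GiB for 100k scaffolds
```

And that is only the matrix itself, before any of the clustering done on top of it.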
On the biological side, 2 Gbp in 100k scaffolds might suggest high repeat content such as transposable elements, or problems with coverage or assembly.
On your 500-scaffold run, you might already be able to distinguish contaminants. If not, try several random subsamplings. I could help, if needed, to manually identify potential contaminants or species neighborhoods with the database of our previous server GOHTAM, which is now unfortunately offline.
If you do find contaminants, my advice is to gather their sequences and tag them (we call these prototypes). Then scan the whole assembly 1000 contigs at a time (only 100 launches, not too tough), and whenever new scaffolds cluster with your tagged contaminant sequences, bin them out.
I'm afraid this large-genome procedure and prototype approach were never implemented as an automated pipeline; if you feel like contributing, be my guest! Ideally, prototype binning (aggregating new scaffolds) should be done iteratively, turning the n*n comparison into a more feasible n*m, where m is the number of distinct prototypes, corresponding to the number of species. On our data we found evidence for hybrid scaffolds, which defeats the safety of that shortcut, hence the fallback to subsampling and an iterative approach.
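A hypothetical sketch of that iterative n*m idea — nothing here is phyloligo's actual API; the profile function, distance, and threshold are all illustrative stand-ins. Each incoming scaffold is compared only against the m prototype profiles, and scaffolds close to no prototype stay unassigned:

```python
from collections import Counter

def kmer_profile(seq, k=4):
    """K-mer frequency profile of a sequence (toy version, forward strand only)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def distance(p, q):
    """L1 (Manhattan) distance between two frequency profiles."""
    keys = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

def assign(scaffold_seq, prototypes, threshold=0.5):
    """Closest prototype label, or None if no prototype is close enough."""
    prof = kmer_profile(scaffold_seq)
    label, best = None, threshold
    for name, proto in prototypes.items():
        d = distance(prof, proto)
        if d < best:
            label, best = name, d
    return label

# Toy prototypes: one "host" and one "conta" compositional signature
prototypes = {"host": kmer_profile("ACGT" * 300),
              "conta": kmer_profile("AATT" * 300)}
print(assign("ACGT" * 100, prototypes))  # host
```

Each batch of 1000 scaffolds only costs 1000*m comparisons this way, instead of growing the all-vs-all matrix.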
cheers, Ludovic
Thank you.
To give a little more context, I am working on the assembly of 3 species of mussels. Two of them do not have much contamination, and their genome size is as expected: around 1.6 Gbp split into around 60k scaffolds (you may be right about TEs and repeats, which are particularly abundant in molluscs).
The third assembly gave me a size of 2 Gbp and 100k scaffolds, as I said, and it is clear from the Blobtoolkit analysis that I have a large amount of contamination, amounting to around 400 Mb of sequence.
Would this be a viable approach?:
Yes of course! This is the intended purpose. contalocate will work even if the sampling of contaminants is partial or incomplete, or only a small fraction of the whole. The idea is that k-mer composition is somewhat conserved within a species, so you could probably identify most if not all of the contaminant with a prototype containing less than 5-10% of the contaminant's sequences (depending on the species, of course).
Given your numbers, you expect 40k contaminant scaffolds out of 100k, so I bet your 500-scaffold random subsample already contains enough contaminant sequence to build a prototype, bearing in mind that you might have several contaminants and might have to subdivide them into several prototypes.
On the 500, relaunch without --large so you can use contaselect.R, which can be helpful for extracting prototypes interactively.
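A quick sanity check of those numbers (my arithmetic, assuming the ~40% contamination estimate above): a random draw of 500 from 100k scaffolds should contain about 200 contaminant scaffolds, and the chance of getting too few to work with is negligible:

```python
from math import comb

n_total, n_conta, sample = 100_000, 40_000, 500
expected = sample * n_conta / n_total
print(expected)  # 200.0 contaminant scaffolds expected

# Binomial approximation of the hypergeometric draw:
# probability of seeing fewer than 50 contaminants in the subsample
p = n_conta / n_total
p_lt_50 = sum(comb(sample, k) * p**k * (1 - p)**(sample - k) for k in range(50))
print(f"P(<50 contaminants) = {p_lt_50:.1e}")
```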
Let me know, it always gets interesting!
Thank you for your help, I will try that.
Hi, I wanted to try the phyloligo method on my data but encountered the following error.
I installed it through conda.
I obtain the error when using `--large memmap`.