itsmeludo / PhylOligo

Bioinformatics / Explore oligonucleotide composition similarity between assembly contigs or scaffolds to detect contaminant DNA.
GNU General Public License v3.0
10 stars 2 forks source link

Interpretation of results: phyloselect.py -t -m hdbscan versus kmedoids #10

Closed ckeeling closed 5 years ago

ckeeling commented 5 years ago

Hello,

I used the preprocessor script to remove scaffolds less than 1kb, giving 49608 scaffolds.

My analysis using hdbscan resulted in 152 clusters that appear to be further clustered into 4 major clusters. How would you interpret this? I.e. does this suggest 4 organisms might be in the assembly? How does one identify which of the 151 clusters are in each major cluster? I.e. the cluster by x/y info for this graph?

data_tsne_reduc

Cluster sizes:

data_fasta_cl0.fa:0
data_fasta_cl1.fa:8
data_fasta_cl10.fa:42
data_fasta_cl100.fa:30
data_fasta_cl101.fa:0
data_fasta_cl102.fa:0
data_fasta_cl103.fa:0
data_fasta_cl104.fa:0
data_fasta_cl105.fa:0
data_fasta_cl106.fa:0
data_fasta_cl107.fa:426
data_fasta_cl108.fa:24
data_fasta_cl109.fa:61
data_fasta_cl11.fa:87
data_fasta_cl110.fa:15
data_fasta_cl111.fa:20
data_fasta_cl112.fa:127
data_fasta_cl113.fa:145
data_fasta_cl114.fa:21
data_fasta_cl115.fa:49
data_fasta_cl116.fa:22
data_fasta_cl117.fa:12
data_fasta_cl118.fa:9
data_fasta_cl119.fa:126
data_fasta_cl12.fa:39
data_fasta_cl120.fa:935
data_fasta_cl121.fa:24
data_fasta_cl122.fa:98
data_fasta_cl123.fa:270
data_fasta_cl124.fa:24
data_fasta_cl125.fa:73
data_fasta_cl126.fa:19
data_fasta_cl127.fa:0
data_fasta_cl128.fa:12
data_fasta_cl129.fa:46
data_fasta_cl13.fa:25
data_fasta_cl130.fa:202
data_fasta_cl131.fa:14
data_fasta_cl132.fa:0
data_fasta_cl133.fa:0
data_fasta_cl134.fa:0
data_fasta_cl135.fa:96
data_fasta_cl136.fa:7
data_fasta_cl137.fa:7
data_fasta_cl138.fa:45
data_fasta_cl139.fa:17
data_fasta_cl14.fa:95
data_fasta_cl140.fa:14
data_fasta_cl141.fa:253
data_fasta_cl142.fa:73
data_fasta_cl143.fa:74
data_fasta_cl144.fa:90
data_fasta_cl145.fa:43
data_fasta_cl146.fa:415
data_fasta_cl147.fa:222
data_fasta_cl148.fa:61
data_fasta_cl149.fa:162
data_fasta_cl15.fa:11
data_fasta_cl150.fa:518
data_fasta_cl151.fa:43
data_fasta_cl16.fa:8
data_fasta_cl17.fa:5
data_fasta_cl18.fa:9
data_fasta_cl19.fa:8
data_fasta_cl2.fa:11
data_fasta_cl20.fa:46
data_fasta_cl21.fa:15
data_fasta_cl22.fa:9
data_fasta_cl23.fa:12
data_fasta_cl24.fa:8
data_fasta_cl25.fa:197
data_fasta_cl26.fa:53
data_fasta_cl27.fa:9
data_fasta_cl28.fa:12
data_fasta_cl29.fa:9
data_fasta_cl3.fa:26
data_fasta_cl30.fa:59
data_fasta_cl31.fa:9
data_fasta_cl32.fa:18
data_fasta_cl33.fa:119
data_fasta_cl34.fa:18
data_fasta_cl35.fa:11
data_fasta_cl36.fa:91
data_fasta_cl37.fa:8
data_fasta_cl38.fa:28
data_fasta_cl39.fa:377
data_fasta_cl4.fa:32
data_fasta_cl40.fa:17
data_fasta_cl41.fa:97
data_fasta_cl42.fa:28
data_fasta_cl43.fa:111
data_fasta_cl44.fa:12
data_fasta_cl45.fa:7
data_fasta_cl46.fa:62
data_fasta_cl47.fa:11
data_fasta_cl48.fa:7
data_fasta_cl49.fa:260
data_fasta_cl5.fa:13
data_fasta_cl50.fa:5
data_fasta_cl51.fa:55
data_fasta_cl52.fa:8
data_fasta_cl53.fa:11
data_fasta_cl54.fa:7
data_fasta_cl55.fa:84
data_fasta_cl56.fa:6
data_fasta_cl57.fa:8
data_fasta_cl58.fa:6
data_fasta_cl59.fa:41
data_fasta_cl6.fa:7
data_fasta_cl60.fa:10
data_fasta_cl61.fa:7
data_fasta_cl62.fa:10
data_fasta_cl63.fa:6
data_fasta_cl64.fa:176
data_fasta_cl65.fa:33
data_fasta_cl66.fa:7
data_fasta_cl67.fa:19
data_fasta_cl68.fa:49
data_fasta_cl69.fa:7
data_fasta_cl7.fa:83
data_fasta_cl70.fa:10
data_fasta_cl71.fa:6
data_fasta_cl72.fa:1
data_fasta_cl73.fa:0
data_fasta_cl74.fa:7
data_fasta_cl75.fa:18
data_fasta_cl76.fa:13
data_fasta_cl77.fa:19
data_fasta_cl78.fa:8
data_fasta_cl79.fa:42
data_fasta_cl8.fa:47
data_fasta_cl80.fa:46
data_fasta_cl81.fa:24
data_fasta_cl82.fa:663
data_fasta_cl83.fa:6
data_fasta_cl84.fa:25
data_fasta_cl85.fa:10
data_fasta_cl86.fa:113
data_fasta_cl87.fa:19
data_fasta_cl88.fa:585
data_fasta_cl89.fa:588
data_fasta_cl9.fa:10
data_fasta_cl90.fa:1001
data_fasta_cl91.fa:19
data_fasta_cl92.fa:151
data_fasta_cl93.fa:392
data_fasta_cl94.fa:19
data_fasta_cl95.fa:93
data_fasta_cl96.fa:19
data_fasta_cl97.fa:255
data_fasta_cl98.fa:34
data_fasta_cl99.fa:148
data_fasta_unclust.fa:505

When I run it using kmedoids, I get only 8 clusters, but the graph looks quite similar in structure, but the clusters seem to be dispersed between the major clusters.

data_tsne_reduc kmedoids

data_fasta_cl0.fa:4580
data_fasta_cl1.fa:2830
data_fasta_cl2.fa:4882
data_fasta_cl3.fa:2655
data_fasta_cl4.fa:6098
data_fasta_cl5.fa:3599
data_fasta_cl6.fa:2673
data_fasta_cl7.fa:4243
itsmeludo commented 5 years ago

Hi, the shape of the clusters is odd. They are a bit too perfect and equally distant. May I ask what was the initial sequencing? The clustering not working beautifully is quite common. I would suggest using an Eucledean distance and either increasing the kmer size (to 5 or 6 be carefull of the computing time) or using spaced-words. May I ask for a small subset of your data and check on my installation?

If the shape of the clusters are real, I would assume noisy data somehow equally distant in the kmer hyperspace. If not an artifact I could suspect high repeat content and maybe, if not contamination, symbiosis or hybridization. Eukaryotic organism? plant? The horizontal line beetwen the two clusters could be chimera scaffolds, into which you would have put contigs for 2 species. My follow up would be to aggregate the 151 hdbscan clusters into 4 by making a pairwise distance matrix of the clusters. either with R or with a trick in phyloligo: by concatenating all the sequences of a cluster in a pseudo scaffold and making the pairwaise distance matrix of the 151 pseudoscaffolds in phyloligo and phyloselect.R. after that, I would take the initial scaffold sequences and group them into 4 clusters according to the cluster clusterisation.

In the past we used our webserver GOHTAM to tell which species each cluster looked like or was close to. Unfortunately, it is not available anymore. I still have the database, so I could run the analysis and tell you which is what with only a fraction of your dataset. If so please contact me on my academic adress.

Cheers, ludovic.

ckeeling commented 5 years ago

Thank you for your comments and suggestions @itsmeludo. Before seeing your comments, I tried using a higher scaffold length cut-off in the preprocessing. With 10,000 cut-off, there is only 4 clusters (one major, one minor, and two trace). Looks much more reasonable.

e.g. data_tsne_reduc

With 5,000 cut-off, there is 13 clusters, but again one major, one minor, and just more trace clusters, although the unclust set is the largest.

data_tsne_reduc 5k_k4

I will explore the suggestion you make. Regards, Chris

itsmeludo commented 5 years ago

Hi, this looks much better, you definitely sequenced two types of organisms. Could also organellar DNA. Sometimes blast is enough to find what's what.

ckeeling commented 5 years ago

Thanks @itsmeludo, in fact three species are there (target, bacteria, and another eukaryote)! Using phyloselect.py rather than phyloselect.R (which I am struggling with libimf.so errors), is there a way to relax the stringency for cluster grouping such that there are fewer "unclustered" in the regions that surround the defined clusters?

itsmeludo commented 5 years ago

Hi Chris, in phyloselect.py there are aruments to control the clustering, for hdbscan there a minimum sample size and a minimum cluster size. Check the doc of hdbscan. There is also a interactive option to change the parameters on the fly and visualize the changes on the figure.

Anyway, phyloligo was designed to cope with approximative clustering. In fact, the next step is just to cure clusters from empty ones and overlapping ones. What will be performed in contaminate.R is to learn the kmer profile from one of the species and extract it from the whole data. The kmer profile is quite conserved in a species so only a fraction of the sequences are needed to sample the profile. So even with a poorly sensitive clustering ( though highly specific) you will split your dataset with a very high accuracy. You don't need to improve you clustering, because it is not the final output of the phyloligo pipeline.

If you have 3 sepcies, run contalocate.r successively.

You also may rerun a phyloselect.py on the subtracted data to check the efficiency.

I had to look very carefully for the 3rd species on your plot. It seems to appear in grey just under the legend "cl7" so it's not in a cluster. What to do in that case is to run a first contalocate.r with specifying your purple very diverse eukaryotic species. After subtraction, the remaining two species should be in clusters. Run another contalocate to split them and you should be done with the sorting of your genomes!

the red cluster could look like. Both could be bacteria or fungi but dependent on the species and their genomic micro composition variabily, I cannot tell from the graph.

ckeeling commented 5 years ago

OK, thanks @itsmeludo, it wasn't clear from the examples in the suppl. doc. that contralocate.R is run after phyloselect.py too (not just with phyloselect.R). The cluster for the third organism is the dark blue cluster (cl0) above the main yellow cluster (cl2) in the previous figure.

When I run contalocate.R, it halts after starting the gff file with the following error:

Error in if (ifelse(is.na(data[["Select_conta"]][i + 1]), test <- 0, test <- data[["Select_conta"]][i +  : 
  missing value where TRUE/FALSE needed
Execution halted

This code appears in line 178 of this script (and commented out in line 177). The .dist files appear to have been properly generated. Any suggestions on why I might be getting this error?

itsmeludo commented 5 years ago

Could you give your command line and minimal data so that I can reproduce the error?

I've never seen this error.

Le sam. 23 mars 2019 à 16:36, Chris Keeling notifications@github.com a écrit :

OK, thanks @itsmeludo https://github.com/itsmeludo, it wasn't clear from the examples in the suppl. doc. that contralocate.R is run after phyloselect.py too (not just with phyloselect.R). The cluster for the third organism is the dark blue cluster (cl0) above the main yellow cluster (cl2) in the previous figure.

When I run contalocate.R, it halts after starting the gff file with the following error:

Error in if (ifelse(is.na(data[["Select_conta"]][i + 1]), test <- 0, test <- data[["Select_conta"]][i + : missing value where TRUE/FALSE needed Execution halted

This code appears in line 178 of this script (and commented out in line 177). The .dist files appear to have been properly generated. Any suggestions on why I might be getting this error?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/itsmeludo/PhylOligo/issues/10#issuecomment-475879438, or mute the thread https://github.com/notifications/unsubscribe-auth/ARM684-YPfBxRsgFU_ULvDfwGWyyL27uks5vZkoPgaJpZM4b62Iy .

ckeeling commented 5 years ago

Here is the command. (filtered sequences larger than 100k for the assembly in this case)

contalocate.R -i preprocess_out_100k.fa -r target_training.fa -c contaminant_training.fa -W Out_KL/ -d KL

This is the command that is printed shortly after:

[1] "bash -lic \"ionice -c2 -n3 Kount.py -i preprocess_out_100k.fa -W Out_KL/ -w 5000 -t 500 -c contaminant_training.fa -r target_training.fa -d KL -n 0.5\""

I didn't mention it before, because it seems to be a known bug that one doesn't need to be concerned with, but I also received the following message 4x much earlier along in the process.

/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/externals/loky/backend/popen_loky_posix.py:77: RuntimeWarning: divide by zero encountered in log
  if pid == self.pid:
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/externals/loky/backend/popen_loky_posix.py:77: RuntimeWarning: invalid value encountered in multiply
  if pid == self.pid:

How can I send you some minimal data?

ckeeling commented 5 years ago

I've been trying to understand why I'm getting this message. I have been using the same data that generated the third figure above (4 clusters from 10,000 cut-off), but JSD with contalocate.R. It seems that the double threshold failed to find any contaminant scaffolds:

>data[["Select_conta"]]
integer(0)

There are non-zero values for the two thresholds, from Humans are not worthy:

> threshold_conta
[1] 6.491595
> threshold_host
[1] 30.10149

However, maybe the threshold_host is too high? As I understand your double threshold test to obtain data[["Select_conta"]], both thresholds should optimally be between the two peaks (contra and target), correct? Here are the plots:

image

image

Is the host threshold saying the target and contamination are too close to separate? Thoughts @itsmeludo or @T-B-F ?

ckeeling commented 5 years ago

Rather than letting the script find the thresholds automatically, manually setting the thresholds with the -m option of contalocate.R works.

itsmeludo commented 5 years ago

Hi Chris! The threshold definition system is actually roubles by the multicontamination. On the first graph you can see 2 modes and the threshold should be on the left of the first, in the valley. On the second graph, the threshold is a bit high but it should be too much of an issue once the first threshold set up properly. It can still be lowered a bit, closer to the base of the base of the peak. Distribution of distances at 0 is a little odd but could come from exact kmer profiles from different or exact sequences, for example from duplication or repeated elements. It can also happen when some kmers are not observed, making a division by 0. This is handled by setting a value of 0 instead of nan. Other Thant that and using the manual threshold for this case, the system should would with thresholds in the valley and just after the peak for the second graph as in the documentation. Once you have subtracted the sequences from one contamination (or one species for metagenomics) run the a second time the subtraction for the second species with its own profile. You should have you three species separated :)