Closed mbhall88 closed 3 years ago
Very rational. An important question which I think makes a lot of sense for the future of Atlas too.
In Methods #3, what do you mean? Are these the options you describe in Discussion?
Discussion #1 would have more weight if you can run this approach iteratively multiple times with a ton of random mixing between Illumina and Nanopore. Instead of qualitatively describing the difference between a single experiment of Nanopore / Illumina / random mix of both, you could look at a more general measure of diversity in clustering (difficult to define) between all those random mixing runs. If it's low, it's more reassuring for labs who can't predict what Illumina / Nanopore data they will have on hand in the future.
Discussion #3 I like this approach a lot. If I understand it well, it allows for threshold calibration. Since both nodes (Illumina and Nanopore consensus sequence, right?) of each sample are in the same network, we can adjust the clustering SNP threshold at technology level rather than at network level. In a way, we adjust for the under-calling of Nanopore. That is, a Nanopore sample can be clustered with an Illumina one at threshold "x", which would not necessarily lead to clustering between two Illumina samples. Varying those technology-specific thresholds within a single network with all the duplicates allows calibration so that all technology-mixed edges align with those of Illumina alone (reference data set). Even if there's a lot "beyond the SNP threshold", this is a valid attempt at establishing specific thresholds for mixed-technology studies. Probably a lot of work...
What do you think?
There is a barrier of me understanding the bioinformatics of it and a barrier of me writing what I have in mind in English ... :) I'm happy to jump on a side call if that helps.
> In Methods #3, what do you mean? Are these the options you describe in Discussion?

Yes, it's the stuff in the discussion points.

> Discussion #1 would have more weight if you can run this approach iteratively multiple times with a ton of random mixing between Illumina and Nanopore. Instead of qualitatively describing the difference between a single experiment of Nanopore / Illumina / random mix of both, you could look at a more general measure of diversity in clustering (difficult to define) between all those random mixing runs. If it's low, it's more reassuring for labs who can't predict what Illumina / Nanopore data they will have on hand in the future.

Understood. In terms of technical implementation, this approach of running random mixing is also likely to be easier than options 2 and 3.

> Discussion #3 I like this approach a lot. If I understand it well, it allows for threshold calibration. Since both nodes (Illumina and Nanopore consensus sequence, right?) of each sample are in the same network, we can adjust the clustering SNP threshold at technology level rather than at network level. In a way, we adjust for the under-calling of Nanopore. That is, a Nanopore sample can be clustered with an Illumina one at threshold "x", which would not necessarily lead to clustering between two Illumina samples. Varying those technology-specific thresholds within a single network with all the duplicates allows calibration so that all technology-mixed edges align with those of Illumina alone (reference data set). Even if there's a lot "beyond the SNP threshold", this is a valid attempt at establishing specific thresholds for mixed-technology studies. Probably a lot of work...
>
> What do you think?

If I understand you correctly here, then yes, the idea is to try and find a threshold that gives the closest clusters to those produced by Illumina-only. Effectively the same analysis that we did to get the Nanopore threshold (which was very close to the Illumina anyway).

> There is a barrier of me understanding the bioinformatics of it and a barrier of me writing what I have in mind in English ... :) I'm happy to jump on a side call if that helps.

I think you've explained yourself perfectly fine.
All up, I'm very excited about the addition of this - I think it makes the project even more impactful (and novel).
This is interesting. The question of whether you can use mixed tech for phylogenetics has been looked at a lot I think, but not really published on, so a solution to that (which is not what you're saying) would be of very wide interest. Leah has spent time on this prior to joining our group. Essentially she'd say that building a phylogeny from Nanopore and Illumina assemblies (e.g. with parsnp) works as far as tree structure is concerned, but you get branches that are too long for the Nanopore samples.
That's background. For your clustering question, you want to be able to have a cast iron way of saying whether a certain clustering was acceptable compared with illumina/compass. Once you have that, you can apply it to mixed tech.
I.e. you need two things:
1. An evaluation methodology that says yes/no/how close a new clustering (e.g. bcftools with Nanopore) is to being acceptable.
2. A new clustering method for mixed tech.

2 is much easier and clearer if we can make 1 categorically clear.
OK, so having caught up on quantifying cluster similarities https://github.com/mbhall88/head_to_head_pipeline/issues/65 i'm much more sanguine. In particular i did not expect to be able to use ~the same, or even precisely the same distance threshold in bcftools as we use in compass. @mbhall88 i look forward to talking on Monday, but some ideas.
Suppose we define a ratio r which ranges from 0.1 to 0.9 (say). This is going to be the proportion of genomes which are Nanopore. Suppose for the moment that we fix the bcftools threshold to equal the compass threshold. Now suppose we do the following
n = 0
do {
    for each sample, assign it randomly to nanopore with probability r (else illumina)
    cluster the resulting mixed set of samples
    calculate ACR5, ACP5 and add to an array
    n = n + 1
} while n < 1000
So you basically do 1000 simulations, each with a different random choice of which samples are Nanopore and which are Illumina, but always with a proportion r (on average) being Nanopore. Then look at a violin plot of how ACR5, ACP5 look.
Right, having done this, you can see how this varies with r.
Does this sound ok? Would give you pretty convincing data i think on how well you can cluster with mixed data
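The loop described above could be sketched in Python roughly as follows. This is a minimal sketch, not the pipeline's code: `assign_technologies`, `simulate` and the `evaluate` callback are hypothetical names, and `evaluate` stands in for "cluster the mixed data and compute the metrics", which is not shown here.

```python
import random


def assign_technologies(samples, r, rng):
    """Assign each sample to Nanopore with probability r, otherwise Illumina."""
    return {s: ("nanopore" if rng.random() < r else "illumina") for s in samples}


def simulate(samples, r, evaluate, n_sims=1000, seed=0):
    """Run n_sims random assignments and collect whatever `evaluate` returns.

    `evaluate` is a hypothetical callback: it receives the sample -> technology
    mapping and should return the metric(s) of interest for that assignment.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        tech = assign_technologies(samples, r, rng)
        results.append(evaluate(tech))
    return results


# Toy usage: the "metric" here is just the realised Nanopore fraction.
samples = [f"sample_{i}" for i in range(150)]


def nanopore_fraction(tech):
    return sum(t == "nanopore" for t in tech.values()) / len(tech)


fractions = simulate(samples, r=0.25, evaluate=nanopore_fraction)
```

Because each sample is assigned independently, the realised Nanopore fraction varies around r from run to run rather than being exactly r every time.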
This sounds great. Very close to what I had in mind. I like the addition of playing with a ratio of allocations to the two techs!
I have just created the mixed distance matrix. So I will make a start on this on Monday. :rocket:
I knew it wouldn't take long before I could not track the specifics:) But I'm happy we all agree this is an important question.
The first thing we look at is the mixed technology "self-distance". By self-distance I mean the distance between a sample's Illumina and Nanopore consensus sequences. This should give us a good indication of how sane it is to compare distances across the two technologies.
The big outlier here is mada_1-33 with a self-distance of 53. When I compare its distance to all other samples (that passed QC), its closest match is itself. So this suggests it is just discrepant rather than a sample swap. However, the caveat here is that it could have a closer match with something that didn't pass QC.
On the whole, though, the bulk of the self-distances are 0 (as is the median). The summary statistics for the distribution are
count 150.000000
mean 1.013333
std 4.451056
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 53.000000
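For anyone wanting to reproduce this kind of summary, here is a minimal sketch using only the Python standard library. It assumes the mixed matrix is held as a dict of dicts where `mixed[a][b]` is the distance between a's Illumina and b's Nanopore consensus; the sample names and values below are toy data, not the real matrix, and `self_distances` / `summarise` are hypothetical helper names.

```python
import statistics


def self_distances(mixed):
    """Pull out the diagonal: each sample's Illumina vs its own Nanopore."""
    return {sample: row[sample] for sample, row in mixed.items()}


def summarise(values):
    """Summary statistics in the same layout as the listing above."""
    values = sorted(values)
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": values[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": values[-1],
    }


# Toy data only: five hypothetical samples, off-diagonal distances set to 10.
names = ["s1", "s2", "s3", "s4", "s5"]
diagonal = [0, 0, 0, 1, 53]
mixed = {a: {b: (d if a == b else 10) for b in names} for a, d in zip(names, diagonal)}

summary = summarise(self_distances(mixed).values())
```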
Next, we look at the distance dotplot. On the x-axis we have the compass distance for the pair. On the y-axis we have the mixed distance. The difference between this dotplot and the one in #7 is that the matrix is not symmetrical. That is, sample_a x sample_b is not the same as sample_b x sample_a. So we have twice as many points as we do in #7. The first sample in the pair indicates the Illumina consensus for that sample, and the second is the Nanopore consensus for that sample. So, sample_a x sample_b is the distance between sample a's Illumina and sample b's Nanopore.
The inset is the pairs where the Illumina distance is <= 100. This is a slight change to #7 in terms of the way I compute the line of best fit. I've been doing some reading and I think a Robust Linear Regression is more suited to our needs. In particular, we use the random sample consensus (RANSAC) variant.
Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. Therefore, it also can be interpreted as an outlier detection method
I am open to discussion if you think this is a bad approach @iqbal-lab. The other option could be the Huber variant which only detects outliers in the response variable (mixed distance here) and just weights those outliers much lower rather than not considering them at all.
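To make the RANSAC idea concrete, here is a small pure-Python sketch of fitting a line with it. This is an illustration of the technique only, not the code behind the plots (in practice a library implementation such as scikit-learn's `RANSACRegressor` would be the more likely choice). The function names, the `residual_threshold` value and the toy points are all assumptions.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ransac_line(points, residual_threshold, n_trials=200, seed=0):
    """Fit a line while giving outliers no influence on the estimate.

    Repeatedly fit a candidate line through two random points, count the
    points within residual_threshold of it, keep the largest such consensus
    set, then refit by least squares on that set only.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # two points with the same x do not define a usable line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        inliers = [
            (x, y) for x, y in points
            if abs(y - (slope * x + intercept)) <= residual_threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if not best_inliers:
        raise ValueError("no consensus set found")
    return fit_line(best_inliers), best_inliers


# Toy data: 20 points on y = x plus a small cloud of high outliers.
points = [(x, x) for x in range(20)]
points += [(2, 60), (5, 65), (8, 62), (12, 69), (15, 66)]

(slope, intercept), inliers = ransac_line(points, residual_threshold=1.0)
naive_slope, naive_intercept = fit_line(points)
```

On the toy data the RANSAC fit recovers the underlying line, whereas the plain least-squares fit over all points is dragged towards the outlier cloud.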
You can see on the inset dotplot a cloud of distinct outliers. These are attributed to 6 samples. I've shown all combinations of these 6 samples in the table below with their distances for all combinations of technologies. I've looked at these for an hour or so and can't seem to see a particular pattern that would suggest a sample swap. It just looks like the ONT data for a couple of the samples is quite different to the Illumina data of some of the others. Which is fine; I guess that's the point of this analysis in the first place.
sample A | sample B | Illu. dist. | ONT dist. | A Illu. x B ONT | A ONT x B Illu. |
---|---|---|---|---|---|
R26791 | R20260 | 20 | 19 | 20 | 66 |
R26791 | R27725 | 6 | 5 | 5 | 6 |
R26791 | R18043 | 12 | 11 | 11 | 12 |
R26791 | R20574 | 20 | 18 | 20 | 66 |
R26791 | R21408 | 21 | 3 | 5 | 67 |
R20260 | R27725 | 13 | 18 | 62 | 12 |
R20260 | R18043 | 20 | 21 | 69 | 19 |
R20260 | R20574 | 0 | 0 | 0 | 0 |
R20260 | R21408 | 13 | 6 | 8 | 12 |
R27725 | R18043 | 4 | 5 | 4 | 6 |
R27725 | R20574 | 13 | 13 | 14 | 62 |
R27725 | R21408 | 15 | 6 | 5 | 65 |
R18043 | R20574 | 20 | 18 | 20 | 69 |
R18043 | R21408 | 21 | 9 | 10 | 69 |
R20574 | R21408 | 13 | 8 | 8 | 12 |
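The four distance columns in the table correspond to the four possible technology combinations for a pair. A minimal sketch of picking the right distance for a pair, given which technology each sample was assigned, might look like this. The function name and the dict-of-dicts layout are assumptions; the numbers in the usage are the R26791 / R21408 row from the table.

```python
def pair_distance(a, b, tech, illumina, nanopore, mixed):
    """Distance between samples a and b given each sample's technology.

    mixed[x][y] is the distance between x's Illumina consensus and y's
    Nanopore consensus, so the lookup order matters for mixed pairs.
    """
    tech_a, tech_b = tech[a], tech[b]
    if tech_a == tech_b == "illumina":
        return illumina[a][b]
    if tech_a == tech_b == "nanopore":
        return nanopore[a][b]
    if tech_a == "illumina":
        return mixed[a][b]  # a is Illumina, b is Nanopore
    return mixed[b][a]  # a is Nanopore, b is Illumina


# Values taken from the R26791 / R21408 row of the table.
A, B = "R26791", "R21408"
illumina = {A: {B: 21}, B: {A: 21}}
nanopore = {A: {B: 3}, B: {A: 3}}
mixed = {A: {B: 5}, B: {A: 67}}

both_ont = pair_distance(A, B, {A: "nanopore", B: "nanopore"}, illumina, nanopore, mixed)
a_illu_b_ont = pair_distance(A, B, {A: "illumina", B: "nanopore"}, illumina, nanopore, mixed)
a_ont_b_illu = pair_distance(A, B, {A: "nanopore", B: "illumina"}, illumina, nanopore, mixed)
```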
Description of the mixed technology simulation:
This mostly follows Zam's previous comment, but I will lay it out in explicit terms here. If this doesn't make sense @simongrandjeanlapierre please speak up - if you don't understand it there will be a lot of people who don't and we want to make this analysis clear.
To begin with, we define a list of clustering thresholds (T) that we are interested in - in our case, T is [0, 2, 5, 12]. We also define a list of ratios (R) that describe the mixture of Nanopore/Illumina data. For example, if we have a ratio of 0.75, then 75% of the data we select is Nanopore and the other 25% is Illumina. For each threshold (t) within T we do the following:
I have the infrastructure for running the simulations setup. There are just some pieces we need to decide on:
The ratios (R) to use, proposed as [0.1, 0.25, 0.5, 0.75, 0.9]
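For readers less familiar with the clustering step, here is a rough sketch of clustering at each threshold in T. It assumes clusters are the connected components of a graph in which two samples are joined when their distance is at or below the threshold, and that singletons are not reported as clusters; both are assumptions made for illustration, as are the function name and the toy distances.

```python
from itertools import combinations


def cluster(samples, distance, threshold):
    """Connected components of the graph with an edge where distance <= threshold.

    `distance` is any callable taking two sample names. Samples left on
    their own (singletons) are not returned as clusters.
    """
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent links to the component's root, compressing as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(samples, 2):
        if distance(a, b) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), set()).add(s)
    return [g for g in groups.values() if len(g) > 1]


# Toy distances: a-b are 1 apart, b-c are 4 apart, everything else is 20.
toy = {frozenset("ab"): 1, frozenset("bc"): 4}


def toy_distance(a, b):
    return toy.get(frozenset((a, b)), 20)


samples = ["a", "b", "c", "d"]
clusters_by_threshold = {t: cluster(samples, toy_distance, t) for t in [0, 2, 5, 12]}
```

In the toy data, c joins the a-b cluster at threshold 5 through its link to b, even though a and c are 20 apart.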
Lastly, here is an example of the visualisation (pending changes from the above points if necessary). I only ran it with N = 100.
Also, here is an example of the plot without the CNR
I think what you are currently doing (averaging over samples) is correct, as it will weigh more heavily towards the effect on big clusters.
Averaging over cluster averages is what I originally intended I think.
As to naming, if we wanted to be precise it would be Sample-Averaged Cluster-Recall?
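To illustrate the difference between the two ways of averaging, here is a small sketch. The per-sample recall used here (the fraction of a sample's truth cluster that also appears in its test cluster) is an assumed definition for illustration only; the definition actually used is in #65. All function names and the toy clusters are hypothetical.

```python
def cluster_lookup(clusters):
    """Map each sample to the set of samples in its cluster."""
    return {s: set(c) for c in clusters for s in c}


def recall_per_sample(truth, test):
    """Assumed definition: share of a sample's truth cluster found in its test cluster."""
    test_of = cluster_lookup(test)
    recall = {}
    for truth_cluster in truth:
        truth_cluster = set(truth_cluster)
        for s in truth_cluster:
            found = test_of.get(s, {s})  # unclustered in test: only itself
            recall[s] = len(truth_cluster & found) / len(truth_cluster)
    return recall


def sample_averaged_recall(truth, test):
    """Average over samples, so big clusters carry more weight."""
    recall = recall_per_sample(truth, test)
    return sum(recall.values()) / len(recall)


def cluster_averaged_recall(truth, test):
    """Average within each truth cluster first, then over clusters."""
    recall = recall_per_sample(truth, test)
    per_cluster = [sum(recall[s] for s in c) / len(c) for c in truth]
    return sum(per_cluster) / len(per_cluster)


# Toy example: the big cluster is recovered perfectly, the small one is split.
truth = [{"a", "b", "c", "d"}, {"e", "f"}]
test = [{"a", "b", "c", "d"}]

by_sample = sample_averaged_recall(truth, test)
by_cluster = cluster_averaged_recall(truth, test)
```

With these toy clusters the sample-averaged value is 5/6 and the cluster-averaged value is 0.75, because the perfectly recovered four-sample cluster counts four times in the first and once in the second.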
Sorry, yes i think include CNR, lovely plots
Ratios are fine. Tempting to add 0.01, 0.05. 1000 is plenty.
Results from running 1000 simulations per ratio.
Summary statistics for the above plots. Of interest: for all ratio and threshold combinations, the median SACR is 1.0. Meaning, regardless of the Nanopore-to-Illumina ratio, for all thresholds we used, at least half of the simulations miss no sample from the truth clusters.
ratio | threshold | metric | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0 | CNR | 1000.0 | 1.001086 | 0.021478 | 0.857143 | 1.000000 | 1.000000 | 1.000000 | 1.200000 |
0.01 | 0 | SACP | 1000.0 | 0.997692 | 0.016703 | 0.846154 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 0 | SACR | 1000.0 | 0.997077 | 0.020052 | 0.846154 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 2 | CNR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 2 | SACP | 1000.0 | 0.999864 | 0.002153 | 0.965909 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 5 | CNR | 1000.0 | 0.998867 | 0.013887 | 0.923077 | 1.000000 | 1.000000 | 1.000000 | 1.090909 |
0.01 | 5 | SACP | 1000.0 | 0.999171 | 0.006908 | 0.928571 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 5 | SACR | 1000.0 | 0.999100 | 0.007604 | 0.928571 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 12 | CNR | 1000.0 | 1.001333 | 0.008507 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.055556 |
0.01 | 12 | SACP | 1000.0 | 0.997979 | 0.010386 | 0.930189 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 12 | SACR | 1000.0 | 0.999827 | 0.002441 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 0 | CNR | 1000.0 | 1.001914 | 0.050367 | 0.857143 | 1.000000 | 1.000000 | 1.000000 | 1.200000 |
0.05 | 0 | SACP | 1000.0 | 0.990154 | 0.034728 | 0.769231 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 0 | SACR | 1000.0 | 0.987898 | 0.040636 | 0.717954 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 2 | CNR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 2 | SACP | 1000.0 | 0.998227 | 0.007573 | 0.965909 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 5 | CNR | 1000.0 | 0.993728 | 0.033907 | 0.857143 | 1.000000 | 1.000000 | 1.000000 | 1.090909 |
0.05 | 5 | SACP | 1000.0 | 0.993049 | 0.019196 | 0.877554 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 5 | SACR | 1000.0 | 0.993443 | 0.020012 | 0.871429 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 12 | CNR | 1000.0 | 1.007471 | 0.019719 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.117647 |
0.05 | 12 | SACP | 1000.0 | 0.988758 | 0.023649 | 0.873585 | 0.987423 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 12 | SACR | 1000.0 | 0.998305 | 0.007472 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 0 | CNR | 1000.0 | 1.004343 | 0.070142 | 0.857143 | 1.000000 | 1.000000 | 1.000000 | 1.200000 |
0.10 | 0 | SACP | 1000.0 | 0.979769 | 0.048664 | 0.769231 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 0 | SACR | 1000.0 | 0.975616 | 0.055949 | 0.717954 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 2 | CNR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 2 | SACP | 1000.0 | 0.996557 | 0.010278 | 0.965909 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 5 | CNR | 1000.0 | 0.986347 | 0.047736 | 0.857143 | 0.923077 | 1.000000 | 1.000000 | 1.090909 |
0.10 | 5 | SACP | 1000.0 | 0.987037 | 0.025115 | 0.877554 | 0.982143 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 5 | SACR | 1000.0 | 0.989143 | 0.025094 | 0.871429 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 12 | CNR | 1000.0 | 1.016582 | 0.028292 | 1.000000 | 1.000000 | 1.000000 | 1.055556 | 1.117647 |
0.10 | 12 | SACP | 1000.0 | 0.975084 | 0.033132 | 0.854717 | 0.943396 | 0.987423 | 1.000000 | 1.000000 |
0.10 | 12 | SACR | 1000.0 | 0.996748 | 0.010101 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 0 | CNR | 1000.0 | 0.998743 | 0.096474 | 0.857143 | 1.000000 | 1.000000 | 1.000000 | 1.200000 |
0.25 | 0 | SACP | 1000.0 | 0.957692 | 0.066539 | 0.769231 | 0.923077 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 0 | SACR | 1000.0 | 0.947950 | 0.077849 | 0.717954 | 0.871800 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 2 | CNR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 2 | SACP | 1000.0 | 0.990966 | 0.015053 | 0.965909 | 0.965909 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 5 | CNR | 1000.0 | 0.964256 | 0.060380 | 0.800000 | 0.923077 | 1.000000 | 1.000000 | 1.090909 |
0.25 | 5 | SACP | 1000.0 | 0.971654 | 0.034605 | 0.877554 | 0.948982 | 0.982143 | 1.000000 | 1.000000 |
0.25 | 5 | SACR | 1000.0 | 0.978943 | 0.034639 | 0.871429 | 0.942857 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 12 | CNR | 1000.0 | 1.035820 | 0.037932 | 1.000000 | 1.000000 | 1.055556 | 1.055556 | 1.117647 |
0.25 | 12 | SACP | 1000.0 | 0.945687 | 0.041868 | 0.842140 | 0.914642 | 0.943396 | 0.983823 | 1.000000 |
0.25 | 12 | SACR | 1000.0 | 0.994327 | 0.012816 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 0 | CNR | 1000.0 | 0.974371 | 0.108913 | 0.857143 | 0.857143 | 1.000000 | 1.000000 | 1.200000 |
0.50 | 0 | SACP | 1000.0 | 0.941385 | 0.076085 | 0.769231 | 0.846154 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 0 | SACR | 1000.0 | 0.928565 | 0.088733 | 0.717954 | 0.846154 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 2 | CNR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 2 | SACP | 1000.0 | 0.983261 | 0.017051 | 0.965909 | 0.965909 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 5 | CNR | 1000.0 | 0.936143 | 0.065786 | 0.800000 | 0.923077 | 0.923077 | 1.000000 | 1.090909 |
0.50 | 5 | SACP | 1000.0 | 0.952238 | 0.037418 | 0.877554 | 0.928571 | 0.948982 | 0.976189 | 1.000000 |
0.50 | 5 | SACR | 1000.0 | 0.974271 | 0.037011 | 0.871429 | 0.928571 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 12 | CNR | 1000.0 | 1.064108 | 0.041800 | 1.000000 | 1.055556 | 1.055556 | 1.117647 | 1.117647 |
0.50 | 12 | SACP | 1000.0 | 0.906405 | 0.043110 | 0.842140 | 0.861008 | 0.911321 | 0.930189 | 1.000000 |
0.50 | 12 | SACR | 1000.0 | 0.991144 | 0.015105 | 0.965406 | 0.965406 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 0 | CNR | 1000.0 | 0.924171 | 0.091556 | 0.857143 | 0.857143 | 0.857143 | 1.000000 | 1.200000 |
0.75 | 0 | SACP | 1000.0 | 0.955077 | 0.068532 | 0.769231 | 0.923077 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 0 | SACR | 1000.0 | 0.944924 | 0.080079 | 0.717954 | 0.871800 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 2 | CNR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 2 | SACP | 1000.0 | 0.974398 | 0.014749 | 0.965909 | 0.965909 | 0.965909 | 0.965909 | 1.000000 |
0.75 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 5 | CNR | 1000.0 | 0.903108 | 0.054657 | 0.800000 | 0.857143 | 0.923077 | 0.923077 | 1.090909 |
0.75 | 5 | SACP | 1000.0 | 0.944734 | 0.031928 | 0.877554 | 0.948982 | 0.948982 | 0.970232 | 1.000000 |
0.75 | 5 | SACR | 1000.0 | 0.984100 | 0.030043 | 0.871429 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 12 | CNR | 1000.0 | 1.089444 | 0.036284 | 1.000000 | 1.055556 | 1.117647 | 1.117647 | 1.117647 |
0.75 | 12 | SACP | 1000.0 | 0.874912 | 0.034691 | 0.842140 | 0.844830 | 0.857408 | 0.902064 | 0.974845 |
0.75 | 12 | SACR | 1000.0 | 0.993704 | 0.013355 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 0 | CNR | 1000.0 | 0.883857 | 0.058728 | 0.857143 | 0.857143 | 0.857143 | 0.857143 | 1.200000 |
0.90 | 0 | SACP | 1000.0 | 0.980154 | 0.048151 | 0.769231 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 0 | SACR | 1000.0 | 0.976154 | 0.055170 | 0.717954 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 2 | CNR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 2 | SACP | 1000.0 | 0.969625 | 0.010629 | 0.965909 | 0.965909 | 0.965909 | 0.965909 | 1.000000 |
0.90 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 5 | CNR | 1000.0 | 0.876832 | 0.036565 | 0.800000 | 0.857143 | 0.857143 | 0.923077 | 1.090909 |
0.90 | 5 | SACP | 1000.0 | 0.944180 | 0.023168 | 0.877554 | 0.948982 | 0.948982 | 0.948982 | 1.000000 |
0.90 | 5 | SACR | 1000.0 | 0.992057 | 0.022303 | 0.928571 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 12 | CNR | 1000.0 | 1.107758 | 0.023718 | 1.000000 | 1.117647 | 1.117647 | 1.117647 | 1.117647 |
0.90 | 12 | SACP | 1000.0 | 0.855417 | 0.022494 | 0.842140 | 0.844830 | 0.844830 | 0.857408 | 0.971245 |
0.90 | 12 | SACR | 1000.0 | 0.997094 | 0.009601 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
As far as i'm concerned, this analysis is complete. Sticking to our intended aims (showing we can do clustering), i think we are done. I guess a reviewer who wanted to push beyond our stated goals might ask for how similar the trees are with mixed tech compared with pure illumina. I propose we do not do this.
As per https://github.com/mbhall88/head_to_head_pipeline/issues/65#issuecomment-819129800, I have removed CNR from the simulations and replaced it with XCR. The updated plots and table are below (note the table says 1-XCR but the actual value shown is XCR)
ratio | threshold | metric | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0 | 1-XCR | 1000.0 | 0.000073 | 0.001030 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014599 |
0.01 | 0 | SACP | 1000.0 | 0.997692 | 0.016703 | 0.846154 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 0 | SACR | 1000.0 | 0.997077 | 0.020052 | 0.846154 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 2 | 1-XCR | 1000.0 | 0.000031 | 0.000493 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007812 |
0.01 | 2 | SACP | 1000.0 | 0.999864 | 0.002153 | 0.965909 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 5 | 1-XCR | 1000.0 | 0.000385 | 0.002388 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.016393 |
0.01 | 5 | SACP | 1000.0 | 0.999171 | 0.006908 | 0.928571 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 5 | SACR | 1000.0 | 0.999100 | 0.007604 | 0.928571 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 12 | 1-XCR | 1000.0 | 0.000433 | 0.002069 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010309 |
0.01 | 12 | SACP | 1000.0 | 0.997979 | 0.010386 | 0.930189 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.01 | 12 | SACR | 1000.0 | 0.999827 | 0.002441 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 0 | 1-XCR | 1000.0 | 0.000657 | 0.003028 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014599 |
0.05 | 0 | SACP | 1000.0 | 0.990154 | 0.034728 | 0.769231 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 0 | SACR | 1000.0 | 0.987898 | 0.040636 | 0.717954 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 2 | 1-XCR | 1000.0 | 0.000406 | 0.001735 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007812 |
0.05 | 2 | SACP | 1000.0 | 0.998227 | 0.007573 | 0.965909 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 5 | 1-XCR | 1000.0 | 0.002697 | 0.006042 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.032787 |
0.05 | 5 | SACP | 1000.0 | 0.993049 | 0.019196 | 0.877554 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 5 | SACR | 1000.0 | 0.993443 | 0.020012 | 0.871429 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 12 | 1-XCR | 1000.0 | 0.002082 | 0.004510 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.020619 |
0.05 | 12 | SACP | 1000.0 | 0.988758 | 0.023649 | 0.873585 | 0.987423 | 1.000000 | 1.000000 | 1.000000 |
0.05 | 12 | SACR | 1000.0 | 0.998305 | 0.007472 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 0 | 1-XCR | 1000.0 | 0.001358 | 0.004242 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014599 |
0.10 | 0 | SACP | 1000.0 | 0.979769 | 0.048664 | 0.769231 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 0 | SACR | 1000.0 | 0.975616 | 0.055949 | 0.717954 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 2 | 1-XCR | 1000.0 | 0.000789 | 0.002355 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007812 |
0.10 | 2 | SACP | 1000.0 | 0.996557 | 0.010278 | 0.965909 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 5 | 1-XCR | 1000.0 | 0.005820 | 0.008756 | 0.000000 | 0.000000 | 0.000000 | 0.016393 | 0.040984 |
0.10 | 5 | SACP | 1000.0 | 0.987037 | 0.025115 | 0.877554 | 0.982143 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 5 | SACR | 1000.0 | 0.989143 | 0.025094 | 0.871429 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.10 | 12 | 1-XCR | 1000.0 | 0.004825 | 0.006180 | 0.000000 | 0.000000 | 0.000000 | 0.010309 | 0.030928 |
0.10 | 12 | SACP | 1000.0 | 0.975084 | 0.033132 | 0.854717 | 0.943396 | 0.987423 | 1.000000 | 1.000000 |
0.10 | 12 | SACR | 1000.0 | 0.996748 | 0.010101 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 0 | 1-XCR | 1000.0 | 0.003533 | 0.006256 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014599 |
0.25 | 0 | SACP | 1000.0 | 0.957692 | 0.066539 | 0.769231 | 0.923077 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 0 | SACR | 1000.0 | 0.947950 | 0.077849 | 0.717954 | 0.871800 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 2 | 1-XCR | 1000.0 | 0.002070 | 0.003450 | 0.000000 | 0.000000 | 0.000000 | 0.007812 | 0.007812 |
0.25 | 2 | SACP | 1000.0 | 0.990966 | 0.015053 | 0.965909 | 0.965909 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 5 | 1-XCR | 1000.0 | 0.014615 | 0.012646 | 0.000000 | 0.000000 | 0.016393 | 0.024590 | 0.057377 |
0.25 | 5 | SACP | 1000.0 | 0.971654 | 0.034605 | 0.877554 | 0.948982 | 0.982143 | 1.000000 | 1.000000 |
0.25 | 5 | SACR | 1000.0 | 0.978943 | 0.034639 | 0.871429 | 0.942857 | 1.000000 | 1.000000 | 1.000000 |
0.25 | 12 | 1-XCR | 1000.0 | 0.011629 | 0.008000 | 0.000000 | 0.010309 | 0.010309 | 0.020619 | 0.030928 |
0.25 | 12 | SACP | 1000.0 | 0.945687 | 0.041868 | 0.842140 | 0.914642 | 0.943396 | 0.983823 | 1.000000 |
0.25 | 12 | SACR | 1000.0 | 0.994327 | 0.012816 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 0 | 1-XCR | 1000.0 | 0.007109 | 0.007300 | 0.000000 | 0.000000 | 0.000000 | 0.014599 | 0.014599 |
0.50 | 0 | SACP | 1000.0 | 0.941385 | 0.076085 | 0.769231 | 0.846154 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 0 | SACR | 1000.0 | 0.928565 | 0.088733 | 0.717954 | 0.846154 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 2 | 1-XCR | 1000.0 | 0.003836 | 0.003908 | 0.000000 | 0.000000 | 0.000000 | 0.007812 | 0.007812 |
0.50 | 2 | SACP | 1000.0 | 0.983261 | 0.017051 | 0.965909 | 0.965909 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 5 | 1-XCR | 1000.0 | 0.028566 | 0.015460 | 0.000000 | 0.016393 | 0.032787 | 0.040984 | 0.057377 |
0.50 | 5 | SACP | 1000.0 | 0.952238 | 0.037418 | 0.877554 | 0.928571 | 0.948982 | 0.976189 | 1.000000 |
0.50 | 5 | SACR | 1000.0 | 0.974271 | 0.037011 | 0.871429 | 0.928571 | 1.000000 | 1.000000 | 1.000000 |
0.50 | 12 | 1-XCR | 1000.0 | 0.019155 | 0.008028 | 0.000000 | 0.010309 | 0.020619 | 0.020619 | 0.030928 |
0.50 | 12 | SACP | 1000.0 | 0.906405 | 0.043110 | 0.842140 | 0.861008 | 0.911321 | 0.930189 | 1.000000 |
0.50 | 12 | SACR | 1000.0 | 0.991144 | 0.015105 | 0.965406 | 0.965406 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 0 | 1-XCR | 1000.0 | 0.010847 | 0.006382 | 0.000000 | 0.000000 | 0.014599 | 0.014599 | 0.014599 |
0.75 | 0 | SACP | 1000.0 | 0.955077 | 0.068532 | 0.769231 | 0.923077 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 0 | SACR | 1000.0 | 0.944924 | 0.080079 | 0.717954 | 0.871800 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 2 | 1-XCR | 1000.0 | 0.005867 | 0.003380 | 0.000000 | 0.007812 | 0.007812 | 0.007812 | 0.007812 |
0.75 | 2 | SACP | 1000.0 | 0.974398 | 0.014749 | 0.965909 | 0.965909 | 0.965909 | 0.965909 | 1.000000 |
0.75 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 5 | 1-XCR | 1000.0 | 0.042090 | 0.013715 | 0.000000 | 0.032787 | 0.040984 | 0.057377 | 0.057377 |
0.75 | 5 | SACP | 1000.0 | 0.944734 | 0.031928 | 0.877554 | 0.948982 | 0.948982 | 0.970232 | 1.000000 |
0.75 | 5 | SACR | 1000.0 | 0.984100 | 0.030043 | 0.871429 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.75 | 12 | 1-XCR | 1000.0 | 0.025639 | 0.006273 | 0.010309 | 0.020619 | 0.030928 | 0.030928 | 0.030928 |
0.75 | 12 | SACP | 1000.0 | 0.874912 | 0.034691 | 0.842140 | 0.844830 | 0.857408 | 0.902064 | 0.974845 |
0.75 | 12 | SACR | 1000.0 | 0.993704 | 0.013355 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 0 | 1-XCR | 1000.0 | 0.013212 | 0.004283 | 0.000000 | 0.014599 | 0.014599 | 0.014599 | 0.014599 |
0.90 | 0 | SACP | 1000.0 | 0.980154 | 0.048151 | 0.769231 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 0 | SACR | 1000.0 | 0.976154 | 0.055170 | 0.717954 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 2 | 1-XCR | 1000.0 | 0.006961 | 0.002436 | 0.000000 | 0.007812 | 0.007812 | 0.007812 | 0.007812 |
0.90 | 2 | SACP | 1000.0 | 0.969625 | 0.010629 | 0.965909 | 0.965909 | 0.965909 | 0.965909 | 1.000000 |
0.90 | 2 | SACR | 1000.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 5 | 1-XCR | 1000.0 | 0.051615 | 0.009346 | 0.016393 | 0.049180 | 0.057377 | 0.057377 | 0.057377 |
0.90 | 5 | SACP | 1000.0 | 0.944180 | 0.023168 | 0.877554 | 0.948982 | 0.948982 | 0.948982 | 1.000000 |
0.90 | 5 | SACR | 1000.0 | 0.992057 | 0.022303 | 0.928571 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.90 | 12 | 1-XCR | 1000.0 | 0.028701 | 0.004512 | 0.010309 | 0.030928 | 0.030928 | 0.030928 | 0.030928 |
0.90 | 12 | SACP | 1000.0 | 0.855417 | 0.022494 | 0.842140 | 0.844830 | 0.844830 | 0.857408 | 0.971245 |
0.90 | 12 | SACR | 1000.0 | 0.997094 | 0.009601 | 0.965406 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
In the update call on Monday, Simon raised a very interesting question that I think will give this work another novel component.
The question effectively boils down to
Background
From some searching through the literature there doesn't really seem to be too much work on this. Sanderson et al. touch on SNP differences between ONT and Illumina for the same sample, but not comparing across technologies and samples. Adikari et al. also did a similar comparison of pairwise consensus sequence distance, but again, only looking at the clusters from ONT compared to the clusters from Illumina.
Aim
Answer the question mentioned above. Does it make sense to try and cluster sample A and B if one is sequenced with Illumina and the other with ONT? We can effectively answer this question with similar metrics to those in #65.
Method
I see there being three main plots/results to come out of this
Discussion/Unknowns
One thing that makes a simple reproduction of the current clustering approach difficult is that the mixed distance matrix won't be symmetrical. That is, each sample's ONT and Illumina consensus sequences are unlikely to always be the same. Therefore A (ONT) x B (Illumina) is not the same as A (Illumina) x B (ONT). I see three possible solutions to this