Inquiry for help in setting the parameters

kaltinel commented 9 months ago

Hi,

Many thanks for your tool! I am working with nanopore signals collected from two different sources (source_1andsource_2), where I want to get their consensus. I have some questions regarding how to use it in the right way. I am currently processing them with:

# source 1 open & global
./openDBA tsv float open source_1_open 0 /dev/null 1 source_1.tsv
./openDBA tsv float global source_1_global 0 /dev/null 1 source_1.tsv

# source 2 open & global
./openDBA tsv float open source_2_open 0 /dev/null 1 source_2.tsv
./openDBA tsv float global source_2_global 0 /dev/null 1 source_2.tsv

The data shape is: c16f7a36-ddf827-49840-9683c 615 640 637 624 664 673 654 575 579 607 686 754 695 739 837

1) In each tsv file, all the reads come from the same source so they should the same reference and features, so I assume they should create only 1 cluster. However, shall I also take noise into account and increase the cluster number to 2?

2) The signals I want to get the consensus are extracted from the mid-read. Hence I am a bit confused if I should use open or global as my parameter for openness. I used open-end, and the results were very strange, which I believe was due to forcing the signals starting from an open channel, although they were extracted from the middle. Is my making-sense correct or am I way off?

3) My data is integers so would there be any harm of using floats, or does the parameter strictly be integers?

4) I am confused about the minimum uni-modal segment length for clustering. When I use 0, the read lengths are not being complained. But when I increase it to for ex.5, some reads need to be removed to have better convergence. Can you please elaborate on what does this parameter do, as my experimenting didn't teach me much.

I would be so glad to hear from your side, and many thanks for your time and help.

nodrogluap commented 9 months ago

Hi Kaitlin,

Can you tell me a bit more about what your data is and what you're trying to do? Is this DNA or RNA? Are the TSVs the full raw 5000Hz samples pulled from FAST5 files? Or the "events" (i.e. about 1:1 with the number of expected bases)? If that is the case zero is the correct value for segmentation.

Have the adapters been trimmed? You say you're looking to generate a consensus from data in the middle of the read...this probably won't work as expected with "open" unless the signals are strict subsets of each other (i.e. if the sequence is variable on either end these will muddle the consensus alignment). The typical use cases are global if all the underlying sequences are the same end-to-end, or open-end if it's direct RNA (where the 5' tail at the end of the signal will be truncated in some reads). The open mode works if all seqs are strict subsequences of the same longest seq.

With some more details we can hopefully get you going in the right direction.

kaltinel commented 9 months ago

Hi @nodrogluap ,

Many thanks for your reply. Sure, gladly. Here are some details:

I am using RNA data, which have been pulled from a slow5 file. I am usingf5c eventalign to extract certain regions of interest from each source (say, a signal index that corresponds to a nucleotide sequence of interest as AGCTAGCTAGCTAGCTA which exists in both sources). So, this region is ofcourse different for each read (for ex. it aligns to [100:120] for read_1, and [154:199] for read_2), and I use psylow5 to extract these regions from the each source slow5.
- So the tsv file has the readids and extracted pA values that corresponds to this above sequence in the signals (I try scaled and non-scaled both).
Adapters are not trimmed because i) I don't know the sequence of the direct RNA adapters, ii) I thought I am using the signal from the mid-region so there should be no adapters.
In each file (coming from source 1 or source 2), the sequences should have the same reference and the same everything, end to end, provided that my extraction is correct, ofcourse.

I highly appreciate your help.

nodrogluap commented 9 months ago

Thanks for the details. In this case, you are on the right track with float/global/0-segmentation. I would suggest increasing the cluster size to 2 or more for sure, as there are always outliers in almost any real dataset. Assuming you have at least 100 reads in each dataset, if the ensemble of reads is very homogeneous the worse that can happen is that 10-20 reads get cleaved off into small clusters or singletons and the vast majority remain in the main cluster. Note that you can run both TSV files in a single OpenDBA run by providing multiple input file args. If they are indeed statistically different, with k-means >= 2 the reads from the first TSV file should cluster together, and the reads from the second should fall in their own cluster. If they do not, you may need to check the data sanity :-)

kaltinel commented 9 months ago

Many thanks for your reply. I am glad to learn that the parameters are correct.

I will increase the cluster size to 3, to account the noise and outliers (I am using ~900 reads), however, I would not expect that they should create clusters, as when I run all 900 reads together, the opendba exits with a converged centroid, with a delta of 0. These are the last terminal outputs:

New delta is 1.19209e-07
Step 3 of 3 (round 141 of max 250 to achieve delta 0) for cluster 1/1: Converging centroid
0%        10%       20%       30%       40%       50%       60%       70%       80%       90%       100%                                                                                                  ....................................................................................................|
New delta is 0

I interpret this result as, the consensus has been found at 141st iteration with no delta - aka nothing to converge anymore-. So, why would these reads have noise or outliers?

I would be happy to learn.

nodrogluap commented 9 months ago

The DBA algorithm by its very nature will always generate a converged centroid regardless of the input. Whether that makes sense biological is another matter :-) You could imagine that if you had 900 completely random signals the centroid would be something close to a flat line. Now imagine that you had 450 reads of exactly one perfectly replicated signal (let's call it signalX) and another 450 of that same signal except that the 5th number in the series was slightly higher (let's call it sigX'). Asking for a single cluster would generate a consensus exactly the same as signalX except the 5th number in the series would be half way between the sigX and sigX' values in that position. Asking for two clusters would recapitulate sigX and sigX' as the two consensuses. With real signals where there is noise, the same principle applies.

nodrogluap commented 9 months ago

I'd strongly suggest that you use the pair_dists.txt file from the combined source_1 & source_2 run, then produce a dendrogram from it using the example code in the README.md. This can give you some idea of whether your sequence sections were extracted properly. i.e. you should have just two large subtrees based on your description.

kaltinel commented 9 months ago

Hi again,

Many thanks for your replies.

I understand the explanation. The data should cluster into two with their noises. However when I run the combined source 1 and 2 and ask for two clusters, the clusters are with 3382 and 129 members, respectively. Which should not be the case, as I am pretty sure now that I have extracted the correct signal indexes.. I would love to plot the distances but can you please elaborate what pair_dists.txt file contains (aka how to interpret it)? In addition, I am unable to see instructions or dendogram script under graphing, am I in the wrong place?

/OpenDBA/graphing$ ls
centroid_line_plot.R  diag_median_proportion.R  diagonal_median_proportion.sh  dtw_path_generation.R  plot_significant_positions.R  README.md  venn_diagrams.R  violin_plots.R
/OpenDBA/graphing$ cat README.md 
Scripts for visualizing statistical test results on aligned signal clusters.

Many thanks,

nodrogluap commented 9 months ago

Hi,

The dendrogram example is just a few lines of code in the main README.md, not a full script under graphing. The pair_dist.txt file is in the standard R format for showing the distance between each pair of sequences. If you are getting 3382 and 129 member clusters, assuming you've excised the important bit appropriately, I suspect that you have a cluster of really noisy reads, then the difference between your two "real" sets is insignificant in comparison. With real data I suggest providing some extra clusters to sink these noisy reads into.

nodrogluap commented 8 months ago

If you have any further troubles, I'd be happy to hear about it on a new ticket. Closing this one for now.

nodrogluap / OpenDBA

Inquiry for help in setting the parameters #20