rnajena / vidhop

VIDHOP is a virus host prediction tool. It is able to predict hosts for influenza A virus, rabies lyssavirus and rotavirus A.

0.95 quantile length of all sequences #5

Closed Trin21 closed 3 years ago

Trin21 commented 3 years ago

Hi. I came across this paper, and I would like to congratulate you on such inspiring work. However, I have a small question that I would appreciate if you could clarify. In the input preparation part, it is written that "To achieve this, the length of the sequences was limited to the 0.95 quantiles of all sequence lengths by truncating the first positions or in the case of shorter sequences by extension". Could you kindly explain what this means? Also, doesn't trimming affect the results?

flomock commented 3 years ago

Hi, thanks for the question. So let me give you an example:

Let's say we have 4 sequences:

1. ATATATATATTACCGCGAGATATCGA (length = 26)
2. ATCAGAGTAATATATTACCGCGAGATATCGA (length = 31)
3. ATCAGAGTAATATATTACCGCGAGATATTTTTTTT (length = 35)
4. GGGGATCCCAGTTTTATATTACCGCGAGATATTTTTTTT (length = 39)

And let's say we don't use the 0.95 quantile but the 0.75 quantile. Then this method looks for the sequence length for which 75% of the sequences have the same or a shorter length. In this example, 3 of the 4 sequences (75%) have length 35 or less, so this would be length = 35.
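As a rough sketch of how such a cutoff length could be computed (not taken from the VIDHOP code): the helper `quantile_length` below is a hypothetical name, and it assumes the "nearest rank" quantile definition, i.e. the smallest observed length that at least a fraction `q` of the sequences do not exceed. An interpolating quantile definition can give a slightly different value on such a small example.

```python
import math


def quantile_length(lengths, q):
    """Smallest observed length such that at least a fraction q of all
    sequences have that length or a shorter one (nearest-rank quantile).

    Hypothetical helper for illustration; the quantile definition is an
    assumption, not necessarily the one used by VIDHOP.
    """
    ordered = sorted(lengths)
    # Rank of the first element that covers a fraction q of the data.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


lengths = [26, 31, 35, 39]  # lengths of the 4 example sequences
print(quantile_length(lengths, 0.75))  # 35
```

With only 4 sequences, the 0.95 quantile under this definition would simply be the longest sequence (39), which is why the example uses 0.75 instead.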

So from the 4th sequence we would cut off the first 4 nucleotides, and the shorter sequences would be extended to length 35 (sequence 3 already has length 35 and stays unchanged), resulting in:

1. ATATATATATTACCGCGAGATATCGA--------- (length = 35)
2. ATCAGAGTAATATATTACCGCGAGATATCGA---- (length = 35)
3. ATCAGAGTAATATATTACCGCGAGATATTTTTTTT (length = 35)
4. ATCCCAGTTTTATATTACCGCGAGATATTTTTTTT (length = 35)
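
The same truncation and extension can be sketched in a few lines of Python. This is only an illustration of the example above: `fit_to_length` is a hypothetical helper, and the `-` padding character is taken from the example, not necessarily what VIDHOP uses internally to extend short sequences.

```python
def fit_to_length(seq, target, pad_char="-"):
    """Bring a sequence to exactly `target` characters.

    Longer sequences lose their first positions; shorter ones are
    extended at the end with `pad_char` (an assumption for this sketch).
    """
    if len(seq) > target:
        # Drop the leading positions, keep the last `target` characters.
        return seq[len(seq) - target:]
    return seq + pad_char * (target - len(seq))


sequences = [
    "ATATATATATTACCGCGAGATATCGA",
    "ATCAGAGTAATATATTACCGCGAGATATCGA",
    "ATCAGAGTAATATATTACCGCGAGATATTTTTTTT",
    "GGGGATCCCAGTTTTATATTACCGCGAGATATTTTTTTT",
]

for seq in sequences:
    fitted = fit_to_length(seq, 35)
    print(fitted, len(fitted))
```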

Trin21 commented 3 years ago

Oh, I understood now. Thank you so much!