Open migoox opened 6 months ago
I have analysed all datasets for number of single read sequence, since it is paired nearly 100% (there are only 0.x% singletons). These are the results:
Dataset number | Number of read (single reads) |
---|---|
1 | 2333191 |
2 | 4362717 |
3 | 3437908 |
4 | 5562671 |
5 | 4680308 |
6 | 3471659 |
7 | 91209 |
8 | 5336291 |
9 | 90887 |
10 | 5470750 |
11 | 3695449 |
12 | 88191 |
13 | 101141 |
14 | 3627143 |
15 | 93783 |
16 | 4982672 |
17 | 94913 |
18 | 853066 |
19 | 4255664 |
20 | 5528771 |
21 | 3590340 |
22 | 39 |
23 | 50 |
24 | 92863 |
25 | 2926185 |
26 | 3463991 |
27 | 97111 |
28 | 4923439 |
29 | 3409759 |
30 | 114966 |
31 | 97876 |
32 | 99354 |
22 and 23 are marked as influenza virus it is a bad data. But this is interesting information that we have less than 6 million of single reads for every dataset, which means 3 million paired end sequences. It is about 5 times smaller amount than discussed. But take into consideration that it is only on set of data - in other sets it can be far more reads.
Dataset 03, analasys with amplicon context:
Amplicon number | number with both primers from this amplicon | number with primers from different amplicon | number with a position not matching any valid amplicon primer site |
---|---|---|---|
SUM | 480757 | 13854 | 951768 |
1 | 263 | 7 | 1401 |
2 | 5681 | 62 | 11166 |
3 | 7305 | 41 | 12125 |
4 | 9409 | 114 | 12888 |
5 | 676 | 74 | 2548 |
6 | 8343 | 131 | 19841 |
7 | 10256 | 152 | 18469 |
8 | 189 | 37 | 490 |
9 | 8108 | 21 | 14137 |
10 | 2140 | 54 | 1536 |
11 | 4782 | 182 | 12551 |
12 | 10087 | 158 | 19003 |
13 | 30 | 156 | 87 |
14 | 418 | 6 | 2944 |
15 | 1322 | 21 | 1746 |
16 | 5664 | 42 | 9027 |
17 | 8055 | 117 | 13244 |
18 | 12714 | 124 | 27682 |
19 | 5479 | 109 | 10307 |
20 | 672 | 46 | 2142 |
21 | 0 | 11 | 21 |
22 | 2 | 5 | 30 |
23 | 131 | 43 | 465 |
24 | 99 | 20 | 299 |
25 | 8409 | 53 | 15257 |
26 | 2757 | 191 | 8908 |
27 | 313 | 23 | 477 |
28 | 7375 | 39 | 9069 |
29 | 108 | 108 | 473 |
30 | 707 | 31 | 1066 |
31 | 153 | 35 | 470 |
32 | 764 | 70 | 2829 |
33 | 670 | 35 | 4224 |
34 | 9801 | 133 | 18375 |
35 | 8680 | 138 | 19167 |
36 | 2933 | 143 | 5778 |
37 | 8667 | 161 | 20934 |
38 | 8391 | 639 | 21109 |
39 | 431 | 50 | 1480 |
40 | 6259 | 21 | 6391 |
41 | 2087 | 98 | 6633 |
42 | 12250 | 180 | 17079 |
43 | 1767 | 185 | 9648 |
44 | 30807 | 463 | 25442 |
45 | 87 | 17 | 257 |
46 | 5302 | 139 | 8322 |
47 | 3014 | 78 | 7775 |
48 | 8361 | 189 | 10397 |
49 | 6 | 22 | 68 |
50 | 3532 | 48 | 5974 |
51 | 59 | 98 | 230 |
52 | 17403 | 84 | 38457 |
53 | 22853 | 732 | 40094 |
54 | 15434 | 1425 | 23734 |
55 | 7014 | 362 | 19507 |
56 | 10750 | 216 | 26064 |
57 | 249 | 252 | 1172 |
58 | 6889 | 54 | 16017 |
59 | 5871 | 99 | 9632 |
60 | 138 | 69 | 487 |
61 | 7690 | 66 | 16985 |
62 | 5165 | 53 | 12533 |
63 | 2624 | 156 | 7909 |
64 | 34 | 38 | 131 |
65 | 6744 | 47 | 12597 |
66 | 5 | 53 | 48 |
67 | 5890 | 148 | 11923 |
68 | 1833 | 310 | 5782 |
69 | 2236 | 195 | 5098 |
70 | 4603 | 83 | 5793 |
71 | 398 | 86 | 1970 |
72 | 4628 | 182 | 12248 |
73 | 591 | 46 | 2576 |
74 | 4 | 45 | 99 |
75 | 1451 | 28 | 3266 |
76 | 97 | 87 | 407 |
77 | 4770 | 214 | 9623 |
78 | 7418 | 153 | 27277 |
79 | 1 | 104 | 37 |
80 | 433 | 7 | 1010 |
81 | 4889 | 55 | 7572 |
82 | 4957 | 164 | 16984 |
83 | 1636 | 41 | 3073 |
84 | 3486 | 81 | 9592 |
85 | 3648 | 123 | 6482 |
86 | 684 | 28 | 1890 |
87 | 16449 | 146 | 32118 |
88 | 489 | 225 | 2189 |
89 | 4126 | 28 | 7663 |
90 | 16 | 128 | 67 |
91 | 623 | 205 | 2996 |
92 | 8011 | 310 | 12109 |
93 | 7312 | 681 | 28748 |
94 | 8230 | 567 | 22795 |
95 | 8748 | 324 | 15677 |
96 | 5543 | 59 | 15012 |
97 | 6352 | 58 | 11866 |
98 | 14352 | 68 | 16489 |
99 | 2475 | 49 | 3989 |
In .bam
files there can be Levenstein Distance (NM) to reference written for each sequence XD.
Example row of .bam
NB501755:469:HFVM3BGXM:1:23101:18289:13608 99 MN908947.3 51 60 11S139M = 272 382 CTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGG AAAAAEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAEEEAEAEEEEEEE/EEEEEEEEEEEEEEAEEEEEEEAEEEEEE<EE< NM:i:0 MD:Z:150 MC:Z:150M AS:i:150 XS:i:0 XA:i:0
I think that AS is our sequence quality (AS - alignment score). But this is a great question for Michał, because from .sam
wikipedia:
https://en.wikipedia.org/wiki/SAM_(file_format)
It is and optional value.
Amplicon is 400 bp long on average (300-500bp)
When I was considering the problem definition, I have found out that my level of knowledge of the problem isn't enough. So I've decided to play with samtools and datafiles from Michał. There are some notes of my journey:
Samtools
samtools' github repo I've builded it from source on AlmaLinux, for now everything works like a charm.
Data files
We can group data files in 3 groups:
Paired-end sequence reads stored in
.bam
files.In our data there are 32 of these files (numbered from 1 to 32), as far as I understand each of them is a separate case for our program. Samtools is very handy to obtain some statistical data or viewing these files. There is some high abstract description (I guess this knowledge is gained through analysis):
Primers scheme in
.bed
format - it specifies amplicons' bounds.There is also
pairs.tsv
which describes which primers are paired with each other. Each primer is tagged by id (the same in both.bed
and.tsv
files) and have linked (paired) primer (list of this pairs is described in pairs.tsv) ex. "nCoV-2019_1_LEFT
is paired withnCoV-2019_1_RIGHT
". Pair of primers corresponds to one amplicon.Each primer in .bed file have:
nCoV-2019_1_LEFT
24
50
20-30
bp in our data)ex. 3 lines of .bed
IMPORTANT
.bed file seems already paired (each two primares with the same number and opposite "side" (LEFT/RIGHT), they are even next to each other). But there are also
*_alt
primers, .bed file ex.I guess alt means "alternative" primer. But don't worry
pairs.tsv
specifies which of them are paired, for this example:So we take as amplicon bounds
nCoV-2019_10_LEFT_alt
andnCoV-2019_10_RIGHT
primers. I believe we shouldn't care about it, and our reading lib would take care of it, but still it is good to know domain knowledge.Gene and protein data in
.gff3
It describes (potential?) position of genes and proteins in reference genome. This data I suppose is very useful in analysis, but should be important for us. Along with standarized id and context information about gene/virus/protein. Indices of start and end in reference genome. It starts with description of the whole virus, and then splits into gene/protein sections first is 5' primer, last is 3' primer. Example describe a lot in this files:
There is also
.fasta
file with reference genome, I suppose.