Practical description of the problem

mytkom commented 6 months ago

When I was considering the problem definition, I have found out that my level of knowledge of the problem isn't enough. So I've decided to play with samtools and datafiles from Michał. There are some notes of my journey:

Samtools

samtools' github repo I've builded it from source on AlmaLinux, for now everything works like a charm.

Data files

We can group data files in 3 groups:

Paired-end sequence reads stored in `.bam` files.

In our data there are 32 of these files (numbered from 1 to 32), as far as I understand each of them is a separate case for our program. Samtools is very handy to obtain some statistical data or viewing these files. There is some high abstract description (I guess this knowledge is gained through analysis): samples

Primers scheme in `.bed` format - it specifies amplicons' bounds.

There is also pairs.tsv which describes which primers are paired with each other. Each primer is tagged by id (the same in both .bed and .tsv files) and have linked (paired) primer (list of this pairs is described in pairs.tsv) ex. "nCoV-2019_1_LEFT is paired with nCoV-2019_1_RIGHT". Pair of primers corresponds to one amplicon.

Each primer in .bed file have:

id ex. nCoV-2019_1_LEFT
start and end index of reference genome ex. 24 50
base pair sequence of primer (length of 20-30 bp in our data)
either "+" or "-" sign (as far as I am concerned left bound of amplicon is "+", right is "-")
some integer either 1 or 2 - i don't know what it means

ex. 3 lines of .bed

MN908947.3  24  50  nCoV-2019_1_LEFT    1   +   AACAAACCAACCAACTTTCGATCTC
MN908947.3  408 432 nCoV-2019_1_RIGHT   1   -   CTTCTACTAAGCCACAAGTGCCA
MN908947.3  323 344 nCoV-2019_2_LEFT    2   +   TTTACAGGTTCGCGACGTGC
...

IMPORTANT

.bed file seems already paired (each two primares with the same number and opposite "side" (LEFT/RIGHT), they are even next to each other). But there are also *_alt primers, .bed file ex.

...
MN908947.3  2482    2508    nCoV-2019_9_LEFT    1   +   TCTTCTTAGAGGGAGAAACACTTCC
MN908947.3  2861    2886    nCoV-2019_9_RIGHT   1   -   CACAGGCGAACTCATTTACTTCTG

// *** look at these four
MN908947.3  2825    2850    nCoV-2019_10_LEFT   2   +   TGAGAAGTGCTCTGCCTATACAGT
MN908947.3  2779    2813    nCoV-2019_10_LEFT_alt   2   +   TGAATATCACTTTTGAACTTGATGAAAGGATTG
MN908947.3  3183    3211    nCoV-2019_10_RIGHT  2   -   TCATCTAACCAATCTTCTTCTTGCTCT
MN908947.3  3156    3178    nCoV-2019_10_RIGHT_alt  2   -   GGTTGAAGAGCAGCAGAAGTG
// ***

MN908947.3  3077    3102    nCoV-2019_11_LEFT   1   +   AGAAGAGTTTGAGCCATCAACTCA
MN908947.3  3470    3493    nCoV-2019_11_RIGHT  1   -   TTTAAGGCTCCTGCAACACCTC
...

I guess alt means "alternative" primer. But don't worry pairs.tsv specifies which of them are paired, for this example:

...
nCoV-2019_9_LEFT    nCoV-2019_9_RIGHT
nCoV-2019_10_LEFT_alt   nCoV-2019_10_RIGHT
nCoV-2019_11_LEFT   nCoV-2019_11_RIGHT
...

So we take as amplicon bounds nCoV-2019_10_LEFT_alt and nCoV-2019_10_RIGHT primers. I believe we shouldn't care about it, and our reading lib would take care of it, but still it is good to know domain knowledge.

Gene and protein data in `.gff3`

It describes (potential?) position of genes and proteins in reference genome. This data I suppose is very useful in analysis, but should be important for us. Along with standarized id and context information about gene/virus/protein. Indices of start and end in reference genome. It starts with description of the whole virus, and then splits into gene/protein sections first is 5' primer, last is 3' primer. Example describe a lot in this files:

MN908947.3  Genbank region  1   29903   .   +   .   ID=MN908947.3:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gb-acronym=SARS-CoV-2;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Wuhan seafood market pneumonia virus
MN908947.3  Genbank five_prime_UTR  1   265 .   +   .   ID=id-MN908947.3:1..265;gbkey=5'UTR
MN908947.3  Genbank gene    266 21555   .   +   .   ID=gene-orf1ab;Name=orf1ab;gbkey=Gene;gene=orf1ab;gene_biotype=protein_coding
MN908947.3  Genbank CDS 266 13468   .   +   0   ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
MN908947.3  Genbank CDS 13468   21555   .   +   0   ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
MN908947.3  Genbank gene    21563   25384   .   +   .   ID=gene-S;Name=S;gbkey=Gene;gene=S;gene_biotype=protein_coding
MN908947.3  Genbank CDS 21563   25384   .   +   0   ID=cds-QHD43416.1;Parent=gene-S;Dbxref=NCBI_GP:QHD43416.1;Name=QHD43416.1;Note=structural protein;gbkey=CDS;gene=S;product=surface glycoprotein;protein_id=QHD43416.1
...
MN908947.3  Genbank gene    29558   29674   .   +   .   ID=gene-ORF10;Name=ORF10;gbkey=Gene;gene=ORF10;gene_biotype=protein_coding
MN908947.3  Genbank CDS 29558   29674   .   +   0   ID=cds-QHI42199.1;Parent=gene-ORF10;Dbxref=NCBI_GP:QHI42199.1;Name=QHI42199.1;gbkey=CDS;gene=ORF10;product=ORF10 protein;protein_id=QHI42199.1
MN908947.3  Genbank three_prime_UTR 29675   29903   .   +   .   ID=id-MN908947.3:29675..29903;gbkey=3'UTR

There is also .fasta file with reference genome, I suppose.

mytkom commented 6 months ago

I have analysed all datasets for number of single read sequence, since it is paired nearly 100% (there are only 0.x% singletons). These are the results:

Dataset number	Number of read (single reads)
1	2333191
2	4362717
3	3437908
4	5562671
5	4680308
6	3471659
7	91209
8	5336291
9	90887
10	5470750
11	3695449
12	88191
13	101141
14	3627143
15	93783
16	4982672
17	94913
18	853066
19	4255664
20	5528771
21	3590340
22	39
23	50
24	92863
25	2926185
26	3463991
27	97111
28	4923439
29	3409759
30	114966
31	97876
32	99354

22 and 23 are marked as influenza virus it is a bad data. But this is interesting information that we have less than 6 million of single reads for every dataset, which means 3 million paired end sequences. It is about 5 times smaller amount than discussed. But take into consideration that it is only on set of data - in other sets it can be far more reads.

mytkom commented 6 months ago

Dataset 03, analasys with amplicon context:

Amplicon number	number with both primers from this amplicon	number with primers from different amplicon	number with a position not matching any valid amplicon primer site
SUM	480757	13854	951768
1	263	7	1401
2	5681	62	11166
3	7305	41	12125
4	9409	114	12888
5	676	74	2548
6	8343	131	19841
7	10256	152	18469
8	189	37	490
9	8108	21	14137
10	2140	54	1536
11	4782	182	12551
12	10087	158	19003
13	30	156	87
14	418	6	2944
15	1322	21	1746
16	5664	42	9027
17	8055	117	13244
18	12714	124	27682
19	5479	109	10307
20	672	46	2142
21	0	11	21
22	2	5	30
23	131	43	465
24	99	20	299
25	8409	53	15257
26	2757	191	8908
27	313	23	477
28	7375	39	9069
29	108	108	473
30	707	31	1066
31	153	35	470
32	764	70	2829
33	670	35	4224
34	9801	133	18375
35	8680	138	19167
36	2933	143	5778
37	8667	161	20934
38	8391	639	21109
39	431	50	1480
40	6259	21	6391
41	2087	98	6633
42	12250	180	17079
43	1767	185	9648
44	30807	463	25442
45	87	17	257
46	5302	139	8322
47	3014	78	7775
48	8361	189	10397
49	6	22	68
50	3532	48	5974
51	59	98	230
52	17403	84	38457
53	22853	732	40094
54	15434	1425	23734
55	7014	362	19507
56	10750	216	26064
57	249	252	1172
58	6889	54	16017
59	5871	99	9632
60	138	69	487
61	7690	66	16985
62	5165	53	12533
63	2624	156	7909
64	34	38	131
65	6744	47	12597
66	5	53	48
67	5890	148	11923
68	1833	310	5782
69	2236	195	5098
70	4603	83	5793
71	398	86	1970
72	4628	182	12248
73	591	46	2576
74	4	45	99
75	1451	28	3266
76	97	87	407
77	4770	214	9623
78	7418	153	27277
79	1	104	37
80	433	7	1010
81	4889	55	7572
82	4957	164	16984
83	1636	41	3073
84	3486	81	9592
85	3648	123	6482
86	684	28	1890
87	16449	146	32118
88	489	225	2189
89	4126	28	7663
90	16	128	67
91	623	205	2996
92	8011	310	12109
93	7312	681	28748
94	8230	567	22795
95	8748	324	15677
96	5543	59	15012
97	6352	58	11866
98	14352	68	16489
99	2475	49	3989

mytkom commented 6 months ago

In .bam files there can be Levenstein Distance (NM) to reference written for each sequence XD.

Example row of .bam

NB501755:469:HFVM3BGXM:1:23101:18289:13608  99  MN908947.3  51  60  11S139M =   272 382 CTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGG  AAAAAEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAEEEAEAEEEEEEE/EEEEEEEEEEEEEEAEEEEEEEAEEEEEE<EE<  NM:i:0  MD:Z:150    MC:Z:150M   AS:i:150    XS:i:0  XA:i:0

I think that AS is our sequence quality (AS - alignment score). But this is a great question for Michał, because from .sam wikipedia: https://en.wikipedia.org/wiki/SAM_(file_format) It is and optional value.

mytkom commented 6 months ago

Amplicon is 400 bp long on average (300-500bp)

migoox / genome-downsampler