migoox / genome-downsampler

Other
2 stars 0 forks source link

Practical description of the problem #1

Open migoox opened 6 months ago

mytkom commented 6 months ago

When I was considering the problem definition, I have found out that my level of knowledge of the problem isn't enough. So I've decided to play with samtools and datafiles from Michał. There are some notes of my journey:

Samtools

samtools' github repo I've builded it from source on AlmaLinux, for now everything works like a charm.

Data files

We can group data files in 3 groups:

Paired-end sequence reads stored in .bam files.

In our data there are 32 of these files (numbered from 1 to 32), as far as I understand each of them is a separate case for our program. Samtools is very handy to obtain some statistical data or viewing these files. There is some high abstract description (I guess this knowledge is gained through analysis): samples

Primers scheme in .bed format - it specifies amplicons' bounds.

There is also pairs.tsv which describes which primers are paired with each other. Each primer is tagged by id (the same in both .bed and .tsv files) and have linked (paired) primer (list of this pairs is described in pairs.tsv) ex. "nCoV-2019_1_LEFT is paired with nCoV-2019_1_RIGHT". Pair of primers corresponds to one amplicon.

Each primer in .bed file have:

ex. 3 lines of .bed

MN908947.3  24  50  nCoV-2019_1_LEFT    1   +   AACAAACCAACCAACTTTCGATCTC
MN908947.3  408 432 nCoV-2019_1_RIGHT   1   -   CTTCTACTAAGCCACAAGTGCCA
MN908947.3  323 344 nCoV-2019_2_LEFT    2   +   TTTACAGGTTCGCGACGTGC
...

IMPORTANT

.bed file seems already paired (each two primares with the same number and opposite "side" (LEFT/RIGHT), they are even next to each other). But there are also *_alt primers, .bed file ex.

...
MN908947.3  2482    2508    nCoV-2019_9_LEFT    1   +   TCTTCTTAGAGGGAGAAACACTTCC
MN908947.3  2861    2886    nCoV-2019_9_RIGHT   1   -   CACAGGCGAACTCATTTACTTCTG

// *** look at these four
MN908947.3  2825    2850    nCoV-2019_10_LEFT   2   +   TGAGAAGTGCTCTGCCTATACAGT
MN908947.3  2779    2813    nCoV-2019_10_LEFT_alt   2   +   TGAATATCACTTTTGAACTTGATGAAAGGATTG
MN908947.3  3183    3211    nCoV-2019_10_RIGHT  2   -   TCATCTAACCAATCTTCTTCTTGCTCT
MN908947.3  3156    3178    nCoV-2019_10_RIGHT_alt  2   -   GGTTGAAGAGCAGCAGAAGTG
// ***

MN908947.3  3077    3102    nCoV-2019_11_LEFT   1   +   AGAAGAGTTTGAGCCATCAACTCA
MN908947.3  3470    3493    nCoV-2019_11_RIGHT  1   -   TTTAAGGCTCCTGCAACACCTC
...

I guess alt means "alternative" primer. But don't worry pairs.tsv specifies which of them are paired, for this example:

...
nCoV-2019_9_LEFT    nCoV-2019_9_RIGHT
nCoV-2019_10_LEFT_alt   nCoV-2019_10_RIGHT
nCoV-2019_11_LEFT   nCoV-2019_11_RIGHT
...

So we take as amplicon bounds nCoV-2019_10_LEFT_alt and nCoV-2019_10_RIGHT primers. I believe we shouldn't care about it, and our reading lib would take care of it, but still it is good to know domain knowledge.

Gene and protein data in .gff3

It describes (potential?) position of genes and proteins in reference genome. This data I suppose is very useful in analysis, but should be important for us. Along with standarized id and context information about gene/virus/protein. Indices of start and end in reference genome. It starts with description of the whole virus, and then splits into gene/protein sections first is 5' primer, last is 3' primer. Example describe a lot in this files:

MN908947.3  Genbank region  1   29903   .   +   .   ID=MN908947.3:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gb-acronym=SARS-CoV-2;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Wuhan seafood market pneumonia virus
MN908947.3  Genbank five_prime_UTR  1   265 .   +   .   ID=id-MN908947.3:1..265;gbkey=5'UTR
MN908947.3  Genbank gene    266 21555   .   +   .   ID=gene-orf1ab;Name=orf1ab;gbkey=Gene;gene=orf1ab;gene_biotype=protein_coding
MN908947.3  Genbank CDS 266 13468   .   +   0   ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
MN908947.3  Genbank CDS 13468   21555   .   +   0   ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
MN908947.3  Genbank gene    21563   25384   .   +   .   ID=gene-S;Name=S;gbkey=Gene;gene=S;gene_biotype=protein_coding
MN908947.3  Genbank CDS 21563   25384   .   +   0   ID=cds-QHD43416.1;Parent=gene-S;Dbxref=NCBI_GP:QHD43416.1;Name=QHD43416.1;Note=structural protein;gbkey=CDS;gene=S;product=surface glycoprotein;protein_id=QHD43416.1
...
MN908947.3  Genbank gene    29558   29674   .   +   .   ID=gene-ORF10;Name=ORF10;gbkey=Gene;gene=ORF10;gene_biotype=protein_coding
MN908947.3  Genbank CDS 29558   29674   .   +   0   ID=cds-QHI42199.1;Parent=gene-ORF10;Dbxref=NCBI_GP:QHI42199.1;Name=QHI42199.1;gbkey=CDS;gene=ORF10;product=ORF10 protein;protein_id=QHI42199.1
MN908947.3  Genbank three_prime_UTR 29675   29903   .   +   .   ID=id-MN908947.3:29675..29903;gbkey=3'UTR

There is also .fasta file with reference genome, I suppose.

mytkom commented 6 months ago

I have analysed all datasets for number of single read sequence, since it is paired nearly 100% (there are only 0.x% singletons). These are the results:

Dataset number Number of read (single reads)
1 2333191
2 4362717
3 3437908
4 5562671
5 4680308
6 3471659
7 91209
8 5336291
9 90887
10 5470750
11 3695449
12 88191
13 101141
14 3627143
15 93783
16 4982672
17 94913
18 853066
19 4255664
20 5528771
21 3590340
22 39
23 50
24 92863
25 2926185
26 3463991
27 97111
28 4923439
29 3409759
30 114966
31 97876
32 99354

22 and 23 are marked as influenza virus it is a bad data. But this is interesting information that we have less than 6 million of single reads for every dataset, which means 3 million paired end sequences. It is about 5 times smaller amount than discussed. But take into consideration that it is only on set of data - in other sets it can be far more reads.

mytkom commented 6 months ago

Dataset 03, analasys with amplicon context:

Amplicon number number with both primers from this amplicon number with primers from different amplicon number with a position not matching any valid amplicon primer site
SUM 480757 13854 951768
1 263 7 1401
2 5681 62 11166
3 7305 41 12125
4 9409 114 12888
5 676 74 2548
6 8343 131 19841
7 10256 152 18469
8 189 37 490
9 8108 21 14137
10 2140 54 1536
11 4782 182 12551
12 10087 158 19003
13 30 156 87
14 418 6 2944
15 1322 21 1746
16 5664 42 9027
17 8055 117 13244
18 12714 124 27682
19 5479 109 10307
20 672 46 2142
21 0 11 21
22 2 5 30
23 131 43 465
24 99 20 299
25 8409 53 15257
26 2757 191 8908
27 313 23 477
28 7375 39 9069
29 108 108 473
30 707 31 1066
31 153 35 470
32 764 70 2829
33 670 35 4224
34 9801 133 18375
35 8680 138 19167
36 2933 143 5778
37 8667 161 20934
38 8391 639 21109
39 431 50 1480
40 6259 21 6391
41 2087 98 6633
42 12250 180 17079
43 1767 185 9648
44 30807 463 25442
45 87 17 257
46 5302 139 8322
47 3014 78 7775
48 8361 189 10397
49 6 22 68
50 3532 48 5974
51 59 98 230
52 17403 84 38457
53 22853 732 40094
54 15434 1425 23734
55 7014 362 19507
56 10750 216 26064
57 249 252 1172
58 6889 54 16017
59 5871 99 9632
60 138 69 487
61 7690 66 16985
62 5165 53 12533
63 2624 156 7909
64 34 38 131
65 6744 47 12597
66 5 53 48
67 5890 148 11923
68 1833 310 5782
69 2236 195 5098
70 4603 83 5793
71 398 86 1970
72 4628 182 12248
73 591 46 2576
74 4 45 99
75 1451 28 3266
76 97 87 407
77 4770 214 9623
78 7418 153 27277
79 1 104 37
80 433 7 1010
81 4889 55 7572
82 4957 164 16984
83 1636 41 3073
84 3486 81 9592
85 3648 123 6482
86 684 28 1890
87 16449 146 32118
88 489 225 2189
89 4126 28 7663
90 16 128 67
91 623 205 2996
92 8011 310 12109
93 7312 681 28748
94 8230 567 22795
95 8748 324 15677
96 5543 59 15012
97 6352 58 11866
98 14352 68 16489
99 2475 49 3989
mytkom commented 6 months ago

In .bam files there can be Levenstein Distance (NM) to reference written for each sequence XD.

Example row of .bam

NB501755:469:HFVM3BGXM:1:23101:18289:13608  99  MN908947.3  51  60  11S139M =   272 382 CTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGG  AAAAAEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAEEEAEAEEEEEEE/EEEEEEEEEEEEEEAEEEEEEEAEEEEEE<EE<  NM:i:0  MD:Z:150    MC:Z:150M   AS:i:150    XS:i:0  XA:i:0

I think that AS is our sequence quality (AS - alignment score). But this is a great question for Michał, because from .sam wikipedia: https://en.wikipedia.org/wiki/SAM_(file_format) It is and optional value.

mytkom commented 6 months ago

Amplicon is 400 bp long on average (300-500bp)