Closed iqbal-lab closed 2 years ago
This looks super helpful for casually interrogating outputs 🙏
On Wed, 10 Aug 2022 at 13:18, Zamin Iqbal @.***> wrote:
Have just chatted through the TSV format with @jeff-k (https://github.com/jeff-k), and am logging here my proposed changes to the format, which Jeff is happy with.
Current format:
31 0 30 A 453 1 454 0 0 453 11 A:453;T:1 primer_calls_ignored=0/0;total_primer_bases=453/1;unfiltered_depth=454;total=453/1;amplicon_overlap=1;amplicon_totals=453/1;amplicon_names=nCoV-2019_1_pool1
32 1 31 C 469 1 470 0 0 469 11 C:469;T:1 primer_calls_ignored=0/0;total_primer_bases=469/1;unfiltered_depth=470;total=469/1;amplicon_overlap=1;amplicon_totals=469/1;amplicon_names=nCoV-2019_1_pool1
33 2 32 C 508 1 509 0 0 508 11 C:508;T:1 primer_calls_ignored=0/0;total_primer_bases=508/1;unfiltered_depth=509;total=508/1;amplicon_overlap=1;amplicon_totals=508/1;amplicon_names=nCoV-2019_1_pool1
34 3 33 A 530 7 537 0 0 530 71 A:530;T:6;C:1 primer_calls_ignored=0/0;total_primer_bases=530/7;unfiltered_depth=537;total=530/7;amplicon_overlap=1;amplicon_totals=530/7;amplicon_names=nCoV-2019_1_pool1
35 4 34 A 532 11 543 0 0 532 11 1 A:532;G:11 primer_calls_ignored=0/0;total_primer_bases=532/11;unfiltered_depth=543;total=532/11;amplicon_overlap=1;amplicon_totals=532/11;amplicon_names=nCoV-2019_1_pool1
36 5 35 C 547 7 554 0 0 536 71 -:4;C:547;T:2;G:1 primer_calls_ignored=0/0;total_primer_bases=536/7;unfiltered_depth=554;total=547/7;amplicon_overlap=1;amplicon_totals=547/7;amplicon_names=nCoV-2019_1_pool1
Proposals:
- Add header line
- Make one column each for A,C,G,T,-
- Turn semicolon-separated fields into individual columns
- Use primer_calls_ignored, unfiltered_depth etc as column headers, so cells contain just the values
- Add a column with the reference bases
- Remove the redundant column (we have both 0-based and 1-based coords in the reference as two columns: the 30 and 31 in that first row)
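To illustrate the "individual columns" proposals, here is a minimal sketch of unpacking the two packed fields of the current format. The function names `parse_base_counts` and `parse_info` are hypothetical, not part of viridian, and the sketch assumes the fields always look like the ones in the rows above.

```python
def parse_base_counts(field):
    """Split a packed field such as 'A:530;T:6;C:1' into one count per base.

    Bases that are not listed in the field get a count of 0.
    """
    counts = {base: 0 for base in "ACGT-"}
    for item in field.split(";"):
        base, _, value = item.partition(":")
        counts[base] = int(value)
    return counts


def parse_info(field):
    """Split 'key=value;key=value' into a dict, leaving the values as strings."""
    return dict(item.split("=", 1) for item in field.split(";"))


# Fields copied from the last row of the current format shown above.
counts = parse_base_counts("-:4;C:547;T:2;G:1")
info = parse_info("primer_calls_ignored=0/0;unfiltered_depth=554;amplicon_overlap=1")
```

Values such as `0/0` are left as strings here because the thread does not spell out how each slash-separated pair should be split.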
So proposed format is:
I'm going to give two columns. The left-hand one is the proposed header text for a column in the TSV, and the right-hand one is the explanation of what it is.

| Header text | Meaning |
| --- | --- |
| Pos.ref | coordinate in the ref genome |
| Base.ref | base in the ref genome (may be -) |
| Pos.cons | corresponding (in MSA) position in the consensus |
| Base.cons | base in the consensus (may be -) |
| A | Total reads with an A at this position (raw count, no exclusion of primer bases etc) |
| C | Total reads with a C at this position (raw count, no exclusion of primer bases etc) |
| G | Total reads with a G at this position (raw count, no exclusion of primer bases etc) |
| T | Total reads with a T at this position (raw count, no exclusion of primer bases etc) |
| - | Total reads with a gap at this position (raw count, no exclusion of primer bases etc) |
| Depth | total raw count of reads at this position (no exclusion of primer bases etc) |
| Tot.cons | total reads which agree with the consensus, after applying viridian rules (e.g. don't count reads where this position is a primer, if there is an overlapping amplicon) |
| Tot.noncons | total reads which disagree with the consensus, after applying viridian rules |
| P.ignored.cons | total reads that had a primer here, which agree with the consensus, but which were ignored by viridian applying rules in primers |
| P.ignored.noncons | total reads that had a primer here, which DISagree with the consensus, but which were ignored by viridian applying rules in primers |
| P.notignored.cons | total reads that had a primer here, which agree with the consensus, and were NOT ignored by viridian applying rules in primers |
| P.cons | total reads which agree with the consensus, in a primer (no filtering, raw count) |
| P.noncons | total reads which disagree with the consensus, in a primer (no filtering, raw count) |
| AmpOv | Number of amplicons which overlap this position, after excluding dropped amplicons |
There's a flexible engine for dumping out per position stats now. Here's what the head of a random sample looks like:
Pos.ref Base.ref Pos.cons Base.cons A C G T - RawDepth Clean.Tot.cons Clean.Tot.noncons InPrimer SchemeAmpCount P.ignored.cons P.ignored.noncons P.cons P.noncons AmpOv
1 A 0 - . . . . . . . . . 0 . . . . .
2 T 0 - . . . . . . . . . 0 . . . . .
3 T 0 - . . . . . . . . . 0 . . . . .
4 A 0 - . . . . . . . . . 0 . . . . .
5 A 0 - . . . . . . . . . 0 . . . . .
6 A 0 - . . . . . . . . . 0 . . . . .
7 G 0 - . . . . . . . . . 0 . . . . .
8 G 0 - . . . . . . . . . 0 . . . . .
9 T 0 - . . . . . . . . . 0 . . . . .
10 T 0 - . . . . . . . . . 0 . . . . .
11 T 0 - . . . . . . . . . 0 . . . . .
12 A 0 - . . . . . . . . . 0 . . . . .
13 T 0 - . . . . . . . . . 0 . . . . .
14 A 0 - . . . . . . . . . 0 . . . . .
15 C 0 - . . . . . . . . . 0 . . . . .
16 C 0 - . . . . . . . . . 0 . . . . .
17 T 0 - . . . . . . . . . 0 . . . . .
18 T 0 - . . . . . . . . . 0 . . . . .
19 C 0 - . . . . . . . . . 0 . . . . .
20 C 0 - . . . . . . . . . 0 . . . . .
21 C 0 - . . . . . . . . . 1 . . . . .
22 A 0 - . . . . . . . . . 1 . . . . .
23 G 0 - . . . . . . . . . 1 . . . . .
24 G 0 - . . . . . . . . . 1 . . . . .
25 T 0 - . . . . . . . . . 1 . . . . .
25 A 1 A 967 0 0 0 0 967 967 0 967 1 0 0 967 0 1
26 A 2 A 970 0 0 0 0 971 970 1 971 1 0 0 970 1 1
27 C 3 C 0 971 0 0 0 971 971 0 971 1 0 0 971 0 1
28 A 4 A 971 0 1 0 0 972 971 1 972 1 0 0 971 1 1
29 A 5 A 972 0 0 0 0 972 972 0 972 1 0 0 972 0 1
30 A 6 A 972 0 0 0 0 972 972 0 972 1 0 0 972 0 1
31 C 7 C 2 968 0 0 2 972 968 4 972 1 0 0 968 4 1
32 C 8 C 0 971 0 1 0 972 971 1 972 1 0 0 971 1 1
33 A 9 A 972 0 0 0 0 972 972 0 972 1 0 0 972 0 1
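For casually interrogating a table like the one above, a minimal Python sketch follows. It assumes the real file is tab-separated (the columns appear space-separated here only because of how the comment was rendered) and that `.` means "no value"; `read_stats` is a hypothetical helper name, not something viridian provides.

```python
import csv
import io

HEADER = ("Pos.ref Base.ref Pos.cons Base.cons A C G T - RawDepth "
          "Clean.Tot.cons Clean.Tot.noncons InPrimer SchemeAmpCount "
          "P.ignored.cons P.ignored.noncons P.cons P.noncons AmpOv")

# Two rows copied from the sample above.
SAMPLE_ROWS = [
    "24 G 0 - . . . . . . . . . 1 . . . . .",
    "31 C 7 C 2 968 0 0 2 972 968 4 972 1 0 0 968 4 1",
]

# Columns holding bases rather than numbers.
TEXT_COLUMNS = {"Base.ref", "Base.cons"}


def read_stats(handle):
    """Yield one dict per position; '.' becomes None and counts become int."""
    for row in csv.DictReader(handle, delimiter="\t"):
        parsed = {}
        for key, value in row.items():
            if key in TEXT_COLUMNS:
                parsed[key] = value
            elif value == ".":
                parsed[key] = None
            else:
                parsed[key] = int(value)
        yield parsed


# Rebuild a tab-separated text from the space-separated sample lines.
tsv_text = "\n".join("\t".join(line.split()) for line in [HEADER] + SAMPLE_ROWS)
rows = list(read_stats(io.StringIO(tsv_text)))
```

With a real output file, `read_stats(open(path, newline=""))` would take the place of the `io.StringIO` wrapper.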
I think there's a bug in `InPrimer`? If I've lined up the columns correctly, it's got values 967, 971, 971, .... But it should be a bool, and present for every ref position, like the `SchemeAmpCount` column.
Sorry, I'd preserved the previous definition which was how many of the bases are in primers. I think the bool version can be inferred by what we have, right?
well, the spec above is
boolean for whether this position is in a primer that is seen in the data (if a primer scheme has multiple alternate primers at one end of an amplicon, those which are not seen in this read-set are ignored)
I don't know how the current count compares to this planned bool:
"boolean for whether this position is in a primer that is seen in the data (if a primer scheme has multiple alternate primers at one end of an amplicon, those which are not seen in this read-set are ignored)"
Specifically, can the count be >0 but the bool be False? E.g. does a primer being seen in only a single read result in that primer being included in the inferred primer scheme?
Also, can the count be 0 but the bool be True? E.g. when reads suggest an amplicon is present, but none of the primers are found in the reads for that amplicon, we use all primers for that amplicon.
Just instead of a bool, we have a count. So `0` instead of `False` and the count instead of `True`.
The reason the counts are not available for ref positions that aren't represented by consensus calls is because "boolean for whether this position is in a primer that is seen in the data"
So you mean 0==False, and >=1 means True
I think no one wants the bool to be True for primer positions that are never seen in the data, so maybe this works.
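The mapping discussed here (0 means False, 1 or more means True) can be sketched as below. `in_primer_flag` is a hypothetical helper, and keeping a missing value as `None` is an assumption; whether the flag should instead be filled in for every reference position is still being discussed in this thread.

```python
def in_primer_flag(in_primer_count):
    """Map the InPrimer count to a bool: 0 -> False, 1 or more -> True.

    A missing value ('.' in the TSV, represented as None here) stays None.
    """
    if in_primer_count is None:
        return None
    return in_primer_count > 0


flags = [in_primer_flag(count) for count in (None, 0, 967)]
```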
Have you changed the semantics of any of the other columns or are the other definitions literally true?
I wanted the bool to show the positions that are in primers, in the inferred primer scheme that we use for assembly. This should be present for every position in the reference genome.
That is what this is doing. Jeff is just saying that it won't mention primers that are never seen in any reads.
Yes that's right. Counting bases that are in primers is critical for my tests and was an original column. In the Slack thread Martin requested an additional bool column that's true when the count is >0, and Zam combined these two columns in this issue.
I didn't read the details when I implemented this because the only difference between Zam's columns and what we already had is the change from `int` to `bool` in this one column. Nothing else has changed.
Anyway, there's apparently an off-by-one error in the reference position column.
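In the sample above, `Pos.ref` 25 appears on two consecutive rows (once with ref base T and no consensus, once with ref base A), which is consistent with an off-by-one. A hypothetical check for this kind of problem, assuming `Pos.ref` should strictly increase from row to row, could look like this:

```python
def repeated_ref_positions(positions):
    """Return every position that is not greater than the one before it."""
    return [cur for prev, cur in zip(positions, positions[1:]) if cur <= prev]


# Pos.ref values copied from consecutive rows of the sample above.
sample_positions = [23, 24, 25, 25, 26, 27]
problems = repeated_ref_positions(sample_positions)  # -> [25]
```

The strictly increasing assumption is only a guess at the intended behaviour; if insertions relative to the reference are meant to repeat a reference coordinate, the check would need to skip rows where `Base.ref` is `-`.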
I now understand what Martin is saying, I think, which is not what I thought a couple of comments back. I'll update on Monday.