iqbal-lab-org / viridian

MIT License

Format changes/cleanup of the TSV file #92

Closed iqbal-lab closed 2 years ago

iqbal-lab commented 2 years ago

Have just chatted through the TSV format with @jeff-k, and am logging here my proposed changes to the format, which Jeff is happy with.

Current format:

31 0 30 A 453 1 454 0 0 453 11 A:453;T:1 primer_calls_ignored=0/0;total_primer_bases=453/1;unfiltered_depth=454;total=453/1;amplicon_overlap=1;amplicon_totals=453/1;amplicon_names=nCoV-2019_1_pool1
32 1 31 C 469 1 470 0 0 469 11 C:469;T:1 primer_calls_ignored=0/0;total_primer_bases=469/1;unfiltered_depth=470;total=469/1;amplicon_overlap=1;amplicon_totals=469/1;amplicon_names=nCoV-2019_1_pool1
33 2 32 C 508 1 509 0 0 508 11 C:508;T:1 primer_calls_ignored=0/0;total_primer_bases=508/1;unfiltered_depth=509;total=508/1;amplicon_overlap=1;amplicon_totals=508/1;amplicon_names=nCoV-2019_1_pool1
34 3 33 A 530 7 537 0 0 530 71 A:530;T:6;C:1 primer_calls_ignored=0/0;total_primer_bases=530/7;unfiltered_depth=537;total=530/7;amplicon_overlap=1;amplicon_totals=530/7;amplicon_names=nCoV-2019_1_pool1
35 4 34 A 532 11 543 0 0 532 11 1 A:532;G:11 primer_calls_ignored=0/0;total_primer_bases=532/11;unfiltered_depth=543;total=532/11;amplicon_overlap=1;amplicon_totals=532/11;amplicon_names=nCoV-2019_1_pool1
36 5 35 C 547 7 554 0 0 536 71 -:4;C:547;T:2;G:1 primer_calls_ignored=0/0;total_primer_bases=536/7;unfiltered_depth=554;total=547/7;amplicon_overlap=1;amplicon_totals=547/7;amplicon_names=nCoV-2019_1_pool1

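Proposals:

  • Add header line
  • Make one column each for A,C,G,T,-
  • Turn semicolon-separated things into individual columns
  • Use primer_calls_ignored, unfiltered_depth etc. as column headers, so cells contain just the values
  • Add a column with the reference bases
  • Remove the redundant column (we have both 0-based and 1-based coords in the reference as two columns: the 30 and 31 in that first row)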

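A rough sketch of how the semicolon-separated fields in the current format might be split into individual columns. This is not viridian's code: the function names are made up for illustration, and the field layouts are only inferred from the example rows above.

```python
# Sketch only: field layouts are inferred from the example rows in this issue.
BASES = ["A", "C", "G", "T", "-"]


def split_base_counts(field):
    """Turn e.g. 'A:530;T:6;C:1' into one count per base, defaulting to 0."""
    counts = dict.fromkeys(BASES, 0)
    for item in field.split(";"):
        base, count = item.split(":")
        counts[base] = int(count)
    return counts


def split_info(field):
    """Turn 'key=value;key=value' into a dict of column name -> value string."""
    return dict(item.split("=", 1) for item in field.split(";"))


# Base counts and a subset of the key=value fields from the last example row.
counts = split_base_counts("-:4;C:547;T:2;G:1")
info = split_info("primer_calls_ignored=0/0;total_primer_bases=536/7;unfiltered_depth=554")

print([counts[b] for b in BASES])  # [0, 547, 1, 2, 4]
print(info["unfiltered_depth"])    # 554
```

Values such as `0/0` are left as strings here, since how to split those pairs into columns is what the proposed headers are meant to pin down.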
So proposed format is:

I'm going to give two columns. The left-hand one is the proposed header text for a column in the TSV, and the right-hand one explains what it is.

Header text          Meaning
Pos.ref              coordinate in the ref genome
Base.ref             base in the ref genome (may be -)
Pos.cons             corresponding (in the MSA) position in the consensus
Base.cons            base in the consensus (may be -)
A                    total reads with an A at this position (raw count, no exclusion of primer bases etc.)
C                    total reads with a C at this position (raw count, no exclusion of primer bases etc.)
G                    total reads with a G at this position (raw count, no exclusion of primer bases etc.)
T                    total reads with a T at this position (raw count, no exclusion of primer bases etc.)
-                    total reads with a gap at this position (raw count, no exclusion of primer bases etc.)
RawDepth             total raw count of reads at this position (no exclusion of primer bases etc.)
Clean.Tot.cons       total reads which agree with the consensus, after applying viridian rules (e.g. don't count reads where this position is in a primer, if there is an overlapping amplicon); "clean" read support
Clean.Tot.noncons    total reads which disagree with the consensus, after applying viridian rules; disagreement with the consensus from "clean" reads
InPrimer             boolean for whether this position is in a primer that is seen in the data (if a primer scheme has multiple alternate primers at one end of an amplicon, those which are not seen in this read-set are ignored)
SchemeAmpCount       number of amplicons overlapping this position in the amplicon scheme (property only of the scheme, not the reads/data)
P.ignored.cons       total reads that had a primer here and agree with the consensus, but which were ignored by viridian when applying its primer rules
P.ignored.noncons    total reads that had a primer here and DISagree with the consensus, but which were ignored by viridian when applying its primer rules
P.notignored.cons    total reads that had a primer here and agree with the consensus, and were NOT ignored by viridian when applying its primer rules
P.cons               total reads which agree with the consensus, in a primer (no filtering, raw count)
P.noncons            total reads which disagree with the consensus, in a primer (no filtering, raw count)
AmpOv                number of amplicons which overlap this position, after excluding dropped amplicons

bede commented 2 years ago

This looks super helpful for casually interrogating outputs 🙏

On Wed, 10 Aug 2022 at 13:18, Zamin Iqbal wrote:

Proposals:

  • Add header line
  • Make one column each for A,C,G,T,-
  • Turn semicolon-separated things into individual columns
  • Use primer_calls_ignored, unfiltered_depth etc. as column headers, so cells contain just the values
  • Add a column with the reference bases
  • Remove the redundant column (we have both 0-based and 1-based coords in the reference as two columns: the 30 and 31 in that first row)

So proposed format is:

I'm going to give two columns. The left-hand one is the proposed header text for a column in the TSV, and the right-hand one explains what it is.

Header text          Meaning
Pos.ref              coordinate in the ref genome
Base.ref             base in the ref genome (may be -)
Pos.cons             corresponding (in the MSA) position in the consensus
Base.cons            base in the consensus (may be -)
A                    total reads with an A at this position (raw count, no exclusion of primer bases etc.)
C                    total reads with a C at this position (raw count, no exclusion of primer bases etc.)
G                    total reads with a G at this position (raw count, no exclusion of primer bases etc.)
T                    total reads with a T at this position (raw count, no exclusion of primer bases etc.)
-                    total reads with a gap at this position (raw count, no exclusion of primer bases etc.)
Depth                total raw count of reads at this position (no exclusion of primer bases etc.)
Tot.cons             total reads which agree with the consensus, after applying viridian rules (e.g. don't count reads where this position is in a primer, if there is an overlapping amplicon)
Tot.noncons          total reads which disagree with the consensus, after applying viridian rules
P.ignored.cons       total reads that had a primer here and agree with the consensus, but which were ignored by viridian when applying its primer rules
P.ignored.noncons    total reads that had a primer here and DISagree with the consensus, but which were ignored by viridian when applying its primer rules
P.notignored.cons    total reads that had a primer here and agree with the consensus, and were NOT ignored by viridian when applying its primer rules
P.cons               total reads which agree with the consensus, in a primer (no filtering, raw count)
P.noncons            total reads which disagree with the consensus, in a primer (no filtering, raw count)
AmpOv                number of amplicons which overlap this position, after excluding dropped amplicons

jeff-k commented 2 years ago

There's a flexible engine for dumping out per position stats now. Here's what the head of a random sample looks like:

Pos.ref Base.ref    Pos.cons    Base.cons   A   C   G   T   -   RawDepth    Clean.Tot.cons  Clean.Tot.noncons   InPrimer    SchemeAmpCount  P.ignored.cons  P.ignored.noncons   P.cons  P.noncons   AmpOv
1   A   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
2   T   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
3   T   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
4   A   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
5   A   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
6   A   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
7   G   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
8   G   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
9   T   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
10  T   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
11  T   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
12  A   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
13  T   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
14  A   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
15  C   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
16  C   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
17  T   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
18  T   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
19  C   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
20  C   0   -   .   .   .   .   .   .   .   .   .   0   .   .   .   .   .
21  C   0   -   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .
22  A   0   -   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .
23  G   0   -   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .
24  G   0   -   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .
25  T   0   -   .   .   .   .   .   .   .   .   .   1   .   .   .   .   .
25  A   1   A   967 0   0   0   0   967 967 0   967 1   0   0   967 0   1
26  A   2   A   970 0   0   0   0   971 970 1   971 1   0   0   970 1   1
27  C   3   C   0   971 0   0   0   971 971 0   971 1   0   0   971 0   1
28  A   4   A   971 0   1   0   0   972 971 1   972 1   0   0   971 1   1
29  A   5   A   972 0   0   0   0   972 972 0   972 1   0   0   972 0   1
30  A   6   A   972 0   0   0   0   972 972 0   972 1   0   0   972 0   1
31  C   7   C   2   968 0   0   2   972 968 4   972 1   0   0   968 4   1
32  C   8   C   0   971 0   1   0   972 971 1   972 1   0   0   971 1   1
33  A   9   A   972 0   0   0   0   972 972 0   972 1   0   0   972 0   1

martinghunt commented 2 years ago

I think there's a bug in InPrimer? If I've lined up the columns correctly, it's got values 967, 971, 971, .... But it should be a bool, and present for every ref position, like the SchemeAmpCount column.

jeff-k commented 2 years ago

Sorry, I'd preserved the previous definition, which was how many of the bases are in primers. I think the bool version can be inferred from what we have, right?

iqbal-lab commented 2 years ago

well, the spec above is

boolean for whether this position is in a primer that is seen in the data (if a primer scheme has multiple alternate primers at one end of an amplicon, those which are not seen in this read-set are ignored)

martinghunt commented 2 years ago

I don't know how the current count compares to this planned bool:

"boolean for whether this position is in a primer that is seen in the data (if a primer scheme has multiple alternate primers at one end of an amplicon, those which are not seen in this read-set are ignored)"

Specifically, can the count be >0 but the bool be False? E.g. does a primer being seen in only a single read result in that primer being included in the inferred primer scheme?

Also, can the count be 0 but the bool be True? E.g. when reads suggest an amplicon is present, but none of the primers are found in the reads for that amplicon, we use all primers for that amplicon.

jeff-k commented 2 years ago

Just instead of a bool, we have a count. So 0 instead of False and the count instead of True.

The reason the counts are not available for ref positions that aren't represented by consensus calls is that the spec says "boolean for whether this position is in a primer that is seen in the data".

iqbal-lab commented 2 years ago

So you mean 0==False, and >=1 means True
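A minimal sketch of that reading, assuming the InPrimer cell holds either a count or "." (as in the sample output). The function name is hypothetical, and returning None for "." is just one way to flag positions where no count was reported, which is the case still being discussed in this thread.

```python
def in_primer_flag(cell):
    """Map an InPrimer count cell to a bool, or None when the cell is '.'."""
    if cell == ".":
        return None  # no count reported for this position
    return int(cell) > 0


print([in_primer_flag(c) for c in [".", "0", "967"]])  # [None, False, True]
```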

iqbal-lab commented 2 years ago

I think no one wants the bool to be one for primer positions that are never seen in the data, so maybe this works

iqbal-lab commented 2 years ago

Have you changed the semantics of any of the other columns or are the other definitions literally true?

martinghunt commented 2 years ago

I wanted the bool to show the positions that are in primers, in the inferred primer scheme that we use for assembly. This should be present for every position in the reference genome.

iqbal-lab commented 2 years ago

That is what this is doing. Jeff is just saying that it won't mention primers that are never seen in any reads.

jeff-k commented 2 years ago

Yes, that's right. Counting bases that are in primers is critical for my tests and was an original column. In the Slack thread Martin requested an additional bool column that's true when the count is >0, and Zam combined these two columns in this issue.

I didn't read the details when I implemented this because the only difference between Zam's columns and what we already had is the change from int to bool in this one column. Nothing else has changed.

Anyway, there's apparently an off-by-one error in the reference position column.
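One quick way to look for this kind of problem is to check that Pos.ref goes up by exactly one per row. The sketch below assumes one row per reference position (no extra rows for insertions), which may not hold for every sample; the function name is made up. The sample earlier in the thread lists Pos.ref 25 twice, which may or may not be related.

```python
def find_position_jumps(positions):
    """Return (index, previous, current) wherever current != previous + 1."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(positions, positions[1:]), start=1)
        if cur != prev + 1
    ]


# Pos.ref values around the repeated 25 in the sample output above.
pos_ref = [23, 24, 25, 25, 26, 27]
print(find_position_jumps(pos_ref))  # [(3, 25, 25)]
```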

iqbal-lab commented 2 years ago

I now understand what Martin is saying, I think, which is not what I thought a couple of comments back. I'll update on Monday.