Question 1 from reading Biological Foundations

cphbrt commented 1 year ago

Hi! I'm a representatively ignorant software engineer, so here's what I'm left wondering after reading the Biological Foundations and skimming the rest (may revisit another day).

Confusion Area 1:

DNA has two strands which complement each other according to A<->T and C<->G. If one strand says ATGCATGC, then the mirror strand must say TACGTACG.

Thoughts:

These strands are "equivalent", right?
So it doesn't really make sense to ask "Which one do we write down when sequencing?", right?

I think/thought the answers are "yes" and "correct".

But! In Transcription, one strand is used to create the single-strand immature RNA, which is turned into the single-strand mature RNA.

If the first strand was used, the mRNA would be AUGCAUGC. If the second strand was used, the mRNA would be UACGUACG. (Or maybe swap those around, because the mRNA is made by mirroring the thing... or whatever... point is, you'd get different things depending on which strand is used to create the mRNA.)

I figure, "sure, still equivalent, technically".

But! In Translation, AUG makes the met codon and UAC makes the tyr codon. Those sound very different! (Same for CCC vs GGG making pro vs gly, etc.)

So... in some way, it does matter "which strand" of the original DNA you're looking at or operating on.

What notion am I missing?

I have another confusion area around chromosomes, haplotype, and genotype that similarly raises a "Which base in which thing is salient?" type question, but I suspect it'll be ameliorated by resolution to the above.

claymcleod commented 1 year ago

These strands are "equivalent", right?

It depends on what you mean by "equivalent": yes, you have written a sequence and its complement, and in that way, they make up a valid double-stranded piece of DNA. However, it's important to remember that strands have an orientation as well, and complementary strands are read in opposite directions. This is why we often talk about the reverse complement of a sequence rather than just its complement alone. More on this below.
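To make the complement vs. reverse complement distinction concrete, here's a minimal Python sketch using the sequence from the question. The function names are just ones I made up for illustration, and it only handles uppercase A/C/G/T.

```python
# Watson-Crick base pairing for DNA (uppercase A/C/G/T only).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Swap each base for its pairing partner, keeping the same order."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq: str) -> str:
    """Complement the sequence, then reverse it, so the result is the
    opposite strand written in its own reading direction."""
    return complement(seq)[::-1]


top = "ATGCATGC"
print(complement(top))          # TACGTACG
print(reverse_complement(top))  # GCATGCAT
```

The first output is the "mirror strand" as written in the question; the second is that same strand written in the direction it would actually be read.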

So it doesn't really make sense to ask "Which one do we write down when sequencing?", right?

For what I would consider a typical DNA library preparation and sequencing run, that is correct. We're just interested in quantifying the DNA, so we can take data from both strands. Aligners actually record which strand each read aligned to in the 0x10 flag for that read (outlined in the SAM specification).
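As a rough sketch of what checking that flag looks like: the SAM FLAG field is a bitwise integer, and the 0x10 bit means the read's sequence is stored reverse complemented, i.e. it aligned to the reverse strand. The helper name below is hypothetical, and real code would more likely use an existing SAM/BAM library than parse flags by hand.

```python
REVERSE_STRAND = 0x10  # SAM FLAG bit: SEQ is reverse complemented


def read_strand(flag: int) -> str:
    """Return '-' if the 0x10 bit is set in a SAM FLAG value, else '+'."""
    return "-" if flag & REVERSE_STRAND else "+"


print(read_strand(0))   # +
print(read_strand(16))  # -
print(read_strand(83))  # -  (83 = 0x1 + 0x2 + 0x10 + 0x40)
print(read_strand(99))  # +  (99 = 0x1 + 0x2 + 0x20 + 0x40)
```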

There are some sequencing modalities where it does matter "which one we write down", and that is accounted for in the library preparation (e.g., a protocol called "stranded RNA-Seq").

But! In Translation, AUG makes the met codon and UAC makes the tyr codon. Those sound very different! (Same for CCC vs GGG making pro vs gly, etc.)

Ah I see. I think the important bit to understand here is that the orientation changes when you flip to the other strand. So, instead of reading UACGUACG, the ribosome would read the reverse (GCAUGCAU). This turns out to be a totally different sequence rather than the situation that you're describing. Indeed, there are instances where two genes overlap one another on different strands like so:

 ---GENE A--->
 ||||||||||||
<---GENE B---
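Here's a small Python sketch of that idea, using the 8-base example from the question. It treats each strand in turn as the coding strand, reads it 5' to 3', and splits the resulting mRNA into codons. The codon table only contains the four codons this example needs, the function names are made up for illustration, and it ignores everything a real cell cares about (promoters, start and stop codons, reading frame selection).

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Partial codon table: only the codons that appear in this example.
CODONS = {"AUG": "Met", "CAU": "His", "GCA": "Ala", "UGC": "Cys"}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def to_mrna(coding_strand: str) -> str:
    """mRNA matches the coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Split into complete codons from position 0 and look each one up."""
    return [CODONS[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3)]


top = "ATGCATGC"
bottom = reverse_complement(top)  # GCATGCAT

for name, strand in (("top", top), ("bottom", bottom)):
    mrna = to_mrna(strand)
    print(name, mrna, translate(mrna))
# top AUGCAUGC ['Met', 'His']
# bottom GCAUGCAU ['Ala', 'Cys']
```

So the same stretch of double-stranded DNA yields two unrelated codon sequences depending on which strand, and therefore which direction, is used.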

I will note that, when you move from talking about DNA to RNA sequencing, the strandedness really matters. That is at least true of modern protocols; older RNA sequencing protocols did not capture which strand a transcript came from.

cphbrt commented 1 year ago

Wow! Thank you, this is a wealth of information. Notable that the practical complexities of sequencing a sample make it so far into the final data formats. I see now that the Read Mapping page describes the aligner, which determines which strand of which part of the reference genome a given sequence was likely read from.

It blows my mind that perhaps one strand reads left to right while the complementing strand may read right to left in some circumstances. That could as much as double the "information" one would have if only one strand in one direction was salient. Like writing an essay with a second essay encoded in the text in reverse.

tl;dr answer to my "Confusion Area 1": All of the above. Both strands are important. Practically, some of both are sequenced, and the aligner's job is to sort out what strand and section a read sequence belongs to, etc. And genes can be encoded on either strand, sometimes in different directions, and sometimes in different directions over the same section!

claymcleod commented 1 year ago

Notable that the practical complexities of sequencing a sample make it so far into the final data formats.

I totally agree with this.

Like writing an essay with a second essay encoded in the text in reverse.

It's wild, isn't it!

Thanks for the issue, let me know if you run across any other questions.

stjude / learngenomics.dev

Question 1 from reading Biological Foundations #13
