t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0

Implement mixture model compatibility in full pipeline #155

isaacvock commented 6 months ago

Fully addresses #153 and extends the work done in #154.

Tests are looking good. I ended up analyzing a FASTQ file from the "classic" BRD4-MYC paper to test the new pipeline. I confirmed that it works and that the fraction new (or new-to-total ratio (NTR)) estimates are as expected given a 1 hour feed (i.e., a little under 10% labeled, or a logit(NTR) of < -2.5):

[plot: SLAMDUNK_EZbakR_fraction_news]
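
(For reference, the logit here is the standard log-odds transform; a quick standalone sketch checking those numbers:)

    import math

    def logit(p):
        """Log-odds of a proportion p."""
        return math.log(p / (1 - p))

    # logit(NTR) = -2.5 corresponds to NTR = 1 / (1 + e^2.5),
    # i.e. a bit under 10% of reads new after the 1 hour feed
    print(1 / (1 + math.exp(2.5)))  # ~0.076
    print(logit(0.076))             # ~-2.50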

The same distribution obtained from GRAND-SLAM:

[plot: GRAND_SLAM_fraction_news]

In addition (a bit of a niche analysis, but nice to see), transcripts from the X chromosome are on average more stable (lower NTR) than those on other chromosomes. This was originally published here, and it is something I have confirmed in multiple independent datasets collected in human cell lines:

[plot: SLAMDUNK_EZbakR_fraction_news_Xchrstratify]

Work left to do:

isaacvock commented 6 months ago

GRAND-SLAM vs. SLAMDUNK + bakR fraction new comparison looks about as correlated as I hoped:

I also compared the information content of the standard tcount.tsv file to the new cB file. Read counts are identical, which is good to see:

but the raw ConversionRates are sometimes slightly different, with the _tcount.tsv file always having the same or a higher ConversionRate than the new cB file:

Looking at the tables reveals that this is due to there being slightly more Ts in the cB than in the _tcount file:

The problem was that I was using the T count after it had been incremented by the number of mutations:

    # Per-read counters: testN tracks T coverage, testk tracks T>C conversions
    testN = read.getTcount()
    testk = 0
    for mismatch in read.mismatches:
        # Only consider mismatches that fall within the annotated UTR
        if(mismatch.referencePosition >= 0 and mismatch.referencePosition < utr.getLength()):
            if(mismatch.isT(read.direction == ReadDirection.Reverse)):
                testN += 1
            if(mismatch.isTCMismatch(read.direction == ReadDirection.Reverse)):
                testk += 1

    if (makeCB):

        # Tally reads by their (conversion count, T count) pair.
        # Bug: testN has already been incremented by the mismatch loop above
        key = (testk, testN)

        if key in cB:
            cB[key] += 1
        else:
            cB[key] = 1

so I changed it accordingly to:

    ...

    if (makeCB):

        # Use the raw T count from the read, before any mismatch increments
        key = (testk, read.getTcount())

        if key in cB:
            cB[key] += 1
        else:
            cB[key] = 1

t-neumann commented 6 months ago

Sorry for the delay in replying, I was on a course the whole week and will check it by next week.

isaacvock commented 6 months ago

It's no problem, I hope the course went well.

I am actually still dealing with a discrepancy between the total conversionRate reported in the _tcount.tsv and the one computed from the cB file. The change I described in my last message has a minimal impact on the T coverage and thus on the conversionRates, so it is still the case that the conversionRate computed from the data in the cB file is often lower than that reported in the _tcount.tsv. I need to spend some time working through the details of computeTconversions() more carefully, but if you have any insights as to the source of the discrepancy, they would be much appreciated.

To be clear, I am getting each read's T count from the getTcount() method in the readInRegion class, and the total mutation counts in the cB exactly match the counts reported in the _tcount.tsv.

t-neumann commented 6 months ago

Hi @isaacvock

so, to get this right: you are getting identical T>C conversions for each read from my code and your new cB code, but the T count is off?

isaacvock commented 6 months ago

Exactly, with the new code often counting more Ts (see the very last plot in my most recent long message).

t-neumann commented 6 months ago

Hm, that's a bit odd, because to get the Ts/As for a given read we really just take the read.sequence property from the pysam read object and count the Ts, so there should be no hidden layer in there that could deviate.

    def getTcount(self):
        # Reads from minus-strand UTRs are handled on the genomic plus strand,
        # where a transcript T shows up as an A, so count As instead
        if(self.direction == ReadDirection.Reverse):
            return self.sequence.count("a") + self.sequence.count("A")
        else:
            return self.sequence.count("t") + self.sequence.count("T")

How are you counting the Ts?

isaacvock commented 6 months ago

Yeah, though I am technically comparing the coverageOnTs column of the _tcount.tsv output to the total number of Ts counted for each UTR in the cB file (so, using dplyr syntax: `cB %>% group_by(UTR) %>% summarise(cB_CoverageOnTs = sum(nT * n))`, where `UTR` is shorthand for all of the UTR specification columns). When I say the mutational content is identical, I mean that conversionsOnTs is equivalent to the same summarization of the cB file but for `sum(TC * n)`.
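
For reference, the same summarization can be sketched in Python with pandas (a minimal sketch: the file name is hypothetical, the column names nT, TC, and n follow my working convention, and I collapse the UTR specification columns into a single UTR key for simplicity):

    import pandas as pd

    # One cB row per (UTR, nT, TC) combination; n = number of reads with it
    cB = pd.read_csv("cB.tsv", sep="\t")

    # Read-weighted totals per UTR, mirroring the dplyr summarise above
    cB["covT"] = cB["nT"] * cB["n"]    # total covered Ts
    cB["convT"] = cB["TC"] * cB["n"]   # total T>C conversions
    summary = cB.groupby("UTR")[["covT", "convT"]].sum().reset_index()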

Maybe I am just misunderstanding the coverageOnTs output though. Would you expect this to be equal to the sum of all of the T counts in all of the reads from a given UTR?

t-neumann commented 6 months ago

Maybe here is the confusion: coverageOnTs should be the number of times a reference T is covered by a read. That does not imply that the read itself has a T at this position; it could also be a T>C conversion (or any other mutation).
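
A toy illustration of that definition (my own sketch, not slamdunk code):

    reference = "TTAT"
    read      = "TCAT"  # middle reference T converted to C

    # Every covered reference T counts, whatever base the read shows there
    coverageOnTs = sum(1 for base in reference if base == "T")  # 3

    # Only reference-T positions where the read shows a C
    conversionsOnTs = sum(1 for ref, q in zip(reference, read)
                          if ref == "T" and q == "C")           # 1

    # Counting Ts in the read alone misses the converted position
    readTs = read.count("T")                                    # 2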

isaacvock commented 6 months ago

Fair point, and that is definitely a bug in my current code that I will fix (i.e., I want the nT column of the cB to track the number of Us in the original RNA prior to chemical conversion). But barring lots of A/C/G > T conversions in the read, that should cause me to undercount the number of Ts in reads, not overcount it.

I think the problem is that the way I am counting mutations and the ways you calculate coverageOnTs and conversionsOnTs all filter out nucleotides in the read that do not lie completely within the annotated UTR. In other words, there is always some if statement of the form `if(position >= 0 and position < utr.getLength())` that needs to be true in order for the relevant counter to be incremented. For counting Ts, however, I am just counting the number of Ts in each read, regardless of whether they fall outside of the annotated UTR bounds.

So I'll work on fixing both what I am counting (the number of Ts that would have existed had there been no mismatches) and when I am counting it (not counting nucleotides that do not align to the annotated UTR region).

isaacvock commented 6 months ago

Actually, I am now thinking of just counting the total number of covered reference Ts and the total number of T-to-C mutations in each read, without removing counts for nucleotides that align outside of the annotated UTR. The relevant code to get the TC and nT columns of the cB is then:

    for read in readIterator:

        # Total number of T-to-C mutations in the read
        totalTCcnt = read.tcCount

        # Total number of reference Ts overlapped (minus those mutated to something other than C)
        totalTcnt = read.getTcount()

        ...

        # Increment nT count by A/C/G/N > T mismatches
        for mismatch in read.mismatches:
            if(mismatch.referencePosition >= 0 and mismatch.referencePosition < utr.getLength()):
                if(mismatch.isT(read.direction == ReadDirection.Reverse)):
                    ...
                    totalTcnt += 1
                    ...
            else:
                if(mismatch.isT(read.direction == ReadDirection.Reverse)):
                    totalTcnt += 1

        # Increment TC and nT counts for cB
        if (makeCB):

            key = (totalTCcnt, totalTcnt)

            if key in cB:
                cB[key] += 1
            else:
                cB[key] = 1
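
(As a purely hypothetical downstream check, not part of the change itself, the UTR-level conversion rate implied by this dict would then be the read-weighted ratio discussed above:)

    # cB maps (TC, nT) -> number of reads with that combination
    conversions = sum(tc * n for (tc, nT), n in cB.items())
    coverage = sum(nT * n for (tc, nT), n in cB.items())
    conversionRate = conversions / coverage if coverage > 0 else 0.0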

I'll run some tests in the next couple of days, but let me know if you have any objections to this strategy.

isaacvock commented 6 months ago

Hi @t-neumann,

All tests have passed and things look good. I ran them on the full dataset I previously mentioned, as well as on the small test dataset hosted in the repo; made sure cB file creation worked with `slamdunk all` and `slamdunk count`; and also made sure that omitting the new cB file flag meant the cB file did not get created. Conversion rates inferred from the cB agree well with the SLAMDUNK _tcount.tsv conversion rates (left plot), and new-to-total ratios (NTRs) estimated with GRAND-SLAM agree well with those estimated from the new cB file (right plot).

Is there anything else you would like me to check, or tests you would like me to run? If not, then it is ready to officially merge into the cb branch whenever you are ready.

Also, can you point me to where the documentation is hosted? I didn't find it within the slamdunk repo, and I'd be happy to make a pull request to update the docs as necessary.

Best, Isaac

t-neumann commented 6 months ago

Hi @isaacvock

Yes, please just submit the remaining changes. I would also like to test a few things on my end with a second pair of eyes, but the changes sound good to me.

For the documentation, you actually need to create a separate PR on the gh-pages branch, which hosts the documentation (it is rebuilt automatically). You would basically need to edit the *.rst files in doc/source/.

Cheers,

Tobi

t-neumann commented 6 months ago

Btw, did you resolve the issue with the mismatching numbers of covered Ts?

isaacvock commented 6 months ago

There are no additional changes to submit. And thank you for the information regarding the docs; I will try to make the necessary PR to update them sometime this week.

As I mentioned in my previous messages, I ended up just counting all of the T-to-C conversions and reference Ts in a read, regardless of whether the nucleotides were contained within the annotated UTR. Thus, the numbers of covered Ts and T-to-C conversions will be overall higher in the cB than in the _tcount file, but the two files will agree on the overall mutational content, which is what matters. See the conversion rate comparison plot in my last message for additional context; the problem before was that the _tcount conversion rates were consistently higher than those computed from the cB.

t-neumann commented 6 months ago

OK thanks a lot. Please let me know if you need further directions for the docs.

Please be patient with me about merging this PR, as I may want to incorporate a few bugfixes that have been lying around and properly test the whole thing before releasing a whole new slamdunk version. I guess that works for you, since you already have a local working version with your changes, correct?

isaacvock commented 6 months ago

Thank you, will do.

Yeah no rush, I appreciate your help throughout this process and any time you can find to work on eventually getting this out. Let me know if you have any further questions about the changes I made.

Best, Isaac