matsengrp / TemplatedMutagenesis-1

0 stars 0 forks source link

Play with Bowtie2 #2

Closed matsen closed 5 years ago

matsen commented 7 years ago

Rob's idea was to subset sequences by trimming them on (I think) the 5' end, because Bowtie will truncate these reads on the 3' end itself because it's thinking that our sequence quality will go down. If we want to truncate on the 3' end, then we do so then reverse complement.

In any case it'd be good to just take some slightly mutated V genes, subset them, then search for them. I anticipate that we'll need to play with settings so we don't get zillions of hits.

Rob said that if we made our Bowtie database using each V gene as a chromosome, then the output will be easier to manage.

jfukuyama commented 7 years ago

I said the other day that I thought this would be difficult, but I've changed my mind. If we have a reference sequence seed_test.fa, a query seed_test_query.fq, and a seed of length 4, we should be able to get all the alignments where the query matches the reference at the first 4 positions and their lengths past that with

bowtie2-build seed_test.fa seed_test
bowtie2 -x seed_test -U seed_test_query.fq \
    -N 0 \
    -L 4 \
    -a \
    --no-1mm-upfront \
    -i L,0,1 \
    --local \
    --score-min L,4,0 \
    --mp 50,50 \
    --ma 1 \
    --norc \
    --ignore-quals \
    -S seed_test.sam

The important flags are

To get the matches on both sides of the seed we would have to do this twice, and I think the most difficult part would probably be merging the two alignment files together.

matsen commented 7 years ago

Nice work!

To get the matches on both sides of the seed we would have to do this twice

We'd need to do it twice, once with the original sequences, and the other time with the reverse complement? And I think this was mentioned by Rob early on?

https://ccb.jhu.edu/software/FLASH/ is a canonical paired end seq merger.

jfukuyama commented 7 years ago

Yeah that's right, for each seed we would create two queries, one starting with the seed in the original orientation, the other starting with the reverse complement of the seed.

I think our merging problem is different from the FLASH problem --- they know which reads are paired together and are trying to find the right overlap. We will have two sets of alignments and will need to find pairings between members of the two sets.

matsen commented 7 years ago

OK. Thanks for persisting with this approach.

I see what you are saying re FLASH. In our case, perhaps the merging problem is trivial because we are demanding that the reads are identical at the seed.

I'm thinking that the workflow is as follows.

  1. Find a mutation in a read X, say at position i
  2. For this mutation, we make two queries: X-i-fwd, and X-i-rev, each of which includes some "seed", which is the read including some number of bases on either side. E.g. if we have a seed of length 5, then this is 2 bases on either side.
  3. Run Bowtie on both of these. If they map to the same* location on the same "chromosome" (i.e. germline gene) then we merge their matches

    * same = same modulo the length of the seed.

In contrast, a Python implementation would be

  1. Find a mutation in a read X, say at position i
  2. Define the seed for that position and read
  3. Search this across all the germline genes
  4. For each hit, extend as much as possible (just using a for loop and checking each base moving away)

Is that about right?

jfukuyama commented 7 years ago

Yup that is what I was thinking too.

I agree that the merging problem isn't difficult conceptually, the issue will just be getting the information out of the bowtie output efficiently.

I put some python code up that does the string matching, but I think in the long run it will be better to figure out how to do this with bowtie because it is more flexible and will allow us to look for fuzzy matches more easily.

matsen commented 7 years ago

Sounds good. Perhaps it's worth setting up some really basic examples so we can do some side-by-side comparison to sanity check both? I mean, bowtie's a complicated program so there may be some things we don't understand that it's doing.