shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data
Apache License 2.0

Filtering for multiple cigar strings #67

Closed jmbeesley closed 2 years ago

jmbeesley commented 2 years ago

Hi

Thanks for the protocol and software.

We're using the approach to look at SNP effects. We've included both SNVs and indels in the enhancer oligos. Most of our inserts are 201 nt (100 nt either side of SNVs), but for indels the inserts vary from 199 to 205 nt. We're following the instructions for SNV alignment and filtering (--cigar 201M --mapq 1), but wondering if there's a way to deal with the indels? For example, is there any way to filter on multiple cigar strings to extract the barcodes for these variants?

Any advice would be greatly appreciated.

Thanks,

Jonathan

visze commented 2 years ago

Hi Jonathan,

Sorry for the late response. I didn't see it.

Yes, I see your point that you need different cigar strings for filtering. In theory we can modify the code to accept multiple cigar strings (I would be happy if someone made a pull request with the changes).

But you have two simple options in your case.

  1. Don't use a cigar string. The --cigar option is optional.
  2. Workaround: Split your design into N design files, one for each of the N different cigar strings you have. Then run the association workflow N times, each time with a different design file and its matching cigar string (but with all input reads). Your output will be N association files (pickle format). They can be merged into one (I can help with that, of course). Then the count workflow can be used with the merged association file.
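The design split in option 2 could be sketched roughly like this. It is only an illustration, not part of MPRAflow: it assumes the design file is a plain FASTA whose sequences are exactly the inserts that get aligned, so that an insert of length n corresponds to the cigar string nM. The function names and the output file naming are made up for the example.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_design_by_length(lines):
    """Group design records by insert length.

    Keys are the assumed full-length-match cigar strings (e.g. '201M'),
    values are lists of (name, sequence) tuples.
    """
    groups = defaultdict(list)
    for name, seq in read_fasta(lines):
        groups[f"{len(seq)}M"].append((name, seq))
    return dict(groups)


def write_groups(groups, prefix):
    """Write one FASTA per cigar string; returns {cigar: path}.

    The '<prefix>.<cigar>.fa' naming is hypothetical.
    """
    paths = {}
    for cigar, records in groups.items():
        path = f"{prefix}.{cigar}.fa"
        with open(path, "w") as fh:
            for name, seq in records:
                fh.write(f">{name}\n{seq}\n")
        paths[cigar] = path
    return paths
```

Each resulting file would then go into its own association run, with the key (for example 199M or 205M) passed as the --cigar value for that run.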

The second option might be the best one, and it is even better than allowing multiple cigar strings, because each run matches the reads directly against the designed sequences of the corresponding length.
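Merging the N association files from option 2 might look like the sketch below. The pickle layout is an assumption here, not something confirmed in this thread: each file is taken to hold a dict that maps an insert name to a collection of barcodes. If the real files are structured differently, the merge has to be adapted. The function names are made up for the example.

```python
import pickle


def merge_associations(dicts):
    """Union several {insert_name: barcodes} dicts into one.

    ASSUMPTION: every value is an iterable of barcode strings.
    """
    merged = {}
    for assoc in dicts:
        for insert_name, barcodes in assoc.items():
            merged.setdefault(insert_name, set()).update(barcodes)
    return merged


def merge_pickles(paths, out_path):
    """Load each association pickle, merge them, and write one pickle."""
    loaded = []
    for path in paths:
        with open(path, "rb") as fh:
            loaded.append(pickle.load(fh))
    merged = merge_associations(loaded)
    with open(out_path, "wb") as fh:
        pickle.dump(merged, fh)
    return merged
```

This sketch does not check whether the same barcode ends up under more than one insert across runs, so that would be worth checking separately before running the count workflow.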

jmbeesley commented 2 years ago

Thanks for the suggestions. I'll have a go at option 2.

Jonathan

visze commented 2 years ago

Great. I will close the ticket. Please reopen it if you need more help here.