shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data
Apache License 2.0
31 stars 16 forks source link

Association filtering steps #41

Closed colinshew closed 2 years ago

colinshew commented 3 years ago

Hi,

I would like help understanding the output from the Association utility, specifically to identify the reasons certain CRSs are dropped from the analysis. Are any additional filters are applied aside from the user-controlled ones? For example, I pulled out the CRSs that had zero barcode assignments and found they still had reads aligning to them at high enough coverage/MAPQ/percent overlap. Are there other reasons reads/barcodes might be filtered (for example, assignment to multiple CRSs)? My design includes variants of the same sequence, so there is lots of multimapping, but in many cases there is a clear best alignment (AS>XS).

Also, am I correct in thinking that ${name}_coords_to_barcodes.pickle represents all barcode associations, and ${name}_filtered_coords_to_barcodes.pickle represents the valid associations after filtering is applied? I'm a little confused because the distribution of barcodes/CRS differs before filtering when I use different parameters.

Finally, I have an output file ${name}_barcode_counts.pickle that is not mentioned in the docs but appears to contain counts per barcode sequence. However, the total number of barcodes is intermediate to the number in the filtered or unfiltered CRS dictionaries. This doesn't appear to be used downstream, but I'm curious why it doesn't match and if this is related to the above questions.

I'm happy to provide concrete examples if that would help, but it sounds like I am probably just misunderstanding the workflow/output files. Thanks for any explanation!

Colin

visze commented 3 years ago

Hi Colin,

sorry for my late reply. I was long time away and sadly no one else found time to help. Just to get it right. By CRS you mean your targets in the design file?

When you have variants designed you should have two targets per variant in your design file. One with the reference, one with the alternative. For SNVs the sequences will differ only by one nucleotide. This really drops the MAPQ. So you should run the workflow with something like --mapq 1 (also 0 is fine) instead of the default (30 is way to high!).

Also your design file should not include identical sequences. Then we have no single best hit and we will filter out the reads. Only reads that map to one target with the maximum score are selected. When matching to multiple (on their max mapping score). They will be removed.

About filtering you are correct. After filtering only the targets (and its barcodes) are present with a minimum coverage (default 3) and a minimum fraction of BC map to single insert (default 0.5). The first file (${name}_coords_to_barcodes.pickle) might differ when you use different MAPQ or BASEQ cutoff values (or also rerun the mapping, not deterministic).

About ${name}_barcode_counts.pickle: You are right. This file is not documented and we should do this (even we do not use this file further)! It shows how many times we see a single BC (and not where it is assigned to). You can have a look at the file to see if the numbers are uniform or if there are any outliers (like PCR artifacts with high number of counts). The BCs you find in this file should be the same you will find in the assignment file (before filtering). Otherwise they are not comparable.

Hope this helps a bit understanding the steps. Please let me know if you need further help.

Best, Max

lotard commented 2 years ago

Two small notes:

visze commented 2 years ago

Thank you.

I will change the mapq issue using the minumum (change to else here: https://github.com/shendurelab/MPRAflow/blob/261b3d75dd31bed0ac24b30f32675d2d1ee700b1/src/nf_ori_map_barcodes.py#L96)

About your second issue:

Identical sequences in the design file is nonsense, because the assay cannot distinguish between them. Also bwa will map the read somehow randomly to one reference and marks it as second hit on the other identical regions. Every readis (of course) only counted once. So that's why you are getting different counts on identical sequences in the design file.

visze commented 2 years ago

close because of inactivity

please open again if needed