jluebeck / FaNDOM

Fast Nested Distance aligner for Optical Maps

Filtering step : filter_individual.py #8

Open AudreyDub opened 2 years ago

AudreyDub commented 2 years ago

Hello,

In the FaNDOM workflow, there is a step that filters alignments using filter_individual.py. I'm not sure I understand exactly what this step does, especially the choice of metrics. Could you explain why you chose this condition?

if abs(a1 - a2) > 40000 and ((12 > abs(separate_lines) > 9 and score > 5000) or (16 > abs(separate_lines) > 11 and score > 4500) or (abs(separate_lines) > 15 and score > 4000)):

In particular, why the value 40000, and why are alignments with separate_lines < 9 not considered? Is the support too low?

I want to use FaNDOM on Bionano cancer samples, and I'm not sure this step is necessary because I want to identify rare SVs. Do you have an opinion? Can you tell me when this step should be run? Does it depend on the sample or organism?

Best,
Audrey

siavashre commented 2 years ago

Hi AudreyDub, and thank you so much for your question. When FaNDOM is run in the mode that reports partial alignments, it reports many partial alignments for each query, and some of them can be very short and low confidence. To remove those, we apply a set of filters and thresholds.

First, each partial alignment must be longer than 40 Kbp; a shorter alignment can match many regions. So 40000 here is the threshold on alignment length.

The second filter applies to the score reported by FaNDOM. This score is calculated from the aligned labels, and the "separate_lines" variable is the number of labels in an alignment. If an alignment has few labels (between 9 and 12), we expect it to have a higher score than an alignment with more labels before we consider it reliable. The conditions in the code represent these different thresholds. If "separate_lines" is less than 9, the alignment has fewer than 9 labels and is too short to consider.

If you have any other questions, please let me know.
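To make the tiers easier to read, the quoted condition can be restated as a small predicate. This is only a sketch of the condition as written in the question, not the actual code of filter_individual.py: the names `MIN_SPAN_BP`, `SCORE_TIERS`, and `keep_alignment` are hypothetical, and treating `a1` and `a2` as the two ends of the alignment is an assumption based on the explanation above. Because the comparisons are strict, the first tier covers alignments with 10 or 11 labels, and alignments with exactly 9 labels are rejected as well.

```python
MIN_SPAN_BP = 40_000  # an alignment must span strictly more than this many bp

# (low, high, min_score): keep when low < n_labels < high and score > min_score.
# Bounds are exclusive, mirroring the chained comparisons in the quoted condition.
SCORE_TIERS = [
    (9, 12, 5000),              # 10 or 11 labels: strictest score cutoff
    (11, 16, 4500),             # 12 to 15 labels
    (15, float("inf"), 4000),   # 16 or more labels: most lenient cutoff
]


def keep_alignment(a1, a2, separate_lines, score):
    """Return True if a partial alignment passes the quoted filter.

    a1, a2 are assumed to be the alignment end coordinates (hypothetical
    interpretation); separate_lines is the number of labels in the alignment.
    """
    if abs(a1 - a2) <= MIN_SPAN_BP:
        return False
    n_labels = abs(separate_lines)
    return any(
        low < n_labels < high and score > min_score
        for low, high, min_score in SCORE_TIERS
    )


if __name__ == "__main__":
    # Same span and score, different label counts: only the second passes.
    print(keep_alignment(0, 50_000, 10, 4800))
    print(keep_alignment(0, 50_000, 14, 4800))
```

Keeping the thresholds in one table makes it straightforward to see how many alignments a looser or stricter cutoff would retain on a given sample before deciding what to change.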

siavashre commented 2 years ago

Also, regarding your other question: I believe this step should always be run, and it doesn't depend on the sample or organism. As I explained above, the goal of this step is to remove unreliable and incorrect partial alignment calls for each query. For running FaNDOM on cancer data, I again highly recommend running this script.