nrminor / oneroof

Base-, Variant-, and Consensus-calling under One Proverbial Roof. Work in progress!
MIT License

Figure out how to handle malformed bed files like the BED below where all primers are forward strand oriented #21

Closed nrminor closed 1 month ago

nrminor commented 1 month ago

The primer handling in the pipeline currently fails when a primer's stop position precedes its start position, as in the primer BED file for the SARS-CoV-2 "MIDNIGHT" d1200 ARTIC primers, where every RIGHT primer is written with start greater than stop and a "+" strand:

NC_045512.2 30  54  SARSCoV_1200_1_LEFT 0   +
NC_045512.2 1205    1183    SARSCoV_1200_1_RIGHT    0   +
NC_045512.2 2153    2179    SARSCoV_1200_3_LEFT 0   +
NC_045512.2 3257    3235    SARSCoV_1200_3_RIGHT    0   +
NC_045512.2 4167    4189    SARSCoV_1200_5_LEFT 0   +
NC_045512.2 5359    5337    SARSCoV_1200_5_RIGHT    0   +
NC_045512.2 6283    6307    SARSCoV_1200_7_LEFT 0   +
NC_045512.2 7401    7379    SARSCoV_1200_7_RIGHT    0   +
NC_045512.2 8253    8282    SARSCoV_1200_9_LEFT 0   +
NC_045512.2 9400    9378    SARSCoV_1200_9_RIGHT    0   +
NC_045512.2 10343   10370   SARSCoV_1200_11_LEFT    0   +
NC_045512.2 11469   11447   SARSCoV_1200_11_RIGHT   0   +
NC_045512.2 12450   12473   SARSCoV_1200_13_LEFT    0   +
NC_045512.2 13621   13599   SARSCoV_1200_13_RIGHT   0   +
NC_045512.2 14540   14568   SARSCoV_1200_15_LEFT    0   +
NC_045512.2 15735   15713   SARSCoV_1200_15_RIGHT   0   +
NC_045512.2 16624   16647   SARSCoV_1200_17_LEFT    0   +
NC_045512.2 17754   17732   SARSCoV_1200_17_RIGHT   0   +
NC_045512.2 18596   18618   SARSCoV_1200_19_LEFT    0   +
NC_045512.2 19678   19655   SARSCoV_1200_19_RIGHT   0   +
NC_045512.2 20553   20581   SARSCoV_1200_21_LEFT    0   +
NC_045512.2 21642   21620   SARSCoV_1200_21_RIGHT   0   +
NC_045512.2 22511   22537   SARSCoV_1200_23_LEFT    0   +
NC_045512.2 23631   23609   SARSCoV_1200_23_RIGHT   0   +
NC_045512.2 24633   24658   SARSCoV_1200_25_LEFT    0   +
NC_045512.2 25790   25768   SARSCoV_1200_25_RIGHT   0   +
NC_045512.2 26744   26766   SARSCoV_1200_27_LEFT    0   +
NC_045512.2 27894   27872   SARSCoV_1200_27_RIGHT   0   +
NC_045512.2 28677   28699   SARSCoV_1200_29_LEFT    0   +
NC_045512.2 29790   29768   SARSCoV_1200_29_RIGHT   0   +
NC_045512.2 1100    1128    SARSCoV_1200_2_LEFT 0   +
NC_045512.2 2266    2244    SARSCoV_1200_2_RIGHT    0   +
NC_045512.2 3144    3166    SARSCoV_1200_4_LEFT 0   +
NC_045512.2 4262    4240    SARSCoV_1200_4_RIGHT    0   +
NC_045512.2 5257    5286    SARSCoV_1200_6_LEFT 0   +
NC_045512.2 6380    6358    SARSCoV_1200_6_RIGHT    0   +
NC_045512.2 7298    7328    SARSCoV_1200_8_LEFT 0   +
NC_045512.2 8385    8363    SARSCoV_1200_8_RIGHT    0   +
NC_045512.2 9303    9327    SARSCoV_1200_10_LEFT    0   +
NC_045512.2 10451   10429   SARSCoV_1200_10_RIGHT   0   +
NC_045512.2 11372   11394   SARSCoV_1200_12_LEFT    0   +
NC_045512.2 12560   12538   SARSCoV_1200_12_RIGHT   0   +
NC_045512.2 13509   13532   SARSCoV_1200_14_LEFT    0   +
NC_045512.2 14641   14619   SARSCoV_1200_14_RIGHT   0   +
NC_045512.2 15608   15634   SARSCoV_1200_16_LEFT    0   +
NC_045512.2 16720   16698   SARSCoV_1200_16_RIGHT   0   +
NC_045512.2 17622   17649   SARSCoV_1200_18_LEFT    0   +
NC_045512.2 18706   18684   SARSCoV_1200_18_RIGHT   0   +
NC_045512.2 19574   19604   SARSCoV_1200_20_LEFT    0   +
NC_045512.2 20698   20676   SARSCoV_1200_20_RIGHT   0   +
NC_045512.2 21532   21562   SARSCoV_1200_22_LEFT    0   +
NC_045512.2 22612   22590   SARSCoV_1200_22_RIGHT   0   +
NC_045512.2 23518   23544   SARSCoV_1200_24_LEFT    0   +
NC_045512.2 24736   24714   SARSCoV_1200_24_RIGHT   0   +
NC_045512.2 25690   25712   SARSCoV_1200_26_LEFT    0   +
NC_045512.2 26857   26835   SARSCoV_1200_26_RIGHT   0   +
NC_045512.2 27784   27808   SARSCoV_1200_28_LEFT    0   +
NC_045512.2 29007   28985   SARSCoV_1200_28_RIGHT   0   +
NC_045512.2 22590   22612   nCoV_1200_22_Right      0   -
NC_045512.3 22511   22537   nCoV_1200_23_Left_Omicron   0   +
NC_045512.4 25690   25712   nCoV_1200_26_Left_Omicron   0   +
NC_045512.5 27784   27808   nCoV_1200_28_Left_Omicron   0   +
NC_045512.6 21675   21700   nCov_ARTIC_V4_71_Right  0   -

To solve this, we'll need to add a step that scans the input BED file, finds any rows where the stop position precedes the start position, flips the two coordinates, and sets the final (strand) column to "-". The script written for this step can eventually be expanded into one that handles all primer BED file validation, which has previously been a sore spot for the pipeline.
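A minimal sketch of what that scanning step could look like, in plain Python. This is not the pipeline's actual implementation: the function name `normalize_bed_rows` is hypothetical, and it assumes six-column, whitespace-delimited BED rows with the strand in the sixth column. It only swaps inverted coordinates and sets the strand; it does not rename primers or touch the score column.

```python
from typing import Iterable, Iterator


def normalize_bed_rows(lines: Iterable[str]) -> Iterator[str]:
    """Yield BED rows with start <= stop; rows that had to be flipped get strand '-'.

    Hypothetical helper: assumes at least six whitespace-delimited columns
    (chrom, start, stop, name, score, strand).
    """
    for line in lines:
        stripped = line.strip()
        # Pass blank lines and comment/header lines through unchanged.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            yield line.rstrip("\n")
            continue

        fields = stripped.split()
        if len(fields) < 6:
            raise ValueError(
                f"expected at least 6 BED columns, got {len(fields)}: {stripped!r}"
            )

        start, stop = int(fields[1]), int(fields[2])
        if start > stop:
            # Inverted interval: swap the coordinates and mark as reverse strand.
            fields[1], fields[2] = str(stop), str(start)
            fields[5] = "-"

        yield "\t".join(fields)


# Example with the first primer pair from the BED file above.
example_rows = [
    "NC_045512.2 30  54  SARSCoV_1200_1_LEFT 0   +",
    "NC_045512.2 1205    1183    SARSCoV_1200_1_RIGHT    0   +",
]
for row in normalize_bed_rows(example_rows):
    print(row)
```

Rows that are already well formed pass through with only their delimiters normalized to tabs, so the same function could later serve as the starting point for broader BED validation.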

nrminor commented 1 month ago

Here's an answer key for a BED file that will work:

NC_045512.2 30  54  nCoV-2019_1_LEFT    1   +
NC_045512.2 1183    1205    nCoV-2019_1_RIGHT   1   -
NC_045512.2 1100    1128    nCoV-2019_2_LEFT    2   +
NC_045512.2 2244    2266    nCoV-2019_2_RIGHT   2   -
NC_045512.2 2153    2179    nCoV-2019_3_LEFT    1   +
NC_045512.2 3235    3257    nCoV-2019_3_RIGHT   1   -
NC_045512.2 3144    3166    nCoV-2019_4_LEFT    2   +
NC_045512.2 4240    4262    nCoV-2019_4_RIGHT   2   -
NC_045512.2 4167    4189    nCoV-2019_5_LEFT    1   +
NC_045512.2 5337    5359    nCoV-2019_5_RIGHT   1   -
NC_045512.2 5257    5286    nCoV-2019_6_LEFT    2   +
NC_045512.2 6358    6380    nCoV-2019_6_RIGHT   2   -
NC_045512.2 6283    6307    nCoV-2019_7_LEFT    1   +
NC_045512.2 7379    7401    nCoV-2019_7_RIGHT   1   -
NC_045512.2 7298    7328    nCoV-2019_8_LEFT    2   +
NC_045512.2 8363    8385    nCoV-2019_8_RIGHT   2   -
NC_045512.2 8253    8282    nCoV-2019_9_LEFT    1   +
NC_045512.2 9378    9400    nCoV-2019_9_RIGHT   1   -
NC_045512.2 9303    9327    nCoV-2019_10_LEFT   2   +
NC_045512.2 10429   10451   nCoV-2019_10_RIGHT  2   -
NC_045512.2 10343   10370   nCoV-2019_11_LEFT   1   +
NC_045512.2 11447   11469   nCoV-2019_11_RIGHT  1   -
NC_045512.2 11372   11394   nCoV-2019_12_LEFT   2   +
NC_045512.2 12538   12560   nCoV-2019_12_RIGHT  2   -
NC_045512.2 12450   12473   nCoV-2019_13_LEFT   1   +
NC_045512.2 13599   13621   nCoV-2019_13_RIGHT  1   -
NC_045512.2 13509   13532   nCoV-2019_14_LEFT   2   +
NC_045512.2 14619   14641   nCoV-2019_14_RIGHT  2   -
NC_045512.2 14540   14568   nCoV-2019_15_LEFT   1   +
NC_045512.2 15713   15735   nCoV-2019_15_RIGHT  1   -
NC_045512.2 15608   15634   nCoV-2019_16_LEFT   2   +
NC_045512.2 16698   16720   nCoV-2019_16_RIGHT  2   -
NC_045512.2 16624   16647   nCoV-2019_17_LEFT   1   +
NC_045512.2 17732   17754   nCoV-2019_17_RIGHT  1   -
NC_045512.2 17622   17649   nCoV-2019_18_LEFT   2   +
NC_045512.2 18684   18706   nCoV-2019_18_RIGHT  2   -
NC_045512.2 18596   18618   nCoV-2019_19_LEFT   1   +
NC_045512.2 19655   19678   nCoV-2019_19_RIGHT  1   -
NC_045512.2 19574   19604   nCoV-2019_20_LEFT   2   +
NC_045512.2 20676   20698   nCoV-2019_20_RIGHT  2   -
NC_045512.2 20553   20581   nCoV-2019_21_LEFT   1   +
NC_045512.2 21620   21642   nCoV-2019_21_RIGHT  1   -
NC_045512.2 21532   21562   nCoV-2019_22_LEFT   2   +
NC_045512.2 22590   22612   nCoV-2019_22_RIGHT  2   -
NC_045512.2 22511   22537   nCoV-2019_23_LEFT   1   +
NC_045512.2 23609   23631   nCoV-2019_23_RIGHT  1   -
NC_045512.2 23518   23544   nCoV-2019_24_LEFT   2   +
NC_045512.2 24714   24736   nCoV-2019_24_RIGHT  2   -
NC_045512.2 24633   24658   nCoV-2019_25_LEFT   1   +
NC_045512.2 25768   25790   nCoV-2019_25_RIGHT  1   -
NC_045512.2 25690   25712   nCoV-2019_26_LEFT   2   +
NC_045512.2 26835   26857   nCoV-2019_26_RIGHT  2   -
NC_045512.2 26744   26766   nCoV-2019_27_LEFT   1   +
NC_045512.2 27872   27894   nCoV-2019_27_RIGHT  1   -
NC_045512.2 27784   27808   nCoV-2019_28_LEFT   2   +
NC_045512.2 28985   29007   nCoV-2019_28_RIGHT  2   -
NC_045512.2 28677   28699   nCoV-2019_29_LEFT   1   +
NC_045512.2 29768   29790   nCoV-2019_29_RIGHT  1   -