matsengrp / phip-flow

A Nextflow pipeline to align, merge, and organize large PhIP-Seq datasets
MIT License

Low alignment #70

Closed bkellman closed 10 months ago

bkellman commented 10 months ago

I am seeing surprisingly low % mapping. The COVID example gives a % mapping of around 70%, while several experiments we have run give closer to 15%. My colleagues have run these data manually without phipflow and tell me that they have alignment close to 70%, so I suspect I'm messing up some input parameter that is somehow creating an alignment issue. But I can't figure out what that parameter might be.

Any thoughts on what may be causing the low alignment?

Two calls that produced the same result:

nextflow run matsengrp/phip-flow -r V1.10  \
        --sample_table ../data-raw/20231219_samples.csv \
        --peptide_table ../data-raw/peptide_table/VIR3_clean.csv \
        --read_length 75 --peptide_tile_length 168 \
        --run_zscore_fit_predict \
        --run_cpm_enr_workflow \
        --summarize_by_organism true \
        --peptide_org_col Organism \
        --sample_grouping_col sample_name \
        --results 20231219_phipflow_"$(date -I)" \
        -resume

nextflow run matsengrp/phip-flow -r V1.10  \
        --sample_table ../data-raw/20231219_samples.DEBUG.csv \
        --peptide_table ../data-raw/peptide_table/VIR3_clean.csv \
        --read_length 75 --peptide_tile_length 168 \
        --peptide_seq_col Prot \
        --sample_grouping_col sample_name \
        --results 20231219_phipflow_debug_"$(date -I)" \
        -resume

Snippet of the peptide table:

e_id | Aclstr50 | Bclstr50 | Entry | Gene names | Gene ontology (GO) | Gene ontology IDs | Genus | Organism | Protein names | seq | Species | Subcellular location | Version (entry) | Version (sequence) | end | id | oligo | source | Prot_Start | Prot -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- 0 | 8652 | 15383 | A0A126 | US5 | suppression by virus of host apoptotic process | GO:0019050 | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein J | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA | Papiine herpesvirus 2 | 13 | 1 | 56 | 1 | aGGAATTCCGCTGCGTatgcgcagcttgctgtttgtggtcggtgcttgggtcgctgctctcgtcaccaaccttacccctgatgcagctcttgcaagtggtactacaaccaccgctgccgcagggaacacatctgcaacagcttctccaggtgacaacgccacaagcatcgacgctggcagtacaCAGGgaagagctcgaa | Vir2 | 1 | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGST 1 | 15383 | 15384 | A0A126 | US5 | suppression by virus of host apoptotic process | GO:0019050 | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein J | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA | Papiine herpesvirus 2 | 13 | 1 | 84 | 2 | aTGAATTCGGAGCGGTactacaaccaccgctgccgcagggaacacatctgcaacagcttctccaggtgacaacgccacaagcatcgacgctggcagtacaattaccgctgccgctcctccaggtcattcaacaccttggcctgcactcccaactgatctcgcacttccactcgttatcgggggtCACTGCACTCGAGACa | Vir2 | 29 | TTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGG 2 | 15384 | 3254 | A0A126 | US5 | suppression by virus of host apoptotic process | GO:0019050 | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein J | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA | Papiine herpesvirus 2 | 13 | 1 | 112 | 3 | 
aGGAATTCCGCTGCGTattaccgctgccgctcctccaggtcattcaacaccttggcctgcactcccaactgatctcgcacttccactcgttatcgggggtttgtgcgccctcacactcgcagcaatgggcgccggggcattgcttcatcgctgctgccgccgctgcgcacgccgccgccagaatCAGGgaagagctcgaa | Vir2 | 57 | ITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQN 3 | 3254 |   | A0A126 | US5 | suppression by virus of host apoptotic process | GO:0019050 | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein J | MRSLLFVVGAWVAALVTNLTPDAALASGTTTTAAAGNTSATASPGDNATSIDAGSTITAAAPPGHSTPWPALPTDLALPLVIGGLCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA | Papiine herpesvirus 2 | 13 | 1 | 118 | 4 | aTGAATTCGGAGCGGTttgtgcgccctcacactcgcagcaatgggcgccggggcattgcttcatcgctgctgccgccgctgcgcacgccgccgccagaatgtctcctcagtcagtgcttaatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacggcCACTGCACTCGAGACa | Vir2 | 85 | LCALTLAAMGAGALLHRCCRRCARRRQNVSSVSA 4 | 32531 |   | A0A130 | US4 |   |   |   | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein G (Fragment) | RDRGPSRSRVRYTRLAASEA | Papiine herpesvirus 2 | 8 | 1 | 20 | 5 | aGGAATTCCGCTGCGTcgcgatcgcggcccttctcgctctcgcgtgcgctacacccgcctggctgcctcagaagcttaatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacggcgaaacccactaccgtacttgggtcaggtcgatccatgttcctCAGGgaagagctcgaa | Vir2 | 1 | RDRGPSRSRVRYTRLAASEA 5 | 6167 | 2421 | A0A132 | US6 |   |   |   | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein D (Fragment) | MGFGAAAALLALAVALARVPAGGGAYVPVDRALTRVSPNRFRGSSLPPPEQKTDPPDVRRVYH | Papiine herpesvirus 2 | 11 | 1 | 56 | 6 | aGGAATTCCGCTGCGTatggggtttggcgccgcagcagcactgttggctctggcagttgcactcgcccgcgtgcctgcaggcggcggggcatatgtcccagtggaccgcgcactcacacgcgttagcccaaaccgcttccgcggttcatccctgccacctcctgaacaaaagaccgaccctcctCAGGgaagagctcgaa | Vir2 | 1 | MGFGAAAALLALAVALARVPAGGGAYVPVDRALTRVSPNRFRGSSLPPPEQKTDPP 6 | 2421 |   | A0A132 | US6 |   |   |   | Cercopithecine herpesvirus 16 (CeHV-16) (Herpesvirus papio 2) | Glycoprotein D (Fragment) | 
MGFGAAAALLALAVALARVPAGGGAYVPVDRALTRVSPNRFRGSSLPPPEQKTDPPDVRRVYH | Papiine herpesvirus 2 | 11 | 1 | 63 | 7 | aTGAATTCGGAGCGGTgtggaccgcgcactcacacgcgttagcccaaaccgcttccgcggttcatccctgccacctcctgaacaaaagaccgaccctcctgacgtccgccgcgtgtatcattaatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacCACTGCACTCGAGACa | Vir2 | 29 | VDRALTRVSPNRFRGSSLPPPEQKTDPPDVRRVYH 7 | 4099 | 4100 | A0ERK8 | VAC_DPP16_017 VAC_DPP16_226 VAC_TP3_013 VAC_TP5_013 VACV-DUKE-017 VACV_TT12_011 VACV_TT8_011 | Vaccinia virus | VACV-DUKE-017 (Zinc finger-Like protein) (Zinc finger-like) (Zinc finger-like protein) | MHYPKYYINITKINPHLANQFRAWKKRIAGRDYITNLSKDTGIQQSKLTETIRNCQKNRNIYGLYIHYNLVINWITDVIINQY | Vaccinia virus | 10 | 1 | 56 | 8 | aGGAATTCCGCTGCGTatgcactaccctaaatactacatcaacatcacaaaaatcaatcctcaccttgctaaccagtttcgcgcctggaaaaaacgcatcgccggtcgcgattacatcactaaccttagcaaggataccggcatccaacagagcaaacttacagaaaccattcgcaattgtcagCAGGgaagagctcgaa | Vir2 | 1 | MHYPKYYINITKINPHLANQFRAWKKRIAGRDYITNLSKDTGIQQSKLTETIRNCQ 8 | 4100 | 6440 | A0ERK8 | VAC_DPP16_017 VAC_DPP16_226 VAC_TP3_013 VAC_TP5_013 VACV-DUKE-017 VACV_TT12_011 VACV_TT8_011 | Vaccinia virus | VACV-DUKE-017 (Zinc finger-Like protein) (Zinc finger-like) (Zinc finger-like protein) | MHYPKYYINITKINPHLANQFRAWKKRIAGRDYITNLSKDTGIQQSKLTETIRNCQKNRNIYGLYIHYNLVINWITDVIINQY | Vaccinia virus | 10 | 1 | 83 | 9 | aTGAATTCGGAGCGGTgccggtcgcgattacatcactaaccttagcaaggataccggcatccaacagagcaaacttacagaaaccattcgcaattgtcagaagaatcgcaatatctatggcctctacattcattacaatctggtgattaattggatcaccgacgtgattattaaccagtattaaCACTGCACTCGAGACa | Vir2 | 29 | AGRDYITNLSKDTGIQQSKLTETIRNCQKNRNIYGLYIHYNLVINWITDVIINQY 9 | 6500 | 3255 | A0ERN6 | VACV-DUKE-045 |   |   | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 56 | 10 | 
aGGAATTCCGCTGCGTatgggcgcaactattagcattttggcctcctatgataaccctaatctcttcaccgccatgattttgatgagtccacttgtcaatgccgatgctgttagtcgcctgaacctcctggcagctaagctgatgggcacaatcacccctaatgcccctgtggggaaactctgcCAGGgaagagctcgaa | Vir2 | 1 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLC 10 | 3255 | 3256 | A0ERN6 | VACV-DUKE-045 |   |   | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 84 | 11 | aTGAATTCGGAGCGGTgccgatgctgttagtcgcctgaacctcctggcagctaagctgatgggcacaatcacccctaatgcccctgtggggaaactctgccctgaatccgtctcacgcgatatggacaaggtttataagtatcagtacgaccctttgatcaatcacgagaaaatcaaagcagggCACTGCACTCGAGACa | Vir2 | 29 | ADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAG 11 | 3256 | 3093 | A0ERN6 | VACV-DUKE-045 |   |   | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 112 | 12 | aGGAATTCCGCTGCGTcctgaatccgtctcacgcgatatggacaaggtttataagtatcagtacgaccctttgatcaatcacgagaaaatcaaagcagggtttgcctcacaagttcttaaagctactaacaaagtgcgcaagattatctccaagattaatacaccacctaccttgatcctgcagCAGGgaagagctcgaa | Vir2 | 57 | PESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQ 12 | 3093 | 9021 | A0ERN6 | VACV-DUKE-045 |   |   | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 140 | 13 | aTGAATTCGGAGCGGTtttgcctcacaagttcttaaagctactaacaaagtgcgcaagattatctccaagattaatacaccacctaccttgatcctgcaggggaccaataatgaaatcagtgacgttctcggcgcatactactttatgcaacacgctaactgcaatcgcgagatcaaaatctacCACTGCACTCGAGACa | Vir2 | 85 | FASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIY 13 | 9021 | 14672 | A0ERN6 | 
VACV-DUKE-045 |   |   | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 168 | 14 | aGGAATTCCGCTGCGTgggaccaataatgaaatcagtgacgttctcggcgcatactactttatgcaacacgctaactgcaatcgcgagatcaaaatctacgaaggtgccaaacatcacctgcataaggagactgacgaagtcaagaaatctgtgatgaaggagatcgaaacttggattttcaatCAGGgaagagctcgaa | Vir2 | 113 | GTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFN 14 | 14672 |   | A0ERN6 | VACV-DUKE-045 |   |   | Vaccinia virus | VACV-DUKE-045 | MGATISILASYDNPNLFTAMILMSPLVNADAVSRLNLLAAKLMGTITPNAPVGKLCPESVSRDMDKVYKYQYDPLINHEKIKAGFASQVLKATNKVRKIISKINTPPTLILQGTNNEISDVLGAYYFMQHANCNREIKIYEGAKHHLHKETDEVKKSVMKEIETWIFNRVK | Vaccinia virus | 4 | 1 | 171 | 15 | aTGAATTCGGAGCGGTgaaggtgccaaacatcacctgcataaggagactgacgaagtcaagaaatctgtgatgaaggagatcgaaacttggattttcaatcgcgttaaataatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacggcgaaacccacCACTGCACTCGAGACa | Vir2 | 141 | EGAKHHLHKETDEVKKSVMKEIETWIFNRVK 15 | 8653 | 4310 | A0ES12 | VACV-DUKE-171 |   |   | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 56 | 16 | aGGAATTCCGCTGCGTatgattccattgttgtttatcttgttctattttgctaacgggatcgaatggcataaatttgaaacttccgaagagattattagcacatatttgcttgatgacgttctttataccggcgtgaacggcgctgtctacactttttcaaataacaagcttaataagacaggcCAGGgaagagctcgaa | Vir2 | 1 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTG 16 | 4310 | 4311 | A0ES12 | VACV-DUKE-171 |   |   | Vaccinia virus | VACV-DUKE-171 | 
MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 84 | 17 | aTGAATTCGGAGCGGTacatatttgcttgatgacgttctttataccggcgtgaacggcgctgtctacactttttcaaataacaagcttaataagacaggcttgacaaacactaattatattaccacatcaatcaaggttgaagacgccgataaagatactcttgtctgcggcaccaacaacggtCACTGCACTCGAGACa | Vir2 | 29 | TYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNG 17 | 4311 | 6501 | A0ES12 | VACV-DUKE-171 |   |   | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 112 | 18 | aGGAATTCCGCTGCGTttgacaaacactaattatattaccacatcaatcaaggttgaagacgccgataaagatactcttgtctgcggcaccaacaacggtaaccctaaatgctggaaaatcgacggctcagatgatcctaaacaccgcgggcgcggttacgccccatatcagaactccaaagtcCAGGgaagagctcgaa | Vir2 | 57 | LTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKV 18 | 6501 | 6502 | A0ES12 | VACV-DUKE-171 |   |   | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 140 | 19 | aTGAATTCGGAGCGGTaaccctaaatgctggaaaatcgacggctcagatgatcctaaacaccgcgggcgcggttacgccccatatcagaactccaaagtcacaatcatcagtcacaacggttgtgtcttgtcagatattaatatcagtaaggagggtatcaaacgctggcgccgctttgacgggCACTGCACTCGAGACa | Vir2 | 85 | NPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDG 19 | 6502 | 6168 | A0ES12 | VACV-DUKE-171 |   |   | Vaccinia virus | VACV-DUKE-171 | 
MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 168 | 20 | aGGAATTCCGCTGCGTacaatcatcagtcacaacggttgtgtcttgtcagatattaatatcagtaaggagggtatcaaacgctggcgccgctttgacgggccttgtggctatgacctctttactgcagacaacgtcatcccaaaagatggcttgcgcggggcttttgttgataaggatggcacaCAGGgaagagctcgaa | Vir2 | 113 | TIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGT 20 | 6168 | 4312 | A0ES12 | VACV-DUKE-171 |   |   | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 196 | 21 | aTGAATTCGGAGCGGTccttgtggctatgacctctttactgcagacaacgtcatcccaaaagatggcttgcgcggggcttttgttgataaggatggcacatacgataaagtgtatatcttgttcaccgataccattggttctaagcgcatcgtgaagatcccttatatcacccagatgtgccttCACTGCACTCGAGACa | Vir2 | 141 | PCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCL 21 | 4312 | 4313 | A0ES12 | VACV-DUKE-171 |   |   | Vaccinia virus | VACV-DUKE-171 | MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 224 | 22 | aGGAATTCCGCTGCGTtacgataaagtgtatatcttgttcaccgataccattggttctaagcgcatcgtgaagatcccttatatcacccagatgtgccttaatgatgaagggggcccttcaagtttgagttctcaccgctggtccacttttctcaaagtcgagcttgaatgtgacatcgacggtCAGGgaagagctcgaa | Vir2 | 169 | YDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDG 22 | 4313 |   | A0ES12 | VACV-DUKE-171 |   |   | Vaccinia virus | VACV-DUKE-171 | 
MIPLLFILFYFANGIEWHKFETSEEIISTYLLDDVLYTGVNGAVYTFSNNKLNKTGLTNTNYITTSIKVEDADKDTLVCGTNNGNPKCWKIDGSDDPKHRGRGYAPYQNSKVTIISHNGCVLSDINISKEGIKRWRRFDGPCGYDLFTADNVIPKDGLRGAFVDKDGTYDKVYILFTDTIGSKRIVKIPYITQMCLNDEGGPSSLSSHRWSTFLKVELECDIDGRSYR | Vaccinia virus | 17 | 1 | 228 | 23 | aTGAATTCGGAGCGGTaatgatgaagggggcccttcaagtttgagttctcaccgctggtccacttttctcaaagtcgagcttgaatgtgacatcgacggtcgctcataccgctaatctgagctcagtccacgagcagatccgtgtgtgtaggtagattacagcatttctcgcgacggcgaaaccCACTGCACTCGAGACa | Vir2 | 197 | NDEGGPSSLSSHRWSTFLKVELECDIDGRSYR 23 | 14673 | 4314 | A0ES13 | VACV-DUKE-172 |   |   | Vaccinia virus | VACV-DUKE-172 | MNTIKQSFSTSKLEGYTKQLPSPAPGICLPAGKVVPHTTFEVIEQYNVLDDIIKPLSNQPIFEGPSGVKWFDIKEKENEHREYRIYFIKENSIYSFDTKSKQTRSSQVDARLFSVMVTSKPLFIADIGIGVGMPQMKKILKM | Vaccinia virus | 17 | 1 | 56 | 24 | aGGAATTCCGCTGCGTatgaacactatcaaacagtcattctcaacatctaaactcgaagggtataccaaacagctcccaagtccagcaccagggatttgcttgccagcaggtaaggtggtgcctcacacaacattcgaggttatcgaacaatacaatgtcttggatgatattattaaacctctgCAGGgaagagctcgaa | Vir2 | 1 | MNTIKQSFSTSKLEGYTKQLPSPAPGICLPAGKVVPHTTFEVIEQYNVLDDIIKPL 24 | 4314 | 4315 | A0ES13 | VACV-DUKE-172 |   |   | Vaccinia virus | VACV-DUKE-172 | MNTIKQSFSTSKLEGYTKQLPSPAPGICLPAGKVVPHTTFEVIEQYNVLDDIIKPLSNQPIFEGPSGVKWFDIKEKENEHREYRIYFIKENSIYSFDTKSKQTRSSQVDARLFSVMVTSKPLFIADIGIGVGMPQMKKILKM | Vaccinia virus | 17 | 1 | 84 | 25 | aTGAATTCGGAGCGGTttgccagcaggtaaggtggtgcctcacacaacattcgaggttatcgaacaatacaatgtcttggatgatattattaaacctctgtccaatcaacctattttcgaaggtccatctggcgttaagtggttcgacatcaaggaaaaggaaaatgagcatcgcgagtaccgcCACTGCACTCGAGACa | Vir2 | 29 | LPAGKVVPHTTFEVIEQYNVLDDIIKPLSNQPIFEGPSGVKWFDIKEKENEHREYR 25 | 4315 | 4316 | A0ES13 | VACV-DUKE-172 |   |   | Vaccinia virus | VACV-DUKE-172 | MNTIKQSFSTSKLEGYTKQLPSPAPGICLPAGKVVPHTTFEVIEQYNVLDDIIKPLSNQPIFEGPSGVKWFDIKEKENEHREYRIYFIKENSIYSFDTKSKQTRSSQVDARLFSVMVTSKPLFIADIGIGVGMPQMKKILKM | Vaccinia virus | 17 | 1 | 112 | 26 | 
aGGAATTCCGCTGCGTtccaatcaacctattttcgaaggtccatctggcgttaagtggttcgacatcaaggaaaaggaaaatgagcatcgcgagtaccgcatctacttcatcaaagaaaatagtatctacagcttcgacaccaaatctaagcaaacacgcagtagccaagtcgatgcacgccttCAGGgaagagctcgaa | Vir2 | 57 | SNQPIFEGPSGVKWFDIKEKENEHREYRIYFIKENSIYSFDTKSKQTRSSQVDARL
jgallowa07 commented 10 months ago

Hello @bkellman - sorry for the confusion, and late reply here.

I'm not sure I know the exact reason for the low alignment, but I have a good idea. The peptide table documentation clearly states:

Currently, only upper case oligonucleotides will be included as part of the reference index when aligning the reads. Historically, we have encoded the barcodes with lower case letters.

So I have a hunch that's your main issue. It's also worth noting that your --peptide_tile_length (renamed to --oligo_tile_length in V1.12) is longer than your read length. That doesn't make much sense unless you're only looking for partial alignments. We use the more conventional practice of end-to-end alignment, meaning the alignments should be found within the 5' end of the read.
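To make the case convention concrete, here is a minimal Python sketch. It assumes the pipeline keeps exactly the upper-case characters of each "oligo" entry when building the reference; that is a reading of the documentation sentence quoted above, not a copy of the phip-flow code. The helper names and the shortened oligo are illustrative only. In the table posted above, the peptide-encoding bases appear in lower case and parts of the flanking sequence in upper case, so under this assumption only flanking bases would reach the index.

```python
import re


def indexed_bases(oligo: str) -> str:
    """Return the upper-case bases, i.e. the part assumed to enter the index."""
    return "".join(base for base in oligo if base.isupper())


def coding_region_upper(oligo: str) -> str:
    """Upper-case the lower-case run that sits between two upper-case flanks.

    Assumes the layout seen in the posted table: flank, coding region, flank.
    """
    match = re.search(r"[A-Z]+([a-z]+)[A-Z]+", oligo)
    if match is None:
        raise ValueError("oligo does not follow the flank/coding/flank layout")
    return match.group(1).upper()


# Shortened, illustrative oligo in the same style as the posted table.
oligo = "aGGAATTCCGCTGCGTatgcgcagcttgctgCAGGgaagagctcgaa"

print(indexed_bases(oligo))        # only the upper-case flanking bases
print(coding_region_upper(oligo))  # the peptide-encoding bases, upper-cased
```

If the check confirms that the coding region is lower case, rewriting the "oligo" column with something like `coding_region_upper` before running the pipeline would be one way to test the hunch.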

I'll close this as it's not necessarily a bug - but feel free to re-open or submit a new issue if you find behavior that contradicts what's in the detailed description of the alignment approach, and that I can reproduce on my end.

Thanks!

bkellman commented 9 months ago

I spoke with my bench colleague. Their understanding of the original Elledge protocol is that we should sequence the first 75 nt of a 168 nt insert (60 aa) and then align only the first 50 nt. To describe that, I was thinking --read_length 50 --peptide_tile_length 168 would be appropriate. Does that make sense? We could set read_length to zero, but that seems odd, no?

Could we maybe find some time to chat? It seems like we have some different assumptions.

jgallowa07 commented 9 months ago

The alignment script is linked in the docs - looking at that might clarify things. Note that the oligo_tile_length parameter is used to set the "seed length" in bowtie's -n alignment mode.

the original elledge protocol was that we should be sequencing the first 75nt of a 168nt insert (60aa) then align only the first 50nt.

It would be good to specify which paper you're referring to and point to the specific text, but I assume you mean the 2015 VirScan paper, as the "original" 2011 paper was pre-VirScan. Either way, they point to the same methods from the 2011 paper, in which they state:

"The reference sequences were truncated to the length of the reads and alignment was constrained to the appropriate strand."

So if your library was designed for peptides of length 60 aa, and you want to perform straightforward, end-to-end alignment of reads to the first 50 nt that encode that peptide, then you could trim your reads, as well as the sequences in the "oligo" column, to the first 50 nt. Then you would set both oligo_tile_length and read_length to 50.
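As a rough illustration of trimming the "oligo" column, the sketch below truncates every entry to its first n bases using only the standard library. It assumes the adapters have already been removed, so that the first n characters are peptide-encoding bases. The function name and the tiny example table are hypothetical; only the "id" and "oligo" column names come from the table posted above.

```python
import csv
import io


def trim_oligo_column(table_text: str, n: int = 50, column: str = "oligo") -> str:
    """Return the CSV text with every value in `column` cut to its first n characters."""
    reader = csv.DictReader(io.StringIO(table_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = row[column][:n]  # keep only the 5' end of the oligo
        writer.writerow(row)
    return out.getvalue()


# Tiny made-up table; real oligos would be far longer than 10 bases.
table = "id,oligo\n0,ACGTACGTAC\n1,TTTTGGGGCC\n"
print(trim_oligo_column(table, n=4))
```

The reads would need the same treatment (for example `read[:n]` on each sequence and quality string) so that reads and reference have equal length.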

However, trimming is not really necessary because we use bowtie's -n alignment mode. Again, oligo_tile_length is used for the "seed" parameter (-l in bowtie), so even if you don't trim, the policy is (as stated in the bowtie docs):

"Alignments may have no more than N mismatches (where N is a number 0-3, set with -n) in the first L bases (where L is a number 5 or greater, set with -l) on the high-quality (left) end of the read. The first L bases are called the “seed”."

Confusingly, this means that oligo_tile_length can be thought of as the effective length (50 in your case) of the peptide-encoding oligo for alignment purposes only (otherwise, why would the pipeline care how long the full tile is?).
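The seed rule quoted above can be sketched in a few lines of Python. This is a simplified model rather than bowtie's implementation: it compares a read to a single reference at offset 0, ignores the quality-based limits that bowtie also applies, and assumes that a seed length larger than the read simply covers the whole read. The function names are made up for this example.

```python
def seed_mismatches(read: str, reference: str, seed_length: int) -> int:
    """Count mismatches within the first `seed_length` bases of the read."""
    seed = read[:seed_length]
    # zip stops at the shorter sequence, so a short read is compared in full.
    return sum(1 for r, t in zip(seed, reference) if r != t)


def passes_seed_rule(read: str, reference: str, n: int, seed_length: int) -> bool:
    """True if the seed has at most n mismatches (n in 0..3, seed_length >= 5)."""
    if not 0 <= n <= 3:
        raise ValueError("n must be between 0 and 3")
    if seed_length < 5:
        raise ValueError("seed_length must be 5 or greater")
    return seed_mismatches(read, reference, seed_length) <= n


read = "ACGTACGTAC"
print(passes_seed_rule(read, "ACGTTCGTAC", n=0, seed_length=5))  # mismatch inside the seed
print(passes_seed_rule(read, "ACGTACGTAA", n=0, seed_length=5))  # mismatch outside the seed
```

In this model, raising the seed length widens the region in which mismatches count against the limit, which is why oligo_tile_length behaves like an effective alignment length.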

We could set read length to zero but that seems odd, no?

from the docs:

Ideally, the --read_length is the same as the length specified by the --oligo_tile_length parameter. If the read_length is greater than the oligo_tile_length, we use bowtie’s --trim3 parameter to trim the reads on the 3’ end to match the --oligo_tile_length.

However, if the read_length parameter is set to less than oligo_tile_length, then it's essentially ignored. Thus, 0 is perfectly reasonable for partial alignments. I realize these parameter names are confusing in this case, but the pipeline was originally designed for libraries in which the reads were longer than the oligos, in which case the parameter names make sense to me.
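The interaction between the two parameters, as described in the docs excerpt and the paragraph above, comes down to a small piece of arithmetic. The sketch below is an interpretation of that description, not the pipeline's code; the function names are hypothetical, and the exact bowtie command line that phip-flow assembles may differ.

```python
def trim3_bases(read_length: int, oligo_tile_length: int) -> int:
    """Bases to trim from the 3' end so reads match the tile length.

    When read_length is not greater than oligo_tile_length, nothing is trimmed,
    which is why read_length is effectively ignored in that case.
    """
    return max(read_length - oligo_tile_length, 0)


def bowtie_length_options(read_length: int, oligo_tile_length: int) -> list:
    """Assumed mapping of the two parameters onto bowtie's -l and --trim3 options."""
    return [
        "-l", str(oligo_tile_length),
        "--trim3", str(trim3_bases(read_length, oligo_tile_length)),
    ]


print(trim3_bases(75, 168))           # reads shorter than the tile
print(trim3_bases(0, 50))             # read_length set to zero
print(bowtie_length_options(75, 50))  # reads longer than the tile
```

Under this reading, --read_length 75 --peptide_tile_length 168 and --read_length 0 with the same tile length lead to the same trimming, namely none.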

Could we maybe find some time to chat? It seems like we have some different assumptions.

I'm happy to hop on a call, but I am quite busy working on other projects, so next week is probably the soonest I could do. Before scheduling a call, I would appreciate it if you could try trimming the adapters from the "oligo" column (as discussed in my last response) and setting oligo_tile_length to 50.

bkellman commented 9 months ago

@jgallowa07 thanks for your help. Changing oligo_tile_length and read_length had no effect, but changing the 5' and total trim using Trimmomatic resolved the issue nicely. In the figure below, I used 8 test samples (including 2 or 3 bead-only controls) to test n-mismatch (2-3), read_length (0-50), oligo_tile_length (50-1680), 5' trim (HEADCROP 0-25), and final read length (CROP 50-75). It looks to me like the primary impact comes from the cropping. This is consistent with what you said, but the parameters we discussed don't appear to affect the alignment as expected. Would you be open to you or me adding a Trimmomatic module?

image
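For readers unfamiliar with the two Trimmomatic steps mentioned above, the following Python sketch mimics what HEADCROP followed by CROP does to a single read: drop a fixed number of bases from the 5' end, then keep at most a fixed number of the remaining bases. It is only a model of those two steps for illustration; the function name is hypothetical and the values 16 and 50 are arbitrary picks, not the settings that resolved the issue.

```python
def headcrop_then_crop(seq: str, qual: str, headcrop: int, crop: int):
    """Apply a HEADCROP-like step, then a CROP-like step, to one FASTQ record.

    headcrop: number of bases removed from the start of the read.
    crop: maximum read length kept after the head has been removed.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    trimmed_seq = seq[headcrop:][:crop]
    trimmed_qual = qual[headcrop:][:crop]
    return trimmed_seq, trimmed_qual


# Made-up 76-base read: 16 leading bases to discard, then 60 bases of insert.
seq = "A" * 16 + "C" * 60
qual = "I" * 76
new_seq, new_qual = headcrop_then_crop(seq, qual, headcrop=16, crop=50)
print(len(new_seq), new_seq[:5])
```

A trimming module in the pipeline would apply the same operation to every read before alignment, so that each read starts at the first peptide-encoding base.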