jonathonthill / sangerseqR

git for the Bioconductor package "sangerseqR"

ABIF file with lowercase PBAS.2 field #8

Open project-defiant opened 4 months ago

project-defiant commented 4 months ago

Hello, thank you for developing and maintaining this package. I have run into an issue with some specific ABIF files. The file is read correctly by sangerseqR::read.abif, but converting the abif object to a sangerseq object produces an empty sequence. The cause is the lowercase letters in the PBAS.2 field.

$PBAS.1
[1] "NNNNNNNNNNNNNNNNNNNNNNNNNNTTCGNTNNNTTAATTNAACATAGACCATCAAGATAATCTGGAACTGACACTTTGATTTTTTCGTCCATTCTGTAACGTCCCACAAACAACTGNNCCACGGNGANGCTNNNNNAANNTCTNTTNNNNNCTTNNNNNNNNTGAAGGNANNTGNNNGANGANNNTNNATGANANTGACNNANANNNANNNNCCNGNNANNTCCTGGTANNNNNTTNNNNNNNNNNNNNTTTNCANTNNNNNNNNNNNANTTTCNNANNNNNNNNNGNTGNTNNCNNNANGANCNNNNANNNNANNNNNNNNCNNGNTANTCNNNNNNNNNNNNNNNNNNNNAn"

$PBAS.2
[1] "nnnnnnnnnnnnnnnnnnnnnnnnnnttcgntnnnttaattnaacatagaccatcaagataatctggaactgacactttgattttttcgtccattctgtaacgtcccacaaacaactgnnccacggngangctnnnnnaanntctnttnnnnncttnnnnnnnntgaaggnanntgnnngangannntnnatganantgacnnanannnannnnccngnnanntcctggtannnnnttnnnnnnnnnnnnntttncantnnnnnnnnnnnantttcnnannnnnnnnngntgntnncnnnangancnnnnannnnannnnnnnncnngntantcnnnnnnnnnnnnnnnnnnnna"

These are the PBAS sequences extracted from the sequence file. I could not find much of a specification for these fields, apart from the fact that PBAS.1 is the edited sequence and PBAS.2 is the raw sequence.

Could you provide some feedback on the assumptions behind using the PBAS.2 field and comparing it to the DNA_ALPHABET object? https://github.com/jonathonthill/sangerseqR/blob/3664259cb33737daf8d5b3b20c0a9d9a22ce470d/R/sangerseqmethods.R#L39
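To illustrate the suspected mechanism, here is a minimal base-R sketch. It assumes the conversion keeps only characters that appear in an uppercase alphabet such as DNA_ALPHABET. The `dna_alphabet` vector is a local stand-in so the snippet runs without Bioconductor, `keep_known` is a hypothetical helper rather than a function from sangerseqR, and the two strings are short excerpts in the style of the fields above.

```r
# Stand-in for Biostrings::DNA_ALPHABET (uppercase IUPAC codes plus gap symbols),
# defined locally so the example runs without Bioconductor installed.
dna_alphabet <- c("A", "C", "G", "T", "M", "R", "W", "S", "Y", "K",
                  "V", "H", "D", "B", "N", "-", "+", ".")

# Short excerpts in the style of the two fields (not the full sequences).
pbas1 <- "NNTTCGNTNNNTTAATTNAACATAGACC"
pbas2 <- "nnttcgntnnnttaattnaacatagacc"

# Hypothetical filter: keep only the characters found in the alphabet.
keep_known <- function(x) {
  chars <- strsplit(x, "")[[1]]
  paste(chars[chars %in% dna_alphabet], collapse = "")
}

keep_known(pbas1)         # every uppercase character matches and is kept
nchar(keep_known(pbas2))  # lowercase characters never match, so the length is 0
```

If the package filters base calls this way, an all-lowercase PBAS.2 would yield an empty primary sequence even though read.abif parsed the file correctly.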

Best regards, Szymon Szyszkowski

project-defiant commented 4 months ago

FYI @jelletenhoeve

jonathonthill commented 3 months ago

Thanks for pointing this out. I must have code requiring uppercase letters somewhere. I will look into submitting a fix.

On May 27, 2024, at 12:56 AM, Szymon Szyszkowski wrote:


project-defiant commented 2 months ago

@jonathonthill, are there any updates on this issue? I have opened a PR with what I assume is the fix.
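For reference, one possible shape of such a fix is to normalize case before filtering. This is only a sketch under assumptions and not necessarily what the PR or the maintainer's eventual change does: `normalize_basecalls` is a hypothetical helper, and `dna_alphabet` is a local stand-in for Biostrings::DNA_ALPHABET so the snippet runs in base R.

```r
# Stand-in for Biostrings::DNA_ALPHABET, defined locally (assumption).
dna_alphabet <- c("A", "C", "G", "T", "M", "R", "W", "S", "Y", "K",
                  "V", "H", "D", "B", "N", "-", "+", ".")

# Hypothetical helper: upper-case the base calls first, then drop any
# character that is still not part of the alphabet.
normalize_basecalls <- function(x) {
  chars <- strsplit(toupper(x), "")[[1]]
  paste(chars[chars %in% dna_alphabet], collapse = "")
}

normalize_basecalls("nnttcgntnnnttaattn")  # lowercase input survives as uppercase
normalize_basecalls("ac?gt")               # unknown characters are still removed
```

Upper-casing discards whatever meaning the file writer attached to lowercase calls, so whether that is acceptable depends on how PBAS.2 is meant to be interpreted.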